Attention, Exactly

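Every step of this attention example is exact arithmetic until a softmax over more than one entry appears. A one-entry softmax row is exactly 1; the two-entry row is kept symbolic, written with exponentials rather than rounded to decimals.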

The Q, K, and V projections are identity matrices, so each displayed vector passes through unchanged and serves as its own query, key, and value. Under the mask, position zero can attend only to itself.

Q=K=V=I
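The identity setup can be checked with plain integer arithmetic. The sketch below is illustrative only: the input vectors (1,0) and (1,1) are assumptions chosen so that the scores come out as 1 for position zero and 1, 2 for position one, and the scores are taken to be unscaled dot products. The helper names are hypothetical.

```python
# Assumed inputs (not stated in the text): chosen so the scores match the example.
X = [[1, 0],   # position zero
     [1, 1]]   # position one
IDENTITY = [[1, 0], [0, 1]]

def matmul(a, b):
    """Plain integer matrix product, so every entry stays exact."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# With identity projections, Q, K, and V are all equal to the input.
Q = matmul(X, IDENTITY)
K = matmul(X, IDENTITY)
V = matmul(X, IDENTITY)

def masked_scores(q, k):
    """Unscaled dot products; None marks a masked entry (key position after the query)."""
    n = len(q)
    return [[sum(a * b for a, b in zip(q[i], k[j])) if j <= i else None
             for j in range(n)] for i in range(n)]

scores = masked_scores(Q, K)
print(Q == X, scores)
```

Row zero of `scores` has a single unmasked entry, which is why position zero can attend only to itself.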

One-entry softmax

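Position zero has a single unmasked score, 1. A softmax over one entry is exactly 1 whatever the score, so the attention weight is exactly 1. The output is that weight times the value vector at position zero, so attn zero is (1,0).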

S_{\text{zero,zero}}=1,\quad \operatorname{softmax}([1])=[1]
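A short sketch of why the one-entry case is exact: the single exponential divides by itself, so the weight is 1 for any score. The `softmax` helper below is a generic max-subtracted implementation written for illustration, not code from this walkthrough, and the value vector (1,0) is the one implied by the stated output.

```python
import math
from fractions import Fraction

def softmax(scores):
    """Standard softmax with max subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# A one-entry softmax is exactly 1.0 for any score: exp(0) / exp(0).
one_entry_results = [softmax([s]) for s in (1.0, -50.0, 1000.0)]

# Weight 1 times the value vector (1, 0), kept exact with Fraction.
weight = Fraction(1)
value_zero = (Fraction(1), Fraction(0))
attn_zero = tuple(weight * v for v in value_zero)
```

Because the weight is exactly 1, the output equals the value vector with no rounding at all.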

Position one has two unmasked scores, 1 and 2. Its softmax is written symbolically in terms of exponentials rather than rounded to decimals.

\operatorname{softmax}([1,2])=\left[\frac{e^{1}}{e^{1}+e^{2}},\ \frac{e^{2}}{e^{1}+e^{2}}\right]
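One way to keep the two-entry row symbolic in code is to build the weights as expressions in e and evaluate them numerically only as a check. This is a minimal sketch using the standard library; `named_softmax` and `numeric_softmax` are hypothetical helper names, and the string format is an arbitrary choice.

```python
import math

def named_softmax(scores):
    """Return each weight as a string of exponentials, with no decimal rounding."""
    denom = " + ".join(f"e^{s}" for s in scores)
    return [f"e^{s} / ({denom})" for s in scores]

def numeric_softmax(scores):
    """Floating-point softmax, used here only to check the symbolic form."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

named = named_softmax([1, 2])
approx = numeric_softmax([1, 2])

# Dividing numerator and denominator by e gives the simplified
# forms 1 / (1 + e) and e / (1 + e).
simplified = [1 / (1 + math.e), math.e / (1 + math.e)]
```

The two weights sum to 1, and neither is a rational number, which is why this row cannot be written exactly without the exponentials.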