The full grid combines scores, mask, and named softmax in one validated scene. It keeps the exact register (integer scores and mask cells) and the named register (unevaluated softmax expressions) visible at the same time.


The attention grid

The full grid shows the three pieces together: exact dot-product scores, the causal mask, and named row softmax.

$$\text{scores}\rightarrow\text{mask}\rightarrow\text{named softmax}$$
Figure: Attention grid. Exact scores, exact causal mask, named row softmax.

The displayed integer Q and K vectors are the score source; the causal mask keeps j <= i.

Q1 = (1,0); Q2 = (0,1); Q3 = (1,1)
K1 = (1,0); K2 = (0,1); K3 = (1,1)

Score S_ij = Qi·Kj with causal mask:

      K1      K2      K3
Q1    1       masked  masked
Q2    0       1       masked
Q3    1       1       2

Row softmax over unmasked scores:

row 1: scores [1], weights 1
row 2: scores [0,1], weights e^0/(e^0+e^1), e^1/(e^0+e^1)
row 3: scores [1,1,2], weights e^1/(e^1+e^1+e^2), e^1/(e^1+e^1+e^2), e^2/(e^1+e^1+e^2)

Softmax is named: no decimal attention weights are pinned.
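The score-and-mask step can be sketched in a few lines of Python. This is a minimal illustration, not the page's own implementation: the names `dot`, `causal_scores`, and the `MASKED` sentinel are assumptions introduced here, and only the Q and K vectors come from the grid.

```python
# Query and key vectors as displayed in the grid.
Q = [(1, 0), (0, 1), (1, 1)]
K = [(1, 0), (0, 1), (1, 1)]

MASKED = None  # hypothetical sentinel for a masked cell


def dot(a, b):
    """Integer dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def causal_scores(queries, keys):
    """Return S[i][j] = Q_i . K_j where j <= i, and MASKED where j > i."""
    return [
        [dot(q, k) if j <= i else MASKED for j, k in enumerate(keys)]
        for i, q in enumerate(queries)
    ]


scores = causal_scores(Q, K)
for row in scores:
    print(["masked" if s is MASKED else s for s in row])
```

Because the vectors are integers, every unmasked score is an exact integer, which is what lets the grid show them without rounding.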

Reading the grid

The first row has weight 1 because it has only one allowed key, so its softmax has a single term. The other rows keep their weights in exponential form, such as e^0/(e^0+e^1), so no decimal softmax value is shown.

$$\text{row }1\text{ weight }1$$
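One way to keep softmax "named" is to build the weights as unevaluated strings instead of floats. The sketch below assumes a hypothetical helper `named_softmax` that takes the unmasked scores of one row; it is an illustration of the idea, not the page's code.

```python
def named_softmax(row_scores):
    """Return one row's softmax weights as unevaluated expressions."""
    if len(row_scores) == 1:
        # A single allowed key always receives the whole weight.
        return ["1"]
    denominator = "+".join(f"e^{s}" for s in row_scores)
    return [f"e^{s}/({denominator})" for s in row_scores]


# Unmasked scores for each row of the grid.
rows = [[1], [0, 1], [1, 1, 2]]
for number, row in enumerate(rows, start=1):
    print(f"row {number}:", ", ".join(named_softmax(row)))
```

The strings stay exact, so nothing in the output depends on floating-point rounding.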

Summary

The diagram is self-contained: Q and K are visible, scores recompute from them, mask cells are structural, and softmax stays named.

$$\text{displayed source}\rightarrow\text{validated grid}$$
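A validation pass of this kind might look like the following sketch. The function names and the use of `None` for masked cells are assumptions made for illustration; the page does not say how its validation is implemented. Floats appear only inside the normalization check and are never displayed.

```python
import math

Q = [(1, 0), (0, 1), (1, 1)]
K = [(1, 0), (0, 1), (1, 1)]

# The grid as displayed; None marks a masked cell (hypothetical encoding).
DISPLAYED = [
    [1, None, None],
    [0, 1, None],
    [1, 1, 2],
]


def validate_grid(queries, keys, displayed):
    """Check that each cell matches Q_i . K_j for j <= i and is masked for j > i."""
    for i, q in enumerate(queries):
        for j, k in enumerate(keys):
            cell = displayed[i][j]
            if j > i:
                # Mask cells are structural: they depend only on position.
                if cell is not None:
                    return False
            elif cell != sum(a * b for a, b in zip(q, k)):
                return False
    return True


def rows_normalize(displayed):
    """Check numerically that each row's softmax weights sum to 1."""
    for row in displayed:
        exps = [math.exp(s) for s in row if s is not None]
        total = sum(exps)
        if not math.isclose(sum(e / total for e in exps), 1.0):
            return False
    return True


print(validate_grid(Q, K, DISPLAYED), rows_normalize(DISPLAYED))
```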