Softmax is the first non-exact operation in this book. The single-entry row pins to one, while multi-entry rows are shown as named exponential expressions.

Softmax is the named boundary

Softmax turns the unmasked scores in each row into weights using exponentials. The scores and the mask are exact, but the exponential normalization is kept as a named expression rather than evaluated to decimals.

$$\operatorname{softmax}(s)_i=\frac{e^{s_i}}{\sum_j e^{s_j}}$$

Figure: Softmax named boundary. Only the single-entry row pins to one; other rows stay named.

Exact scores, exact causal mask, named softmax. The displayed integer vectors Q, K are the score source; the causal mask keeps j <= i.

Q1=(1,0); Q2=(0,1); Q3=(1,1)
K1=(1,0); K2=(0,1); K3=(1,1)

Score S_ij = Qi·Kj with causal mask:

      K1      K2      K3
Q1    1       masked  masked
Q2    0       1       masked
Q3    1       1       2

Row softmax over unmasked scores:

row 1: scores [1], weights 1
row 2: scores [0,1], weights e^0/(e^0+e^1), e^1/(e^0+e^1)
row 3: scores [1,1,2], weights e^1/(e^1+e^1+e^2), e^1/(e^1+e^1+e^2), e^2/(e^1+e^1+e^2)

Softmax is named: no decimal attention weights are pinned.
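
The figure's pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the book's own implementation: the helper names `dot`, `causal_scores`, and `named_softmax` are hypothetical, and the weights are built as plain strings so that no exponential is ever evaluated to a float.

```python
# Integer query and key vectors as displayed in the figure.
Q = [(1, 0), (0, 1), (1, 1)]
K = [(1, 0), (0, 1), (1, 1)]


def dot(a, b):
    """Exact integer dot product."""
    return sum(x * y for x, y in zip(a, b))


def causal_scores(queries, keys):
    """Row i keeps only keys j <= i; masked entries are simply omitted."""
    return [[dot(q, keys[j]) for j in range(i + 1)]
            for i, q in enumerate(queries)]


def named_softmax(row):
    """Return softmax weights as named expressions (strings), never floats.

    Assumption for this sketch: a single-entry row is pinned to "1",
    every other row keeps its e^s terms unevaluated.
    """
    if len(row) == 1:
        return ["1"]
    denom = "+".join(f"e^{s}" for s in row)
    return [f"e^{s}/({denom})" for s in row]


scores = causal_scores(Q, K)
weights = [named_softmax(row) for row in scores]

for i, (row, w) in enumerate(zip(scores, weights), start=1):
    print(f"row {i}: scores {row} weights {', '.join(w)}")
```

Keeping the weights as strings is a deliberate simplification: it mirrors the figure's labels exactly and makes it impossible for a rounded decimal to slip in.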

The one exact softmax row

Row 1 has one unmasked score, so its only weight is exactly 1. Row 2 is named as e^0 over e^0 plus e^1, and e^1 over the same denominator.

$$\text{row }1: 1,\quad \text{row }2: \frac{e^{0}}{e^{0}+e^{1}},\ \frac{e^{1}}{e^{0}+e^{1}}$$
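
As a brief worked check, using only the scores already given: row 3 follows the same pattern as row 2, and a named row can be shown to sum to exactly one without evaluating any exponential.

$$\text{row }3: \frac{e^{1}}{e^{1}+e^{1}+e^{2}},\ \frac{e^{1}}{e^{1}+e^{1}+e^{2}},\ \frac{e^{2}}{e^{1}+e^{1}+e^{2}}$$

$$\frac{e^{0}}{e^{0}+e^{1}}+\frac{e^{1}}{e^{0}+e^{1}}=\frac{e^{0}+e^{1}}{e^{0}+e^{1}}=1$$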

Summary

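Softmax is the named boundary of this book: the scores and mask before it are exact, while softmax itself is kept as named exponential expressions. The diagram renders that algebraic structure and pins no decimal weight.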

$$\text{named softmax; no decimal weights}$$