Softmax is the first non-exact operation in this book. A row with a single unmasked entry pins to exactly 1, while rows with several entries are shown as named exponential expressions.
Softmax is the named boundary
Softmax turns the unmasked scores in a row into weights using exponentials. The scores and the mask are exact, but the exponential normalization is left as a named expression instead of being evaluated to decimals.
$$\operatorname{softmax}(s)_i=\frac{e^{s_i}}{\sum_j e^{s_j}}$$
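The formula above can be kept in named form by building each weight as an expression string instead of a number. The following is a minimal sketch, not the book's own implementation: the function name `named_softmax` and the string notation `e^s/(...)` are assumptions chosen to match the diagram's labels, and `scores` is assumed to hold only the unmasked integer scores of one row.

```python
def named_softmax(scores):
    """Return softmax weights for one row as named expressions (strings).

    Assumption for this sketch: `scores` holds only the unmasked integer
    scores of the row. No exponential is ever evaluated to a decimal.
    """
    if not scores:
        raise ValueError("a row needs at least one unmasked score")
    if len(scores) == 1:
        # e^s / e^s cancels, so a single-entry row pins to exactly 1.
        return ["1"]
    denominator = "+".join(f"e^{s}" for s in scores)
    return [f"e^{s}/({denominator})" for s in scores]


for row in ([1], [0, 1], [1, 1, 2]):
    print(row, "->", named_softmax(row))
```

Only the single-entry case returns a pinned value; every other row keeps its numerator and shared denominator visible.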
Figure: Softmax named boundary. Only the single-entry row pins to one; other rows stay named. (Exact scores, exact causal mask, named softmax.)

The displayed Q and K integer vectors are the score source; the causal mask keeps j <= i.
Q1=(1,0); Q2=(0,1); Q3=(1,1)
K1=(1,0); K2=(0,1); K3=(1,1)

Score S_ij = Qi·Kj with causal mask:
      K1      K2      K3
Q1    1       masked  masked
Q2    0       1       masked
Q3    1       1       2

Row softmax over unmasked scores:
row 1: scores [1], weights 1
row 2: scores [0,1], weights e^0/(e^0+e^1), e^1/(e^0+e^1)
row 3: scores [1,1,2], weights e^1/(e^1+e^1+e^2), e^1/(e^1+e^1+e^2), e^2/(e^1+e^1+e^2)

Softmax is named: no decimal attention weights are pinned.
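The score table in the figure can be reproduced with integer arithmetic alone. This is a sketch under stated assumptions: the helper name `masked_scores` and the use of `None` to stand for a masked cell are illustrative choices, not taken from the source. The Q and K vectors and the rule j <= i are the ones shown in the figure.

```python
Q = [(1, 0), (0, 1), (1, 1)]
K = [(1, 0), (0, 1), (1, 1)]

MASKED = None  # assumption: None marks a cell removed by the causal mask


def masked_scores(queries, keys):
    """Integer dot products S_ij = Qi . Kj, keeping only positions j <= i."""
    table = []
    for i, q in enumerate(queries):
        row = []
        for j, k in enumerate(keys):
            if j <= i:
                row.append(sum(a * b for a, b in zip(q, k)))
            else:
                row.append(MASKED)
        table.append(row)
    return table


S = masked_scores(Q, K)
for row in S:
    print(row)
```

Every entry is an exact integer or a mask marker, so nothing in this stage needs rounding.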
The one exact softmax row
Row 1 has one unmasked score, so its only weight is exactly 1. Row 2 has two unmasked scores, 0 and 1, so its weights stay named: e^0/(e^0+e^1) and e^1/(e^0+e^1), which share the same denominator.
$$\text{row }1: 1,\quad \text{row }2: \frac{e^{0}}{e^{0}+e^{1}},\ \frac{e^{1}}{e^{0}+e^{1}}$$
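Although multi-entry rows are not pinned to decimals, their structure still allows one exact statement: the weights of a row share a denominator, and the numerator terms are the same terms that make up that denominator, so the row sums to 1. The sketch below checks this without evaluating e. The pair representation `(numerator_exponent, denominator_exponents)` and the function names are assumptions made for illustration.

```python
from collections import Counter


def named_row(scores):
    """Weights as (numerator_exponent, denominator_exponents) pairs.

    Each pair stands for e^n / (e^d1 + e^d2 + ...); e is never evaluated.
    """
    denominator = tuple(scores)
    return [(s, denominator) for s in scores]


def sums_to_one(weights):
    """Exact check: one shared denominator whose terms match the numerators."""
    denominators = {d for _, d in weights}
    if len(denominators) != 1:
        return False
    (denominator,) = denominators
    numerators = Counter(n for n, _ in weights)
    return numerators == Counter(denominator)


for scores in ([1], [0, 1], [1, 1, 2]):
    print(scores, sums_to_one(named_row(scores)))
```

For row 1 the same check reduces to e^1/e^1, which is why that row alone can be written as the number 1.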
Summary
Softmax marks the boundary of exactness in this book: the scores and the causal mask are exact, while softmax weights stay named. The diagram renders the algebraic structure of softmax and rejects any float weight.
$$\text{named softmax; no decimal weights}$$
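The rule that no decimal weight may appear can be expressed as a small guard. The source does not say how the diagram enforces this, so the function below is purely hypothetical: `require_named` is an invented name, and it assumes weights arrive either as the integer 1 or as named expression strings.

```python
def require_named(weights):
    """Hypothetical guard: pass weights through unless one is a float."""
    for w in weights:
        if isinstance(w, float):
            raise TypeError(f"decimal weight {w!r} is not allowed; keep softmax named")
    return list(weights)


print(require_named([1]))
print(require_named(["e^0/(e^0+e^1)", "e^1/(e^0+e^1)"]))

try:
    require_named([0.5, 0.5])  # any float is rejected, whatever its value
except TypeError as err:
    print("rejected:", err)
```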