The final normalized vector unembeds to exact logits. Greedy argmax then selects the next token exactly.


Final layernorm

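The second layernorm takes residual2 = (2,-1) as input. Its mean is 1/2, so the centered vector is (3/2,-3/2), with variance 9/4 and std 3/2. Dividing the centered vector by the std gives the exact output (1,-1).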

$$\operatorname{LN}_{\text{two}}=(1,-1)$$
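The arithmetic for this step can be checked with rational numbers. The following is a minimal sketch, not the page's own implementation: it assumes population variance, no epsilon term, and gamma=(1,1), beta=(0,0) as in the figure. The helper names `exact_sqrt` and `layernorm` are hypothetical.

```python
from fractions import Fraction
from math import isqrt


def exact_sqrt(q):
    """Return sqrt(q) as a Fraction, or raise if it is not rational.

    Hypothetical helper: a Fraction is stored in lowest terms, so q is a
    rational square exactly when numerator and denominator are both squares.
    """
    n, d = q.numerator, q.denominator
    rn, rd = isqrt(n), isqrt(d)
    if rn * rn != n or rd * rd != d:
        raise ValueError(f"sqrt({q}) is not rational; it would stay symbolic")
    return Fraction(rn, rd)


def layernorm(x):
    """Layernorm with gamma=1, beta=0, population variance, no epsilon (assumed)."""
    xs = [Fraction(v) for v in x]
    mean = sum(xs) / len(xs)
    centered = [v - mean for v in xs]
    var = sum(c * c for c in centered) / len(xs)
    std = exact_sqrt(var)
    return [c / std for c in centered], mean, var, std


# residual2 = (2, -1) is the input to the second layernorm.
out, mean, var, std = layernorm([2, -1])
print(mean, var, std, out)
```

Raising when the square root is irrational mirrors the idea that the value stays exact only while the variance is a perfect rational square.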
Figure: The next token. Exact final layernorm, logits, and argmax.

tiny transformer exact-or-named forward
discrete spine exact; softmax/layernorm sqrt become named only at the boundary
input: a b; E[a]=(1,0), E[b]=(0,1), E[c]=(1,1); P0=(0,0), P1=(1,0)
weights: Wq=Wk=Wv=I; MLP=I+ReLU+I; gamma=(1,1), beta=(0,0); unembed tied to E

position 0: fully exact path
x0=(1,0); Q0=K0=V0=(1,0)
score S00=1; softmax=[1] exact
attn0=(1,0); residual1=(2,0)
ln1 mean=1; centered=(1,-1); var=1; std=1
ln1 output=(1,-1)
MLP ReLU=(1,0); mlp=(1,0)
residual2=(2,-1)
ln2 mean=1/2; centered=(3/2,-3/2); var=9/4; std=3/2
ln2 output=(1,-1)
logits: a=1, b=-1, c=0
argmax=a; output token=a

position 1: named softmax boundary
x1=(1,1); Q1=(1,1)
K0=(1,0); K1=(1,1)
scores=[1, 2]
softmax=[e^1/(e^1+e^2), e^2/(e^1+e^2)]
named after multi-entry softmax

ordered pipeline
tokens -> embed -> +pos -> attention -> +residual -> layernorm -> MLP -> +residual -> layernorm -> unembed -> logits -> argmax -> output token
pos0 remains exact; pos1 stops at named softmax

Tied unembed

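The tied unembed reuses the embedding table: it takes the dot product of the normalized vector (1,-1) with each token embedding, E[a]=(1,0), E[b]=(0,1), E[c]=(1,1). The logits are a=1, b=-1, c=0.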

$$\ell_a=1,\quad \ell_b=-1,\quad \ell_c=0$$
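As a quick check of the three dot products, here is a small sketch using the embeddings listed in the figure. The variable names `E`, `h`, and `logits` are illustrative only.

```python
# Token embeddings from the figure; the unembed is tied, so it reuses E.
E = {"a": (1, 0), "b": (0, 1), "c": (1, 1)}

# Output of the final layernorm at position 0.
h = (1, -1)

# Each logit is the dot product of h with that token's embedding.
logits = {tok: sum(hi * ei for hi, ei in zip(h, emb)) for tok, emb in E.items()}
print(logits)  # {'a': 1, 'b': -1, 'c': 0}
```

Because every entry is an integer, the logits are exact without any rounding.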

The next token

Greedy decoding takes the largest logit, breaking ties in favor of the lowest index. The largest logit is 1, which belongs to a, so the output token is a.

$$\operatorname{argmax}\{a:1,\ b:-1,\ c:0\}=a$$
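The tie-break rule is easy to get wrong, so here is a minimal sketch of greedy selection. It assumes the vocabulary is indexed in the order a, b, c, which the text does not state explicitly; `greedy_argmax` is a hypothetical name.

```python
def greedy_argmax(vocab, logits):
    """Return the token with the largest logit; ties go to the lowest index."""
    best = 0
    for i in range(1, len(logits)):
        # Strict '>' means a later equal logit never replaces an earlier one.
        if logits[i] > logits[best]:
            best = i
    return vocab[best]


vocab = ["a", "b", "c"]  # assumed index order
print(greedy_argmax(vocab, [1, -1, 0]))  # a
print(greedy_argmax(vocab, [0, 1, 1]))   # b (tie between b and c goes to b)
```

The second call is not part of the walkthrough; it only illustrates the tie-break.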