The residual adds and ReLU MLP are exact, while the layernorm example shows where a general square root becomes named.
Through the block
The first residual add gives x0 + attn0 = (1,0) + (1,0) = (2,0). In this pass, layernorm sees mean 1, variance 1, and std 1, so it stays exact.
$$\operatorname{residual}_{\text{one}}=(2,0)$$
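A minimal sketch of this step in Python, assuming exact rational arithmetic with `fractions.Fraction`. The helper name `exact_sqrt` is hypothetical, and the population variance (divide by the vector length) is an assumption chosen because it reproduces the variance of 1 stated above. The values of `x0` and `attn0` come from the figure's position 0 path.

```python
from fractions import Fraction
from math import isqrt

def exact_sqrt(q):
    """Return the rational square root of q, or None when it is irrational.

    Hypothetical helper: a Fraction is stored in lowest terms, so q is a
    rational square exactly when numerator and denominator are both squares.
    """
    q = Fraction(q)
    rn, rd = isqrt(q.numerator), isqrt(q.denominator)
    if rn * rn == q.numerator and rd * rd == q.denominator:
        return Fraction(rn, rd)
    return None

# Position 0 values from the figure: x0 = E[a] + P0, attn0 = V0.
x0 = (Fraction(1), Fraction(0))
attn0 = (Fraction(1), Fraction(0))

# First residual add.
residual_one = tuple(a + b for a, b in zip(x0, attn0))

# Layernorm with gamma=(1,1), beta=(0,0); population variance is assumed.
mean = sum(residual_one) / len(residual_one)
centered = tuple(v - mean for v in residual_one)
var = sum(c * c for c in centered) / len(centered)
std = exact_sqrt(var)  # a Fraction here, so the step stays exact
ln1 = tuple(c / std for c in centered)

print(residual_one, var, std, ln1)
```

Because `std` comes back as a rational number, the division is exact and no named value is needed at this position.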
Figure: Through the block. Exact block path plus the general layernorm boundary.
Tiny transformer, exact-or-named forward pass. The discrete spine is exact; softmax and layernorm's square root become named only at the boundary.
Input: a b; E[a]=(1,0), E[b]=(0,1), E[c]=(1,1); P0=(0,0), P1=(1,0)
Weights: Wq=Wk=Wv=I; MLP=I+ReLU+I; gamma=(1,1), beta=(0,0); unembed tied to E
Position 0 (fully exact path):
  x0=(1,0); Q0=K0=V0=(1,0)
  score S00=1; softmax=[1] exact
  attn0=(1,0); residual1=(2,0)
  ln1 mean=1; centered=(1,-1); var=1; std=1
  ln1 output=(1,-1)
  MLP ReLU=(1,0); mlp=(1,0)
  residual2=(2,-1)
  ln2 mean=1/2; centered=(3/2,-3/2); var=9/4; std=3/2
  ln2 output=(1,-1)
  logits: a=1, b=-1, c=0
  argmax=a; output token=a
Position 1 (named softmax boundary):
  x1=(1,1); Q1=(1,1)
  K0=(1,0); K1=(1,1)
  scores=[1, 2]
  softmax=[e^1/(e^1+e^2), e^2/(e^1+e^2)], named after multi-entry softmax
Ordered pipeline: tokens -> embed -> +pos -> attention -> +residual -> layernorm -> MLP -> +residual -> layernorm -> unembed -> logits -> argmax -> output token
  pos0 remains exact; pos1 stops at named softmax
Layernorm sqrt boundary: this transformer's 2-D layernorms are exact here; in general layernorm's square root is named.
  Named 3-vector: v=(1,2,3); mean=2; centered=(-1,0,1); var=2/3; std=√(2/3); register=named; normalized=(-1,0,1)/√(2/3)
  Degenerate exact 3-vector: v=(2,2,2); mean=2; centered=(0,0,0); var=0; std=0; register=exact; normalized=zero-variance exact std, normalization not divided
Layernorm's square root
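In general, layernorm crosses a square-root boundary. The three-vector v=(1,2,3) has mean 2, centered values (-1,0,1), and variance 2/3. Since 2/3 is not the square of a rational number, the std is kept as the named value √(2/3) rather than as an exact number.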
$$\operatorname{var}=2/3,\quad \operatorname{std}=\sqrt{2/3}$$
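The exact-versus-named decision can be sketched as a small classifier over the variance. This is an illustration under stated assumptions, not the page's own implementation: the function name `std_register`, the `"exact"`/`"named"` labels as return values, and the string form `sqrt(2/3)` for a named value are all hypothetical, and the population variance is assumed.

```python
from fractions import Fraction
from math import isqrt

def std_register(values):
    """Classify a vector's std as exact (rational) or named (symbolic).

    Returns (centered, var, register, std). For the named register, std is
    a descriptive string rather than a number.
    """
    xs = [Fraction(v) for v in values]
    mean = sum(xs) / len(xs)
    centered = [x - mean for x in xs]
    var = sum(c * c for c in centered) / len(xs)  # population variance (assumed)

    # A Fraction is in lowest terms, so var is a rational square exactly
    # when both its numerator and denominator are perfect squares.
    rn, rd = isqrt(var.numerator), isqrt(var.denominator)
    if rn * rn == var.numerator and rd * rd == var.denominator:
        return centered, var, "exact", Fraction(rn, rd)
    return centered, var, "named", f"sqrt({var})"

named = std_register((1, 2, 3))       # the three-vector from the text
degenerate = std_register((2, 2, 2))  # zero variance: std is exactly 0
print(named)
print(degenerate)
```

In the zero-variance case the std is exact, but a caller would still have to skip the division, which matches the figure's note that the normalization is not divided.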
Exact MLP and second residual
ReLU keeps the positive coordinate of the layernorm output (1,-1) and zeroes the negative one, giving MLP output (1,0). Adding that to (1,-1) gives the second residual (2,-1).
$$\operatorname{MLP}=(1,0),\quad r_{\text{two}}=(2,-1)$$
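As a rough check of this step, the sketch below runs the identity, ReLU, identity MLP on the first layernorm output using exact fractions. The helper names `matvec` and `relu` are hypothetical. The residual is assumed to be added to the layernorm output (1,-1), because that is the combination that reproduces the stated (2,-1).

```python
from fractions import Fraction

def matvec(matrix, vector):
    """Multiply a matrix (tuple of rows) by a vector, staying in Fractions."""
    return tuple(sum(a * b for a, b in zip(row, vector)) for row in matrix)

def relu(vector):
    """Zero out negative coordinates; positive ones pass through unchanged."""
    return tuple(max(x, Fraction(0)) for x in vector)

# MLP = I + ReLU + I, as listed in the figure's weights.
identity = ((Fraction(1), Fraction(0)),
            (Fraction(0), Fraction(1)))

ln1 = (Fraction(1), Fraction(-1))       # first layernorm output
hidden = relu(matvec(identity, ln1))    # ReLU keeps 1, zeroes -1
mlp_out = matvec(identity, hidden)      # second identity layer

# Second residual: assumed to add the MLP output to the layernorm output.
r_two = tuple(a + b for a, b in zip(ln1, mlp_out))

print(mlp_out, r_two)
```

Every operation here is an addition, a multiplication, or a comparison with zero, so nothing leaves the rationals and no named value appears.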