This capstone traces a complete tiny transformer forward pass. It shows the entire exact-or-named pipeline before focusing on each stage.
The whole transformer, by hand
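This book traces a complete forward pass of a tiny transformer, through to the next-token prediction, using exact arithmetic. The configuration is a three-token vocabulary {a, b, c}, model dimension 2, one attention head, one block, and an unembedding tied to the embedding table.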
$$\text{tiny transformer: dim } 2$$
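As a minimal sketch of this configuration, the embedding and position tables can be written down with exact rationals. The table values are the ones shown in the accompanying figure; the names `E`, `P`, and `embed` are illustrative choices, not something the text prescribes.

```python
from fractions import Fraction

# Vocabulary {a, b, c}, model dimension 2.
# Values come from the worked figure; the variable names are assumptions.
E = {
    "a": (Fraction(1), Fraction(0)),
    "b": (Fraction(0), Fraction(1)),
    "c": (Fraction(1), Fraction(1)),
}
# Position table for the two input positions.
P = [(Fraction(0), Fraction(0)), (Fraction(1), Fraction(0))]


def embed(tokens):
    """Return x_i = E[token_i] + P[i] for each position i."""
    return [
        tuple(e + p for e, p in zip(E[tok], P[i]))
        for i, tok in enumerate(tokens)
    ]


inputs = embed(["a", "b"])  # x0 and x1 for the input "a b"
```

Using `Fraction` rather than floats keeps every intermediate value exact, which is the point of tracing the pass by hand.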
Figure: Tiny transformer forward. Exact spine with named boundaries. The discrete spine is exact; softmax and the layernorm square root become named only at the boundary.

Input: a b
Embeddings: E[a]=(1,0), E[b]=(0,1), E[c]=(1,1)
Positions: P0=(0,0), P1=(1,0)
Weights: Wq=Wk=Wv=I; MLP=I+ReLU+I; gamma=(1,1), beta=(0,0); unembed tied to E

Position 0: fully exact path
x0=(1,0); Q0=K0=V0=(1,0)
score S00=1; softmax=[1] (exact)
attn0=(1,0); residual1=(2,0)
ln1: mean=1; centered=(1,-1); var=1; std=1; output=(1,-1)
MLP: ReLU=(1,0); mlp=(1,0)
residual2=(2,-1)
ln2: mean=1/2; centered=(3/2,-3/2); var=9/4; std=3/2; output=(1,-1)
logits: a=1, b=-1, c=0
argmax=a; output token=a

Position 1: named softmax boundary
x1=(1,1); Q1=(1,1)
K0=(1,0); K1=(1,1)
scores=[1, 2]
softmax=[e^1/(e^1+e^2), e^2/(e^1+e^2)] (named after the multi-entry softmax)

Ordered pipeline: tokens -> embed -> +pos -> attention -> +residual -> layernorm -> MLP -> +residual -> layernorm -> unembed -> logits -> argmax -> output token
Position 0 remains exact; position 1 stops at the named softmax.
Ordered pipeline
The ordered flow is tokens, embeddings, position table, attention, residual add, layernorm, MLP, residual add, layernorm, unembed, logits, argmax.
$$\text{tokens} \rightarrow \text{logits} \rightarrow \operatorname{argmax}$$
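The ordered flow can be sketched for position 0 with exact rationals. This is a hedged illustration rather than a reference implementation: the helper names (`exact_sqrt`, `layernorm`, `add`) are hypothetical, layernorm is assumed to use the population variance with no epsilon, and the second residual is assumed to add the MLP output to the first layernorm output, since that is what reproduces the figure's residual2=(2,-1).

```python
from fractions import Fraction
from math import isqrt


def exact_sqrt(q):
    """Square root of a non-negative Fraction, only if it is a perfect square."""
    n, d = q.numerator, q.denominator
    rn, rd = isqrt(n), isqrt(d)
    if rn * rn != n or rd * rd != d:
        raise ValueError(f"sqrt({q}) is not rational; keep it as a named value")
    return Fraction(rn, rd)


def layernorm(x):
    """Layernorm with gamma=(1,1), beta=(0,0); no epsilon (an assumption)."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered) / len(x)
    std = exact_sqrt(var)
    return [c / std for c in centered]


def add(u, v):
    """Elementwise vector sum."""
    return [a + b for a, b in zip(u, v)]


E = {
    "a": [Fraction(1), Fraction(0)],
    "b": [Fraction(0), Fraction(1)],
    "c": [Fraction(1), Fraction(1)],
}
P0 = [Fraction(0), Fraction(0)]

x0 = add(E["a"], P0)                       # tokens -> embed -> +pos
# Wq = Wk = Wv = I, so Q0 = K0 = V0 = x0. A single score gives softmax = [1].
attn0 = x0                                 # attention: 1 * V0
res1 = add(x0, attn0)                      # +residual
ln1 = layernorm(res1)                      # layernorm
mlp = [max(v, Fraction(0)) for v in ln1]   # MLP: identity, ReLU, identity
res2 = add(ln1, mlp)                       # +residual
ln2 = layernorm(res2)                      # layernorm
# Tied unembed: each logit is the dot product with that token's embedding.
logits = {tok: sum(a * b for a, b in zip(ln2, vec)) for tok, vec in E.items()}
next_token = max(logits, key=logits.get)   # argmax -> output token
```

Both variances here (1 and 9/4) are perfect squares, so `exact_sqrt` never raises on this path; a non-square variance is exactly where a named square-root boundary would be needed instead.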
Summary
The capstone combines the exact forward pass, exact ReLU block, named softmax boundary, named layernorm square-root boundary, and greedy output token.
$$\text{exact spine plus named boundaries}$$
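The named softmax boundary at position 1 can be illustrated as follows. This is a sketch under assumptions: `named_softmax` is a hypothetical helper that returns an exact value for a single score and otherwise leaves each weight as a symbolic string, and position 1 is assumed to attend to positions 0 and 1, as in the figure.

```python
from fractions import Fraction
import math


def dot(u, v):
    """Exact dot product of two vectors of Fractions."""
    return sum(a * b for a, b in zip(u, v))


def named_softmax(scores):
    """Exact for a single score; otherwise one symbolic string per entry."""
    if len(scores) == 1:
        return [Fraction(1)]
    denom = "+".join(f"e^{s}" for s in scores)
    return [f"e^{s}/({denom})" for s in scores]


# Position 1 of the worked example: Wq = Wk = I, so Q and K equal the inputs.
q1 = [Fraction(1), Fraction(1)]
keys = [
    [Fraction(1), Fraction(0)],   # K0
    [Fraction(1), Fraction(1)],   # K1
]

scores = [dot(q1, k) for k in keys]   # still exact integers
weights = named_softmax(scores)       # named, not evaluated

# Floating-point view for a sanity check only; it is no longer exact.
total = sum(math.exp(s) for s in scores)
approx = [math.exp(s) / total for s in scores]
```

The scores stay exact because they are dot products of rationals; only the exponential forces a named value, which is why the single-score softmax at position 0 can remain exactly [1].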