A real attention head shows the familiar attention structure with captured floating-point weights. The lesson also names the GPT-Neo differences from the tiny exact transformer.


Real attention

The heatmap shows one captured attention head. Every cell of the 4 by 4 head is displayed, and each weight is a captured fp32 value shown with an approximate label to 6 significant digits.

one real attention head: every cell displayed

Captured attention head: every cell in one real 4 by 4 attention head.
Captured attention weights: step 0, layer 0, head 0.
tensor: step00.attention_00, shape [1, 16, 4, 4]; every q,k cell displayed

      k0          k1          k2          k3
q0    ≈1          ≈0          ≈0          ≈0
q1    ≈0.479708   ≈0.520292   ≈0          ≈0
q2    ≈0.30865    ≈0.161946   ≈0.529404   ≈0
q3    ≈0.220487   ≈0.199623   ≈0.403692   ≈0.176197

Captured fp32, shown to 6 significant digits. No attention weight is omitted; colors are distilled display metadata, values are captured cells.
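The "familiar attention structure" can be checked directly on the displayed numbers. The following is a minimal sketch, not part of the lesson's tooling: it copies the 6-significant-digit labels from the heatmap, treats every cell labeled ≈0 as exactly 0.0 (an assumption, since only the rounded labels are available here), and uses hypothetical helper names.

```python
# Displayed weights for step 0, layer 0, head 0 (rows are queries q0..q3,
# columns are keys k0..k3). These are the rounded labels, not the raw fp32
# bytes; cells labeled "≈0" are assumed to be exactly 0.0.
weights = [
    [1.0,      0.0,      0.0,      0.0],
    [0.479708, 0.520292, 0.0,      0.0],
    [0.30865,  0.161946, 0.529404, 0.0],
    [0.220487, 0.199623, 0.403692, 0.176197],
]


def is_causal(w):
    """True if no query puts weight on a key position later than itself."""
    n = len(w)
    return all(w[q][k] == 0.0 for q in range(n) for k in range(q + 1, n))


def row_sums(w):
    """Sum of attention weights for each query row."""
    return [sum(row) for row in w]


def rows_are_normalized(w, tol=1e-5):
    """True if every row sums to 1 within a tolerance that absorbs rounding."""
    return all(abs(s - 1.0) <= tol for s in row_sums(w))


print(is_causal(weights))            # upper triangle is all zero
print(rows_are_normalized(weights))  # each row is a softmax distribution
```

The tolerance is needed because the labels are rounded: the q3 row of displayed values sums to 0.999999 rather than exactly 1.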

Architecture delta

Books 14 through 16 used a tiny vanilla causal-attention shape. The TinyStories model here is a GPT-Neo: its layers alternate GLOBAL and LOCAL (sliding-window) causal attention, and its MLP uses GELU rather than the exact ReLU from the hand-computed books.
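The two differences can be made concrete with a small sketch. The function names and the `window` parameter below are hypothetical, the local mask assumes a query may attend to itself and the `window - 1` keys before it, and the GELU shown is the erf form; implementations often use a tanh approximation instead, and this lesson does not state which variant or window size the captured model uses.

```python
import math


def global_causal_mask(n):
    """mask[q][k] is True when query q may attend to key k (every k <= q)."""
    return [[k <= q for k in range(n)] for q in range(n)]


def local_causal_mask(n, window):
    """Causal mask restricted to the most recent `window` keys, q included."""
    return [[q - window < k <= q for k in range(n)] for q in range(n)]


def relu(x):
    """ReLU: exactly zero for negative inputs."""
    return max(0.0, x)


def gelu(x):
    """Erf-form GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# With an illustrative window of 2, query 3 sees only keys 2 and 3.
print(local_causal_mask(4, 2)[3])
# GELU lets a small negative value through where ReLU outputs zero.
print(relu(-1.0), gelu(-1.0))
```

For a 4-token sequence, any window of 4 or more gives a local mask identical to the global one, so a short capture can look the same in both layer types.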

GPT-Neo: global/local attention plus GELU

Summary

The structure is recognizable from the attention book, but the magnitudes are captured floats from the real run. The color encoding is validated from the same captured bytes as the displayed weight.
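One way to read "validated from the same captured bytes" is that a single decode of the fp32 value feeds both the label and the color. The sketch below illustrates that idea only: the little-endian byte order, the grayscale mapping, and all function names are assumptions, not the lesson's actual rendering code.

```python
import struct


def decode_fp32(raw):
    """Decode 4 bytes as one little-endian fp32 value (byte order assumed)."""
    (value,) = struct.unpack("<f", raw)
    return value


def label(value):
    """Approximate label with 6 significant digits, e.g. '≈0.30865'."""
    return "≈" + format(value, ".6g")


def gray_level(value):
    """Hypothetical color scale: weight 0 maps to 0, weight 1 maps to 255."""
    clamped = min(1.0, max(0.0, value))
    return round(clamped * 255)


def render_cell(raw):
    """Derive the label and the color from one decode of the same bytes."""
    value = decode_fp32(raw)
    return label(value), gray_level(value)


raw = struct.pack("<f", 0.30865)  # stand-in for the captured bytes of q2->k0
print(render_cell(raw))           # ('≈0.30865', 79)
```

Because both outputs come from the same decoded value, a label and a color cannot disagree about which number they represent.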

captured value → display and heatmap color