This lesson shows one real attention head: the familiar attention structure, with weights captured as floating-point values from an actual run. It also names the ways GPT-Neo differs from the tiny exact transformer.
highlighted = computed this step
Real attention
The heatmap shows one captured attention head. Every cell of the head is displayed, with none elided, and each weight is a captured floating-point value shown with an approximate label.
one real attention head: every cell displayed
Architecture delta
Books 14 through 16 used a tiny vanilla causal-attention shape. The TinyStories model here uses the GPT-Neo architecture: its layers alternate between global and local (sliding-window) causal attention, and its MLP uses GELU rather than the ReLU used in the exact hand-computed books.
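The two differences can be sketched in a few lines. This is a minimal illustration, not the lesson's own code: the function names and the small `window` value are hypothetical, and `gelu_exact` uses the erf definition of GELU, whereas real implementations often substitute a tanh approximation.

```python
import math
import numpy as np

def global_causal_mask(n):
    # True where query position i may attend to key position j (j <= i).
    return np.tril(np.ones((n, n), dtype=bool))

def local_causal_mask(n, window):
    # Causal, and additionally limited to the most recent `window` positions.
    # `window` is a hypothetical illustration parameter, not the model's value.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

def relu(x):
    # Piecewise-linear activation used in the hand-computed books.
    return max(0.0, x)

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

if __name__ == "__main__":
    print(global_causal_mask(4).astype(int))
    print(local_causal_mask(4, window=2).astype(int))
    print(relu(-1.0), gelu_exact(-1.0))
```

With a window of 2, each row of the local mask keeps at most two cells (the current position and the one before it), while the global mask keeps the full lower triangle. Unlike ReLU, GELU returns a small negative value for negative inputs, which is one reason the real model's MLP outputs are not exact small numbers.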
GPT-Neo: global/local attention plus GELU
Summary
The structure is recognizable from the attention book, but the magnitudes are captured floats from the real run. Each cell's color is validated against the same captured bytes that produce its displayed weight.
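One way to picture a label and a color sharing a single source is the sketch below. Everything in it is an assumption made for illustration: the little-endian float32 layout, the three-decimal label, and the 0 to 255 gray scale are hypothetical and may not match how the lesson actually stores or renders its weights.

```python
import struct

def decode_weight(raw):
    # Assumes the captured weight is stored as 4 bytes, little-endian float32.
    (w,) = struct.unpack("<f", raw)
    return w

def approx_label(w):
    # Rounded text label; the "~" marks it as approximate.
    return f"~{w:.3f}"

def gray_level(w):
    # Hypothetical color encoding: clamp to [0, 1], then scale to 0..255.
    clamped = min(max(w, 0.0), 1.0)
    return round(clamped * 255)

if __name__ == "__main__":
    raw = struct.pack("<f", 0.25)   # stand-in for one captured cell
    w = decode_weight(raw)
    print(approx_label(w), gray_level(w))
```

Because both `approx_label` and `gray_level` read the same decoded value, a cell's label and color cannot drift apart in this sketch.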