Real Attention

A real attention head shows the familiar attention structure with captured floating-point weights. The lesson also names the GPT-Neo differences from the tiny exact transformer.

highlighted = computed this step

The heatmap shows one captured attention head. Every displayed cell is present, and each weight is a captured floating-point value shown with an approximate label.

\text{one real attention head: every cell displayed}

Architecture delta

Books 14 through 16 used a tiny vanilla causal-attention shape. TinyStories is GPT-Neo here: its layers alternate GLOBAL and LOCAL sliding-window causal attention, and its MLP uses GELU rather than the exact ReLU from the hand-computed books.

\text{GPT-Neo: global/local attention plus GELU}

Summary

The structure is recognizable from the attention book, but the magnitudes are captured floats from the real run. The color encoding is validated from the same captured bytes as the displayed weight.

\text{captured value}\rightarrow\text{display and heatmap color}