A causal mask controls which positions can contribute to each query: the query's own position and the earlier ones. The operation is structural and exact: masked cells are excluded before softmax rather than approximated.


The causal mask

The causal rule keeps every key whose index j is less than or equal to the query index i. Cells above the main diagonal (j > i) are masked out before softmax.

$$\text{keep } j \le i$$

Figure: Causal mask. Exact scores S_ij = Qi·Kj computed from the displayed integer vectors; the causal mask keeps j <= i, and upper-triangle entries are structurally excluded before softmax.

Q1=(1,0); Q2=(0,1); Q3=(1,1)
K1=(1,0); K2=(0,1); K3=(1,1)

      K1      K2      K3
Q1    1       masked  masked
Q2    0       1       masked
Q3    1       1       2
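The rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names `dot` and `causal_scores` are hypothetical, and the code uses 0-based indices, so "keep j <= i" becomes the slice `keys[: i + 1]`.

```python
# Integer vectors taken from the figure.
Q = [(1, 0), (0, 1), (1, 1)]
K = [(1, 0), (0, 1), (1, 1)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def causal_scores(queries, keys):
    """Return, for each query row i, only the scores with j <= i."""
    rows = []
    for i, q in enumerate(queries):
        # keys[: i + 1] holds keys 0..i, so later keys are never scored.
        rows.append([dot(q, k) for k in keys[: i + 1]])
    return rows


kept = causal_scores(Q, K)
for row in kept:
    print(row)
```

Under these assumptions the three printed rows match the unmasked cells of the figure: `[1]`, `[0, 1]`, and `[1, 1, 2]`.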

Masked cells are not numbers

A masked cell is structurally excluded: it holds no score and is not replaced by a numeric stand-in. Row 1 keeps one score, row 2 keeps two scores, and row 3 keeps three scores.

$$\text{allowed counts } 1, 2, 3$$
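One way to make "masked cells are not numbers" concrete is to mark them with a non-numeric sentinel and count what remains in each row. The sketch below is only an illustration: the `MASKED` sentinel and the `masked_grid` helper are hypothetical names, and the full score grid is the product Qi·Kj of the integer vectors from the figure.

```python
MASKED = None  # hypothetical sentinel: deliberately not a number

# Full 3x3 grid of Qi·Kj for Q1=(1,0), Q2=(0,1), Q3=(1,1) and the same K vectors.
full_scores = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 2],
]


def masked_grid(scores):
    """Replace every cell with j > i by the MASKED sentinel (0-based indices)."""
    return [
        [value if j <= i else MASKED for j, value in enumerate(row)]
        for i, row in enumerate(scores)
    ]


grid = masked_grid(full_scores)

# Count the cells that are still real scores in each row.
allowed_counts = [sum(cell is not MASKED for cell in row) for row in grid]
print(grid)
print(allowed_counts)
```

Because the sentinel is `None` rather than a float, any later arithmetic that accidentally touches a masked cell fails loudly instead of silently using a placeholder value.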

Summary

The mask is exact because it is just a position rule. It determines which slice of each row the softmax receives next.

$$\text{mask first; softmax later}$$
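The ordering "mask first; softmax later" can be sketched by applying a standard softmax to each kept row slice. This is a minimal sketch under stated assumptions: the text does not spell out the softmax itself, so the usual exponential-normalize definition is assumed, and `softmax` and `kept_rows` are hypothetical names. The kept slices are the unmasked scores from the figure.

```python
import math

# Row slices that survive the causal mask (j <= i), taken from the figure.
kept_rows = [[1], [0, 1], [1, 1, 2]]


def softmax(values):
    """Standard softmax over a non-empty list, shifted by the max for stability."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Softmax only ever sees the kept slice, so masked cells get no weight at all.
weights = [softmax(row) for row in kept_rows]
for row in weights:
    print(row)
```

Each output row has as many weights as the row kept scores (1, 2, 3) and sums to 1. Many array-based implementations instead keep a full square grid and set masked scores to negative infinity before softmax; the slicing used here is chosen only because it mirrors the structural exclusion described above.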