April 2026 Puzzle 1(a)

How to read Head 3's OV matrix after finding max-selective attention

The attention result says that, from [ANS], Head 3 attends to a max-valued number token. The OV analysis asks what Head 3 writes once it has selected that token.

What we know so far

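In the stratified 10,000-example sample, the number position that received Head 3's highest attention held a maximum digit on every sampled input (1.0000 in the table below). Head 3 also placed almost all of its attention mass on max positions (0.9899), leaving 0.0016 on special tokens.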

Head   Top number is max   Mass to max   Mass to specials
0      0.5107              0.1748        0.8237
1      0.0388              0.0000        1.0000
2      0.6108              0.4393        0.5571
3      1.0000              0.9899        0.0016

Aggregate [ANS] attention metrics by head, from the report.
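
As a rough illustration of what the three columns measure, here is a minimal per-example sketch. The function name `ans_attention_metrics`, the token layout, and the toy attention row are assumptions made for illustration, not the report's code. In particular it treats every non-number position as a special token, and the report presumably aggregates such per-example values over the sample.

```python
import numpy as np

def ans_attention_metrics(attn_row, digits, number_positions):
    """Per-example metrics for one head's attention row from [ANS].

    attn_row: 1-D attention weights over all attended positions (sums to 1).
    digits: the digit stored at each number position, in order.
    number_positions: indices into attn_row where those digits sit.
    """
    attn_row = np.asarray(attn_row, dtype=float)
    digits = np.asarray(digits)
    number_positions = np.asarray(number_positions)

    number_attn = attn_row[number_positions]
    is_max = digits == digits.max()  # ties count as max positions

    top_is_max = bool(is_max[np.argmax(number_attn)])
    mass_to_max = float(number_attn[is_max].sum())
    # Assumption: everything that is not a number position is a special token.
    mass_to_specials = float(attn_row.sum() - number_attn.sum())
    return top_is_max, mass_to_max, mass_to_specials

# Toy row: one special token at index 0, then digits 3 7 2 5 1 (hypothetical layout).
attn = [0.002, 0.001, 0.990, 0.002, 0.004, 0.001]
metrics = ans_attention_metrics(attn, digits=[3, 7, 2, 5, 1],
                                number_positions=[1, 2, 3, 4, 5])
print(metrics)
```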

The split: QK selects, OV writes

QK side

Attention scores decide which earlier token Head 3 reads from. Our result points to a max-selection mechanism: from [ANS], Head 3's attention lands on a maximum digit position.

OV side

The value and output matrices decide what information is written from the selected token into the residual stream at [ANS]. This is the next thing to measure.
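
The split can be seen in a single-head attention step. The sketch below uses random stand-in weights and a six-token sequence whose last row plays the role of [ANS]. These shapes and names are assumptions for illustration, biases are left out, and nothing here is read from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 64, 16, 6

# Random stand-ins for one head. W_Q, W_K, W_V use (out_features, in_features);
# W_O is kept in row-vector form (d_head, d_model).
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_head, d_model))

h = rng.normal(size=(seq_len, d_model))  # residual stream; last row stands in for [ANS]

# QK side: scores from the [ANS] query decide where the head reads.
q = h[-1] @ W_Q.T                        # (16,)
k = h @ W_K.T                            # (6, 16)
scores = k @ q / np.sqrt(d_head)         # (6,)
attn = np.exp(scores - scores.max())
attn /= attn.sum()                       # softmax over positions

# OV side: what each position would write if it were selected.
per_position_write = (h @ W_V.T) @ W_O   # (6, 64)

# The head's write at [ANS] is the attention-weighted mix of those writes.
head_write = attn @ per_position_write   # (64,)
```

Because `attn` only enters as mixing weights, the OV product can be studied on its own, one source token at a time.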

Matrix sizes for Head 3 OV

The model has d_model = 64, n_heads = 4, so each attention head has d_head = 16. PyTorch stores linear weights as (out_features, in_features), so row-vector equations use transposed weights.

Token residual

h_j = tok_embed[d] + pos_embed[pos]
h_j shape: (1, 64)

Value projection

W_V3.weight shape: (16, 64)
h_j @ W_V3.weight.T
(1, 64) @ (64, 16) = (1, 16)

Output projection slice

Head 3 slice: columns 48:64 of the (64, 64) output projection weight, shape (64, 16)
W_O3 = that slice transposed, shape: (16, 64)
value @ W_O3
(1, 16) @ (16, 64) = (1, 64)
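
To check that a per-head slice is the right object, the sketch below confirms that applying the full output projection to concatenated head values equals the sum of per-head writes. It assumes heads are concatenated in order along the feature axis (which is what "columns 48:64" implies) and ignores any output bias. The weights are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 64, 4
d_head = d_model // n_heads  # 16

# Stand-in for the full output projection weight, (out_features, in_features).
W_O_weight = rng.normal(size=(d_model, d_model))

# Per-head value vectors for one position; head i fills columns 16*i : 16*(i+1).
values = rng.normal(size=(n_heads, d_head))
concat = values.reshape(1, d_model)

# Full projection in row-vector form.
full_write = concat @ W_O_weight.T                     # (1, 64)

# Per-head slices: take the head's input columns, then transpose to (16, 64).
head_slices = [W_O_weight[:, i * d_head:(i + 1) * d_head].T for i in range(n_heads)]
per_head_writes = [values[i:i + 1] @ head_slices[i] for i in range(n_heads)]

W_O3 = head_slices[3]                                  # columns 48:64, transposed
```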

Unembedding

unembed.weight shape: (14, 64)
residual_write @ unembed.weight.T
(1, 64) @ (64, 14) = (1, 14)

Combined OV-to-logit map

OV_logits_3 = W_V3.weight.T @ W_O3 @ unembed.weight.T

(64, 16) @ (16, 64) @ (64, 14) = (64, 14)
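
A quick sanity check on the combined map: since matrix multiplication is associative, folding the three matrices together should give the same logits as pushing h_j through each step. The sketch uses random stand-ins with the shapes listed above, and the variable names with a `_weight` suffix are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_vocab = 64, 16, 14

W_V3_weight = rng.normal(size=(d_head, d_model))       # (16, 64), PyTorch layout
W_O3 = rng.normal(size=(d_head, d_model))              # (16, 64), row-vector form
unembed_weight = rng.normal(size=(n_vocab, d_model))   # (14, 64), PyTorch layout

# Combined OV-to-logit map.
OV_logits_3 = W_V3_weight.T @ W_O3 @ unembed_weight.T  # (64, 14)

# Step-by-step path for one token residual.
h_j = rng.normal(size=(1, d_model))
value = h_j @ W_V3_weight.T                            # (1, 16)
residual_write = value @ W_O3                          # (1, 64)
logits_stepwise = residual_write @ unembed_weight.T    # (1, 14)

# Same result through the combined map.
logits_combined = h_j @ OV_logits_3                    # (1, 14)
```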

How to interpret the output

To ask what Head 3 does when it attends to digit d, multiply digit embeddings by the combined OV-to-logit map.

digit_effects = tok_embed.weight[:10] @ OV_logits_3

(10, 64) @ (64, 14) = (10, 14)

The entry digit_effects[source_digit, output_token] is the direct logit contribution to output_token when Head 3 attends fully to a token with embedding tok_embed[source_digit], with the position embedding left out for now. The most important part is the first ten output columns, one for each answer digit 0 through 9.
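
As a sketch of how the table might be read in practice, the snippet below builds digit_effects from random stand-ins and pulls out one row and the digit-to-digit block. It assumes token id d is digit d for d < 10 and that the token embedding has 14 rows to match the unembedding. Both are assumptions, as is every `_weight` variable name.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model, n_digits = 14, 64, 10

# Random stand-ins with the shapes given in the text.
tok_embed_weight = rng.normal(size=(n_vocab, d_model))
OV_logits_3 = rng.normal(size=(d_model, n_vocab))

digit_effects = tok_embed_weight[:n_digits] @ OV_logits_3  # (10, 14)
digit_block = digit_effects[:, :n_digits]                  # (10, 10) digit-to-digit part

# Row 7: the direct effect on all 14 output logits when the head reads a 7.
effects_of_reading_7 = digit_effects[7]

# For each source digit, the output digit that receives the largest boost.
most_boosted_output = digit_block.argmax(axis=1)
```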

What a copy circuit would look like

A clean copy-style OV circuit would make the digit-to-digit matrix diagonal: source 7 boosts output 7, source 9 boosts output 9, and so on.

source 0 -> output 0 high
source 1 -> output 1 high
...
source 9 -> output 9 high

[Figure: Head attention to number positions for example input [3, 7, 2, 5, 1]. Head 3 selects the max digit.]
[Figure: Attention behavior grouped by true max value. Head 3 stays max-selective across the sample.]
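
To make the copy-circuit picture concrete, here is a hand-built toy in which every matrix is a slice of an identity, so the digit-to-digit block comes out exactly diagonal. These weights are constructed for illustration and say nothing about what the trained Head 3 contains. The shifted variant shows what a non-copy pattern would look like.

```python
import numpy as np

d_model, d_head, n_vocab, n_digits = 64, 16, 14, 10

# Hand-built toy weights (not the trained model).
tok_embed_weight = np.eye(n_vocab, d_model)   # token t -> basis vector e_t
W_V_weight = np.eye(d_head, d_model)          # keep the first 16 residual dimensions
W_O = np.eye(d_head, d_model)                 # write them back to the same dimensions
unembed_weight = np.eye(n_vocab, d_model)     # logit t reads dimension t

OV_logits = W_V_weight.T @ W_O @ unembed_weight.T                       # (64, 14)
copy_block = (tok_embed_weight[:n_digits] @ OV_logits)[:, :n_digits]    # (10, 10)

# Clean copy: each source digit boosts its own output digit most.
copies = np.array_equal(copy_block.argmax(axis=1), np.arange(n_digits))

# Contrast: a block where source d boosts output (d + 1) mod 10 instead.
shifted_block = np.roll(copy_block, 1, axis=1)
shifted_copies = np.array_equal(shifted_block.argmax(axis=1), np.arange(n_digits))
```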

The next concrete test

  1. Compute token-only digit_effects[:10, :10] for Head 3.
  2. Plot it as a 10 by 10 heatmap: source digit on rows, output digit on columns.
  3. Repeat with position included: tok_embed[d] + pos_embed[pos], producing five heatmaps.
  4. Compare diagonal strength against off-diagonal strength.
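
The steps above can be sketched as follows. Everything model-specific is a stand-in: the weights are random, `number_positions` and the size of `pos_embed_weight` are placeholders, and `diagonal_strength` is just one possible summary (mean diagonal minus mean off-diagonal), not a metric taken from the report.

```python
import numpy as np

def diagonal_strength(block):
    """Mean diagonal entry minus mean off-diagonal entry of a square matrix."""
    block = np.asarray(block, dtype=float)
    n = block.shape[0]
    diag_mean = np.trace(block) / n
    off_mean = (block.sum() - np.trace(block)) / (n * n - n)
    return diag_mean - off_mean

rng = np.random.default_rng(0)
d_model, d_head, n_vocab, n_digits = 64, 16, 14, 10
number_positions = [0, 1, 2, 3, 4]  # hypothetical indices of the five number slots

# Random stand-ins; swap in the trained model's tensors.
tok_embed_weight = rng.normal(size=(n_vocab, d_model))
pos_embed_weight = rng.normal(size=(8, d_model))  # 8 positions is a placeholder
W_V3_weight = rng.normal(size=(d_head, d_model))
W_O3 = rng.normal(size=(d_head, d_model))
unembed_weight = rng.normal(size=(n_vocab, d_model))

OV_logits_3 = W_V3_weight.T @ W_O3 @ unembed_weight.T

# Step 1: token-only digit-to-digit block.
token_only = (tok_embed_weight[:n_digits] @ OV_logits_3)[:, :n_digits]

# Step 3: one block per number position, using tok_embed[d] + pos_embed[pos].
with_position = {
    pos: ((tok_embed_weight[:n_digits] + pos_embed_weight[pos]) @ OV_logits_3)[:, :n_digits]
    for pos in number_positions
}

# Step 4: compare diagonal strength across the six blocks.
scores = {"token_only": diagonal_strength(token_only)}
scores.update({f"pos_{pos}": diagonal_strength(b) for pos, b in with_position.items()})
```

Each block can be passed to any heatmap routine (for example matplotlib's `imshow`) for step 2. One thing this purely linear sketch makes visible: adding pos_embed[pos] adds the same vector to every source row, so the five position heatmaps differ from the token-only one by a per-column offset only, and this particular score comes out the same for all of them. If the real model normalizes the residual before the head reads it, that no longer has to hold.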

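If the digit-to-digit block digit_effects[:10, :10] is close to diagonal, with diagonal entries clearly larger than the off-diagonal ones, then Head 3 likely performs both halves of the solution: QK selects the maximum token, and OV copies the selected digit into the answer logits.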