QK side
Attention scores decide which earlier token Head 3 reads from. Our
result points to a max-selection mechanism: from [ANS],
Head 3's attention lands on a maximum digit position.
April 2026 Puzzle 1(a)
The attention result says Head 3 usually looks from [ANS] to a
max-valued number token. The OV analysis asks what Head 3 writes after it
has selected that token.
In the stratified 10,000-example sample, Head 3's top number-position attention chose a max digit on every sampled input. It also sent almost all attention mass to max positions.
| Head | Top number is max | Mass to max | Mass to specials |
|---|---|---|---|
| 0 | 0.5107 | 0.1748 | 0.8237 |
| 1 | 0.0388 | 0.0000 | 1.0000 |
| 2 | 0.6108 | 0.4393 | 0.5571 |
| 3 | 1.0000 | 0.9899 | 0.0016 |
[ANS] attention metrics from the report.
Attention scores decide which earlier token Head 3 reads from. Our
result points to a max-selection mechanism: from [ANS],
Head 3's attention lands on a maximum digit position.
The value and output matrices decide what information is written from
the selected token into the residual stream at [ANS].
This is the next thing to measure.
The model has d_model = 64, n_heads = 4, so each
attention head has d_head = 16. PyTorch stores linear weights
as (out_features, in_features), so row-vector equations use
transposed weights.
h_j = tok_embed[d] + pos_embed[pos]
h_j shape: (1, 64)
W_V3.weight shape: (16, 64)
h_j @ W_V3.T
(1, 64) @ (64, 16) = (1, 16)
Head 3 slice: columns 48:64
W_O3 shape: (16, 64)
value @ W_O3
(1, 16) @ (16, 64) = (1, 64)
unembed.weight shape: (14, 64)
residual_write @ unembed.T
(1, 64) @ (64, 14) = (1, 14)
OV_logits_3 = W_V3.T @ W_O3 @ unembed.weight.T
(64, 16) @ (16, 64) @ (64, 14) = (64, 14)
To ask what Head 3 does when it attends to digit d, multiply
digit embeddings by the combined OV-to-logit map.
digit_effects = tok_embed.weight[:10] @ OV_logits_3
(10, 64) @ (64, 14) = (10, 14)
The entry digit_effects[source_digit, output_token] is the
direct logit contribution caused by Head 3 reading that source digit.
The most important part is the first ten output columns, one for each
answer digit 0 through 9.
A clean copy-style OV circuit would make the digit-to-digit matrix
diagonal: source 7 boosts output 7, source
9 boosts output 9, and so on.
source 0 -> output 0 high
source 1 -> output 1 high
...
source 9 -> output 9 high
[3, 7, 2, 5, 1]: Head 3 selects the max digit.
digit_effects[:10, :10] for Head 3.tok_embed[d] + pos_embed[pos], producing five heatmaps.If the OV matrix is diagonal-ish, then Head 3 likely performs both halves of the solution: QK selects the maximum token, and OV copies the selected digit into the answer logits.