Per-layer attention diagnostics

Two measures of attention similarity bias across 12 transformer layers.
[Figure, left panel: "Self-value projection in output", cos(Y, V_self) vs. layer (0-11), y-range roughly 0.22-0.63, curves for B/32, B/16, B/14.]
[Figure, right panel: "Inter-token value similarity", cos(V_i, V_j) vs. layer (0-11), y-range roughly -0.01-0.46, curves for B/32, B/16, B/14.]
Each value is averaged over 8 ImageNet validation images per model. All models are baselines (no XSA) evaluated at their training checkpoints.
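For concreteness, the sketch below shows one plausible way to compute both diagnostics for a single layer and image. This is an illustrative PyTorch sketch, not the authors' pipeline: the function name, the assumption that attention heads are already merged into a single (tokens x dim) tensor, and the averaging choices are all assumptions.

```python
import torch
import torch.nn.functional as F

def attention_value_diagnostics(Y: torch.Tensor, V: torch.Tensor):
    """Compute both per-layer diagnostics for one image (hypothetical helper).

    Y: (N, D) attention-block output per token.
    V: (N, D) value vector per token (heads assumed already merged).
    Returns:
        self_cos: mean over tokens of cos(Y_i, V_i), i.e. cos(Y, V_self).
        pair_cos: mean over i != j of cos(V_i, V_j).
    """
    # cos(Y, V_self): alignment of each token's output with its own value.
    self_cos = F.cosine_similarity(Y, V, dim=-1).mean()

    # cos(V_i, V_j): mean pairwise similarity between distinct tokens' values.
    Vn = F.normalize(V, dim=-1)          # unit-norm value vectors
    sim = Vn @ Vn.T                      # (N, N) cosine-similarity matrix
    n = sim.shape[0]
    pair_cos = (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))
    return self_cos.item(), pair_cos.item()

# Example with random stand-ins for one layer's activations
# (197 tokens = 196 patches + CLS for a 224px ViT-B/16, D = 768).
N, D = 197, 768
Y, V = torch.randn(N, D), torch.randn(N, D)
print(attention_value_diagnostics(Y, V))
```

Running this per layer over the 8 validation images and averaging would reproduce the structure of the two panels above, under the stated assumptions about where Y and V are read out.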