XSA scales with patch similarity
XSA helps when value vectors are similar (B/14) but hurts when they are distinct (B/32).
[Figure panel: XSA effect by model. XSA delta top-1 (pp): B/32 -7.47pp, B/16 -2.34pp, B/14 +3.44pp.]
[Figure panel: Value similarity predicts XSA effect. x-axis: Baseline mean cos(Vi, Vj); y-axis: XSA delta top-1 (pp); points: B/32, B/16, B/14.]
Primary variants: B/32 default, B/16 default, B/14 v2. cos(Vi, Vj) averaged across all 12 layers from checkpoint diagnostics.
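
The similarity statistic in the note above can be sketched as follows. This is a minimal illustration, not the diagnostic code behind the figure: the function names are hypothetical, and it assumes the mean is taken over distinct pairs (i < j) of value vectors within a layer, then averaged with equal weight across layers. How the actual checkpoint diagnostics handle heads, images, or self-pairs is not stated in the source.

```python
import math
from itertools import combinations


def mean_pairwise_cosine(vectors):
    """Mean cosine similarity over all distinct pairs (i < j) of vectors.

    Assumption: self-pairs (i == j) are excluded; the source does not say.
    """
    # Normalise each vector to unit length so a dot product equals the cosine.
    units = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("cosine is undefined for a zero vector")
        units.append([x / norm for x in v])

    pairs = list(combinations(units, 2))
    if not pairs:
        raise ValueError("need at least two vectors")

    total = sum(sum(a * b for a, b in zip(u, w)) for u, w in pairs)
    return total / len(pairs)


def mean_over_layers(per_layer_values):
    """Equal-weight average of the per-layer statistic (hypothetical helper).

    per_layer_values: one list of value vectors per layer, e.g. 12 entries
    for the 12 layers mentioned in the note above.
    """
    scores = [mean_pairwise_cosine(layer) for layer in per_layer_values]
    return sum(scores) / len(scores)
```

For example, a layer whose value vectors are mutually orthogonal scores 0, a layer whose value vectors all point the same way scores 1, and `mean_over_layers` of those two layers gives 0.5.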