XSA scales with patch similarity

XSA improves top-1 accuracy when value vectors are similar (B/14, +3.44 pp) but reduces it when they are distinct (B/32, -7.47 pp).
[Figure, left panel: "XSA effect by model". XSA delta top-1 (pp): B/32 -7.47, B/16 -2.34, B/14 +3.44.]
[Figure, right panel: "Value similarity predicts XSA effect". x-axis: baseline mean cos(Vi, Vj), range 0.030 to 0.250; y-axis: XSA delta top-1 (pp); points: B/32, B/16, B/14.]
Primary variants: B/32 default, B/16 default, B/14 v2. cos(Vi, Vj) averaged across all 12 layers from checkpoint diagnostics.
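
For readers who want to reproduce a similarity number of this kind, the following is a minimal sketch of one way to compute a mean cos(Vi, Vj) and average it across layers. It is not the code behind the checkpoint diagnostics: the function names are hypothetical, and it assumes each layer provides a list of nonzero value vectors (one per patch) and that the mean is taken over all unordered pairs i != j. How the original diagnostics handle heads, images, and the class token is not stated here.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(values):
    """Mean cos(V_i, V_j) over all unordered pairs i != j in one layer.

    `values` is a list of value vectors, one per patch (assumed layout).
    """
    pairs = list(combinations(values, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def mean_over_layers(per_layer_values):
    """Average the per-layer similarity across layers (e.g. all 12)."""
    per_layer = [mean_pairwise_cosine(v) for v in per_layer_values]
    return sum(per_layer) / len(per_layer)


if __name__ == "__main__":
    # Toy data: one layer with orthogonal vectors, one with parallel vectors.
    distinct = [[1.0, 0.0], [0.0, 1.0]]
    similar = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    print(mean_over_layers([distinct, similar]))
```

On the toy data, the orthogonal layer contributes a similarity of 0 and the parallel layer a similarity of 1, so the two ends of the x-axis in the right panel can be read as "patches carry distinct values" versus "patches carry near-duplicate values".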