
Attention Similarity Bias is Worse in Vision

· machine-learning, vision-transformers

70% of what a ViT token gets back from attention is information it already has. I measure this across patch sizes, show it compounds through layers, and test a projection that removes it.

In a trained ViT-B/14, inter-token value similarity grows consistently after layer 5. In a ViT-B/32, it stays flat. The self-similarity compounds: similar patches produce similar value vectors, attention mixes them, and the next layer’s input is more similar still.

Zhai (2026) identified this pattern in language models and called it the “attention similarity bias.” The fix, Exclusive Self Attention (XSA), projects out the $V_{\text{self}}$ component after attention:

```python
import torch.nn.functional as F

# Remove the component of the attention output Y that lies along the
# token's own (unit-normalised) value vector V.
Vn = F.normalize(V, dim=-1)
Z = Y - (Y * Vn).sum(dim=-1, keepdim=True) * Vn
```

In language at 2048 tokens, $\cos(Y, V_{\text{self}})$ reaches 0.1 to 0.6. In vision at 256 tokens, it hits 0.7. I tested XSA on CLIP vision transformers to see whether the stronger bias translates to stronger gains.
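This per-token alignment is straightforward to measure. A minimal sketch, assuming `Y` is the attention output and `V` holds each token's own value vector (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def self_alignment(Y: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between each token's attention output Y
    and its own value vector (the V_self direction).

    Y, V: (batch, tokens, dim) -- V[b, i] is token i's own value vector.
    """
    return F.cosine_similarity(Y, V, dim=-1).mean()

# Illustrative shapes: batch of 2 images, 256 tokens, 64-dim values.
Y = torch.randn(2, 256, 64)
V = torch.randn(2, 256, 64)
print(self_alignment(Y, V).item())  # near 0 for unrelated random vectors
```

Run this per layer on captured activations to reproduce the 0.7 figure above.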


Where it comes from

For a 224px input, a ViT-B/14 divides the image into a 16x16 grid. A 14x14-pixel patch of grass looks like the patch of grass next to it. A ViT-B/32 gets just a 7x7 grid, where each patch captures enough of the scene to be distinct.
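The grid sizes follow directly from the patch arithmetic at a 224px input (class token ignored):

```python
# Smaller patches mean more, and more locally redundant, tokens.
for patch in (32, 16, 14):
    grid = 224 // patch
    print(f"ViT-B/{patch}: {grid}x{grid} grid, {grid * grid} tokens")
# → 7x7 (49), 14x14 (196), 16x16 (256)
```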

At ViT-B/32, the most attended patches are grass, trees, jacket. At ViT-B/14, they’re near-copies of the target. This feeds directly into the per-layer diagnostic:

B/14’s inter-token value similarity doubles between layers 5 and 11. B/32 and B/16 never ignite. The similarity is concentrated along the $V_{\text{self}}$ direction: measuring $\cos(V_i, V_j)$ on just the orthogonal components gives near-zero for all models. XSA targets exactly the axis where the redundancy accumulates.
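The per-layer diagnostic itself is a few lines: mean pairwise cosine similarity between value vectors of different tokens within one image. A sketch, with illustrative shapes:

```python
import torch
import torch.nn.functional as F

def inter_token_value_similarity(V: torch.Tensor) -> torch.Tensor:
    """Mean cos(V_i, V_j) over all pairs of *different* tokens.

    V: (batch, tokens, dim) value vectors from one layer.
    """
    Vn = F.normalize(V, dim=-1)
    sim = Vn @ Vn.transpose(-1, -2)            # (batch, tokens, tokens)
    n = sim.shape[-1]
    off_diag = ~torch.eye(n, dtype=torch.bool)  # drop cos(V_i, V_i) = 1
    return sim[:, off_diag].mean()
```

Tracking this quantity across layers gives the curves described above: rising after layer 5 for B/14, flat for B/16 and B/32.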

Modern VLMs push well beyond 256 tokens. Qwen2.5-VL uses up to 16,384 visual tokens. SigLIP runs at 384px (576 tokens). The compounding should be stronger still.


Training results

Baseline $\cos(V_i, V_j)$ almost perfectly predicts XSA’s effect. The +3.4pp gain at B/14 compares favourably to the language results (+0.26 to +1.36 across 0.7B to 2.7B). These results at 256 tokens are just the beginning of the curve.


Beyond $V_{\text{self}}$

XSA’s strength is that the redundant direction comes from the architecture, not from data. The residual stream carries $V_{\text{self}}$, so $V_{\text{self}}$ in the attention output is structurally redundant.

Its weakness is the hard projection. At 49 tokens, the direction is correct but the redundancy isn’t strong enough to justify fully removing it. A learned per-head scalar $\alpha$ on the projection strength would handle this: project strongly in later layers of B/14 where similarity compounds, back off at B/32 where it doesn’t.

This sits between XSA and Gated Attention, which learns a full element-wise sigmoid gate on each head’s output. Gated Attention can suppress any kind of uninformative output but has to discover what to suppress from data. A gated $V_{\text{self}}$ projection would be more targeted: the architecture tells you what to remove, the gate learns how much.
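A sketch of that middle ground, assuming per-head attention outputs `Y` and per-token value vectors `V`; the class name, gate initialisation, and shapes are hypothetical, not from any paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSelfProjection(nn.Module):
    """Soft version of XSA's projection: a learned per-head scalar
    controls how much of the V_self component is subtracted, rather
    than always removing it fully. (Hypothetical design sketch.)"""

    def __init__(self, num_heads: int):
        super().__init__()
        # One gate logit per head; sigmoid(0) = 0.5 at initialisation.
        self.alpha = nn.Parameter(torch.zeros(num_heads))

    def forward(self, Y: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        # Y, V: (batch, heads, tokens, head_dim)
        Vn = F.normalize(V, dim=-1)
        coeff = (Y * Vn).sum(dim=-1, keepdim=True)   # component along V_self
        gate = torch.sigmoid(self.alpha).view(1, -1, 1, 1)
        return Y - gate * coeff * Vn                 # partial projection
```

With the gate saturated high this reduces to XSA's hard projection; saturated low, it leaves the attention output untouched, which is the behaviour you'd want at B/32.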

The same diagnostic approach could identify redundant directions elsewhere in the transformer block. Measure alignment between a module’s output and what the residual stream already carries. If a module consistently biases toward information the skip connection already provides, there’s a projection to try.
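Concretely, the generalised diagnostic is the same cosine measurement applied per layer, comparing each module's output against the residual-stream input it receives. A hypothetical helper, meant to be fed activations captured with forward hooks:

```python
import torch
import torch.nn.functional as F

def layerwise_redundancy(module_outs, residuals):
    """Per-layer mean cosine alignment between a module's output and
    the residual stream entering it. Values near 1 flag a module that
    mostly re-emits information the skip connection already carries."""
    return [F.cosine_similarity(out, res, dim=-1).mean().item()
            for out, res in zip(module_outs, residuals)]
```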


Trained on CC12M with OpenCLIP.