Attention Similarity Bias is Worse in Vision
70% of what a ViT token gets back from attention is information it already has. I measure this across patch sizes, show it compounds through layers, and test a projection that removes it.
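The measurement and the fix can both be sketched in a few lines. This is a minimal toy illustration, not the post's actual method: `x` stands in for one ViT token's residual-stream vector, `y` for what attention returns to it, and all data is synthetic. "Information it already has" is proxied here by the squared cosine between `y` and `x`, and the tested fix is a simple projection removing `y`'s component along `x`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical data): a token's own vector x, and the
# attention output y it receives, which partly repeats x.
d = 64
x = rng.normal(size=d)
y = 0.8 * x + 0.6 * rng.normal(size=d)

def self_overlap(x, y):
    """Fraction of y's energy lying along x (cosine squared) --
    a crude proxy for 'information the token already has'."""
    cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return cos ** 2

def remove_self_component(x, y):
    """Project out of y the component parallel to x."""
    return y - (x @ y / (x @ x)) * x

before = self_overlap(x, y)
y_clean = remove_self_component(x, y)
after = self_overlap(x, y_clean)
```

After the projection, `y_clean` is exactly orthogonal to `x`, so the self-overlap drops to zero; in a real ViT one would apply this per token, per head, and track how the overlap compounds across layers.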
Hi! I'm an ML Principal at Salesforce. Previously, I was a founding ML engineer at Convergence, where I created our first web agents and developed our open-source web agent, Proxy Lite. Before that, I was an ML engineer at Cohere, where I started the Command project and led the creation of our state-of-the-art reward models.
Why I think Coding Agents are the future of agentic computer use.
A browser-based editor for making retro games with Pyxel, a Python retro game engine.
Finding the physics of water bending.
Learning from state changes in 1 million Python programs.