How will AI Agents use computers?
Why I think Coding Agents are the future of agentic computer use.
The first AGI will be a computer-using agent. The computer is the highest-leverage tool known to man, so how AI agents will use computers is the question of our time.
In academia, the term “computer-use agents” (CUAs) refers to agents that click and type on a computer screen just as a human would. CUAs work by sending screenshots to a VLM and having it decide on the next keyboard or mouse action to complete the task.
Unlike coding agents and chat models, CUAs aren’t reaching superhuman levels of performance, and even when they reach human level they’re slow and expensive.
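
To make that concrete, here’s a minimal sketch of the loop a CUA runs. The helper names (take_screenshot, query_vlm, perform_action) are hypothetical placeholders, not any specific vendor’s API:

```python
def take_screenshot() -> bytes:
    ...  # placeholder: capture the current screen as an image

def query_vlm(screenshot: bytes, task: str, history: list[str]) -> str:
    ...  # placeholder: ask the VLM for the next action, e.g. "click(13, 594)" or "done"

def perform_action(action: str) -> None:
    ...  # placeholder: translate the action string into a real keyboard/mouse event

def run_cua(task: str, max_steps: int = 50) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()          # every step needs a fresh screenshot
        action = query_vlm(screenshot, task, history)
        if action == "done":
            break
        perform_action(action)
        history.append(action)
```

Every action costs a full screenshot-plus-inference round trip, which is where the slowness and expense come from.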
So what’s the alternative? There’s another kind of agent that runs on a computer: a “coding agent”. Coding agents like Claude Code skip the virtual keyboard and mouse and instead just work through the terminal.

Initially this seems like a step backwards from CUAs: by working only through the terminal we lose the graphical user interface, and surely that leaves the agent worse off?
Let’s go back to why we got graphical user interfaces in the first place.
Why have a GUI in the first place?
In the 1960s GUIs were developed to allow non-specialists to work on documents, with the hope of creating the paperless offices we work in today. You’ll notice that to this day “desktops” and “folders” are computer metaphors for the sheets of paper common in 20th-century offices.
Specifically, GUIs implemented:
- Displaying affordances in a simple, easy-to-understand format.
- Pointing at visible objects instead of memorising commands.
Now let’s think about what VLMs actually get from using GUIs:
Affordances save you from memorising commands ❌
VLMs can store a far wider breadth of knowledge about applications than any person can - they don’t need visual cues to know how to change the left margin in Word. Even when they lack that inherent knowledge, they’re far more efficient at processing text-based application information than at viewing a small subset of features on a screen.
Pointing and clicking is easier than typing ❌
For VLMs, clicking is just writing out click(13,594), and after each click the VLM needs to view a new screenshot.
Instead of coding:
System.NetworkPreferences.MobileData.toggle()
The agent must take several separate actions:
click(System), click(Network Preferences), click(Toggle Mobile Data)
Can’t you just chain together multiple clicks at once? Yes, but what if a pop-up or alert changes the layout? How would you decide to cut the sequence short? Also, your agent would have to memorise the position of each element, and what happens when the underlying GUI changes?
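
Here’s a hedged Python sketch of the two routes for the mobile-data example. settings-cli is a made-up command, and the GUI helpers are hypothetical placeholders passed in as parameters:

```python
import subprocess

# Coding-agent route: one self-describing command, no screenshots, no coordinates.
def toggle_mobile_data_via_cli() -> None:
    subprocess.run(["settings-cli", "mobile-data", "toggle"], check=True)

# CUA route: every click is a separate VLM round trip, because the next
# coordinate can only be known from a fresh screenshot of the current layout.
def toggle_mobile_data_via_gui(take_screenshot, query_vlm, click) -> None:
    for target in ["System", "Network Preferences", "Toggle Mobile Data"]:
        screenshot = take_screenshot()
        x, y = query_vlm(screenshot, f"Where is '{target}'?")
        click(x, y)
```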
So none of the original justifications for the graphical user interface hold up for modern VLMs. The only remaining justification is interacting with legacy (pre-AI) applications by pretending to be a human user.
Another important factor not yet mentioned: because coding agents are so effective, we should expect CLI tools and MCP (Model Context Protocol) servers to proliferate rapidly as software becomes cheaper to generate.
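
As a taste of how little code such a tool takes, here’s a minimal sketch of an MCP server exposing one tool, using the FastMCP helper from the official Python MCP SDK (the exact API may vary between SDK versions, and the tool body is a hypothetical placeholder):

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("device-settings")

@mcp.tool()
def toggle_mobile_data(enabled: bool) -> str:
    """Turn mobile data on or off."""
    # Placeholder: a real server would call into the OS settings API here.
    return f"mobile data set to {'on' if enabled else 'off'}"

if __name__ == "__main__":
    mcp.run()  # serve the tool over stdio for an agent to call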
So what will happen?
I think over 90% of AI-agent computer tasks will be done via the terminal and MCP servers. GUI computer use will be there when needed, but it isn’t here to stay long term.
A contrast to robotics.
In robotics there’s a long-running argument over whether robots should be humanoid or not. A humanoid robot can use all our existing interfaces, while specialised robots can be far better at their specific tasks.
It’s an open question there because it costs a fair bit to build a unique robot for each task. In computer use there’s an analogous question: should we build a human-like computer-using agent, or provide agent-specific tools for each task and use those instead?
Since the cost of producing software is trending towards zero, I think it will make far more sense to generate agent-specific tools than to rely on “humanoid” computer-using agents.
