Anthropic Wants Its AI Agent to Control Your Computer

WIRED

It took a while for people to adjust to the idea of chatbots that seem to have minds of their own. The next leap into the unknown may involve trusting artificial intelligence to take over our computers, too.

Anthropic, a high-flying competitor to OpenAI, today announced that it has taught its AI model Claude to do a range of things on a computer, including searching the web, opening applications, and inputting text using the mouse and keyboard.

“I think we're going to enter into a new era where a model can use all of the tools that you use as a person to get tasks done,” says Jared Kaplan, chief science officer at Anthropic and an associate professor at Johns Hopkins University.

Kaplan showed WIRED a prerecorded demo in which an "agentic," or tool-using, version of Claude had been asked to help plan an outing to see the sunrise at the Golden Gate Bridge with a friend. In response to the prompt, Claude opened the Chrome web browser, looked up relevant information on Google, including the ideal viewing spot and the optimal time to be there, then used a calendar app to create an event to share with a friend. (It did not include further directions, such as the fastest route to get there.)

In a second demo, Claude was asked to build a simple website to promote itself. In a surreal moment, the model typed a prompt into its own web interface to generate the necessary code. It then used Visual Studio Code, a popular code editor developed by Microsoft, to write the site, and opened a terminal to spin up a web server to test it. The result was a decent, 1990s-themed landing page for the AI model. When the user asked it to fix a problem on the page, the model returned to the editor, identified the offending snippet of code, and deleted it.

Mike Krieger, chief product officer at Anthropic, says the company hopes that so-called AI agents will automate routine office tasks and free people up to be more productive in other areas. “What would you do if you got rid of a bunch of hours of copy and pasting or whatever you end up doing?” he says. “I'd go and play more guitar.”

Anthropic is making the agentic abilities available today through its application programming interface (API) for Claude 3.5 Sonnet, its most powerful multimodal large language model. The company also announced a new and improved version of a smaller model, Claude 3.5 Haiku.
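
For developers, the feature arrives as a new tool type in an otherwise ordinary API call. Below is a minimal sketch in Python using the tool type and beta flag Anthropic published at launch; the display size and prompt are illustrative assumptions, not values from the article.

```python
# A minimal sketch of enabling computer use via Anthropic's Python SDK.
# The tool type and beta flag follow the names published at launch;
# display dimensions and the prompt here are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",  # grants screen, mouse, and keyboard actions
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    }],
    messages=[{
        "role": "user",
        "content": "Find the best sunrise viewing spot near the Golden Gate Bridge.",
    }],
    betas=["computer-use-2024-10-22"],
)

print(response.content)  # may include tool_use blocks describing clicks and keystrokes
```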

Demos of AI agents can seem stunning, but getting the technology to perform reliably and without annoying, or costly, errors in real life can be a challenge. Current models can answer questions and converse with almost humanlike skill, and they are the backbone of chatbots such as OpenAI’s ChatGPT and Google’s Gemini. They can also perform tasks on computers when given a simple command, either by accessing the computer screen and input devices like a keyboard and trackpad or through low-level software interfaces.
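
In practice, that screen-and-keyboard access runs as a loop: the model requests an action, the calling program carries it out on the machine and reports the result (typically a fresh screenshot) back to the model, and the exchange repeats. Here is a schematic sketch of that loop; execute_action() is a hypothetical helper standing in for whatever actually controls the host machine, which real deployments would sandbox.

```python
# A schematic agent loop. execute_action() is a hypothetical helper that
# performs a requested click, keystroke, or screenshot on the host machine
# and returns the result; real deployments run this inside a sandboxed VM.
def run_agent(client, tools, task: str):
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=tools,
            messages=messages,
            betas=["computer-use-2024-10-22"],
        )
        tool_calls = [b for b in response.content if b.type == "tool_use"]
        if not tool_calls:
            return response  # no more actions requested; the task is done
        messages.append({"role": "assistant", "content": response.content})
        # Execute each requested action and feed the outcome back so the
        # model can see the new screen state and decide its next step.
        results = [
            {"type": "tool_result", "tool_use_id": call.id,
             "content": execute_action(call.input)}
            for call in tool_calls
        ]
        messages.append({"role": "user", "content": results})
```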

Anthropic says that Claude outperforms other AI agents on several key benchmarks, including SWE-bench, which measures an agent's software development skills, and OSWorld, which gauges an agent's capacity to use a computer operating system. The claims have yet to be independently verified. Anthropic says Claude performs tasks in OSWorld correctly 14.9 percent of the time. That is well below humans, who generally score around 75 percent, but considerably higher than the current best agents, including OpenAI’s GPT-4, which succeed roughly 7.7 percent of the time.

Anthropic says that several companies are already testing the agentic version of Claude, including Canva, which is using it to automate design and editing tasks, and Replit, which uses the model for coding chores. Other early users include The Browser Company, Asana, and Notion.

Ofir Press, a postdoctoral researcher at Princeton University who helped develop SWE-bench, says that agentic AI tends to lack the ability to plan far ahead and often struggles to recover from errors. “In order to show them to be useful we must obtain strong performance on tough and realistic benchmarks,” he says, like reliably planning a wide range of trips for a user and booking all the necessary tickets.

Kaplan notes that Claude can already troubleshoot some errors surprisingly well. When faced with a terminal error when trying to start a web server, for instance, the model knew how to revise its command to fix it. It also worked out that it had to enable popups when it ran into a dead end browsing the web.

Many tech companies are now racing to develop AI agents as they chase market share and prominence. In fact, it may not be long before many users have agents at their fingertips. Microsoft, which has poured upwards of $13 billion into OpenAI, says it is testing agents that can use Windows computers. Amazon, which has invested heavily in Anthropic, is exploring how agents could recommend and eventually buy goods for its customers.

Sonya Huang, a partner at the venture firm Sequoia who focuses on AI companies, says that for all the excitement around AI agents, most companies are really just rebranding AI-powered tools. Speaking to WIRED ahead of the Anthropic news, she says the technology currently works best when applied in narrow domains such as coding-related work. “You need to choose problem spaces where if the model fails, that's okay,” she says. “Those are the problem spaces where truly agent native companies will arise.”

A key challenge with agentic AI is that errors can be far more problematic than a garbled chatbot reply. Anthropic has imposed certain constraints on what Claude can do, for example limiting its ability to use a person’s credit card to buy stuff.

If errors can be avoided well enough, says Press of Princeton University, users may learn to see AI—and computers—in a completely new way. “I'm super excited about this new era,” he says.
