Anthropic released two new models: Claude 3.5 Haiku and an upgraded version of Claude 3.5 Sonnet. The company also released a new feature for Claude 3.5 Sonnet that allows the model to interact with a computer's GUI the same way a human user does.
Claude 3.5 Haiku, the company's fastest model, outperforms larger models such as GPT-4o and the previous version of Claude 3.5 Sonnet on the SWE-bench Verified coding benchmark. The upgraded Claude 3.5 Sonnet performs even better on that benchmark, scoring "higher than all publicly available models" according to Anthropic. The model also supports a new feature, computer use, which allows it to interact with a computer by interpreting the images on the screen, moving the mouse pointer, clicking buttons, and entering text via a virtual keyboard. This allows the model to interact with virtually any program, not just ones that expose an API. According to Anthropic,
Computer use is a completely different approach to AI development. Up until now, LLM developers have made tools fit the model, producing custom environments where AIs use specially-designed tools to complete various tasks. Now, we can make the model fit the tools—Claude can fit into the computer environments we all use every day. Our goal is for Claude to take pre-existing pieces of computer software and simply use them as a person would.
The computer use feature relies on Claude's ability to interpret images. Anthropic describes it as "taking screenshots and piecing them together." One key advancement was training the model to accurately count pixels; many LLMs struggle with similar tasks such as counting the number of letters in a word. Without this skill, the model would be unable to move the computer mouse to the proper place.
Claude currently has the top spot on the OSWorld benchmark leaderboard, which tracks the ability of AI agents to interact with computers. While humans typically score higher than 70% on this benchmark, Claude's best score is 14.9%. However, GPT-4, "the next-best AI model in the same category" according to Anthropic, scores only 7.7%.
Users on Hacker News discussed the computer use feature, pointing out its potential for automating a wide range of common business processes.
This is actually a huge deal. As someone building AI SaaS products, I used to have the position that directly integrating with APIs is going to get us most of the way there in terms of complete AI automation...I started to realize that pretty much most of the real world runs on software that directly interfaces with people, without clearly defined public APIs you can integrate into...I am glad they did this, since it is a powerful connector to these types of real-world business use cases that are super-hairy, and hence very worthwhile in automating.
Anthropic notes that the feature still "remains slow and often error-prone." Alex Albert, the company's head of Claude relations, posted on X:
It's not perfect yet. The model struggles at times with basic computer actions which can lead to some amusing moments. While filming demos, Claude accidentally stopped a long-running screen recording, causing all footage to be lost. Later, Claude took a break from the coding demo and began to browse photos of Yellowstone National Park.
The computer use feature is currently in public beta. Anthropic also released example code on GitHub demonstrating how to use the feature.
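In the beta, computer use is exposed to developers as a tool definition passed to the Messages API along with a beta flag. The sketch below constructs such a request payload based on the tool schema Anthropic published at launch; treat the exact field names and version strings as a snapshot of the beta, since they may change, and note that the `computer_tool` helper is this sketch's own convenience function.

```python
# Sketch of a request payload for the computer-use beta, based on the tool
# schema Anthropic published at launch (subject to change while in beta).

def computer_tool(width: int, height: int, display: int = 1) -> dict:
    """Describe the virtual screen the model will control."""
    return {
        "type": "computer_20241022",   # versioned beta tool identifier
        "name": "computer",
        "display_width_px": width,     # screenshot resolution the model sees
        "display_height_px": height,
        "display_number": display,     # X11 display to target
    }

request = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "tools": [computer_tool(1024, 768)],
    "messages": [
        {"role": "user", "content": "Open the calculator and add 2 and 2."}
    ],
}

# With the official Python SDK, this payload would be sent roughly as
# client.beta.messages.create(**request, betas=["computer-use-2024-10-22"]),
# and the response would contain tool-use blocks (clicks, keystrokes,
# screenshot requests) for the caller's harness to execute.
print(request["tools"][0]["type"])  # → computer_20241022
```

The model never touches the machine directly: the caller's own harness takes the screenshots and performs the clicks, which is why Anthropic's GitHub examples ship with a sandboxed virtual display.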