OpenAI released a research preview of Operator, an AI agent that can use a web browser to perform tasks on a user's behalf. Operator achieves new state-of-the-art performance on the WebArena and WebVoyager benchmarks.
To build Operator, OpenAI developed a new model called Computer-Using Agent (CUA), which is derived from GPT-4o. CUA relies on GPT-4o's vision ability to understand the contents of a browser screen, and it is further trained to interact with GUI elements like buttons and menus. To perform a task, it loops through perception, reasoning, and action steps until the task is complete. OpenAI has built in several safety guardrails: for example, Operator requires the user to take over when entering passwords, and it refuses some high-risk tasks such as banking transactions. According to OpenAI:
We have made significant progress in deep reasoning through the o-model series, vision capabilities through GPT-4o, and new techniques to improve robustness through reinforcement learning and instruction hierarchy. The next challenge space we plan to explore is expanding the action space of agents. The flexibility offered by a universal interface addresses this challenge, enabling an agent that can navigate any software tool designed for humans. By moving beyond specialized agent-friendly APIs, CUA can adapt to whatever computer environment is available—truly addressing the "long tail" of digital use cases that remain out of reach for most AI.
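The perception, reasoning, and action loop that CUA follows can be illustrated with a small sketch. This is not OpenAI's implementation: every name here (Action, Step, run_agent, FakeBrowser, toy_policy) is hypothetical, screens are plain strings rather than screenshots, and the "reasoning" step is a hand-written rule rather than a model call. It only shows the shape of the loop, including a hand-off to the user when a password field appears.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str          # "click", "type", "done", or "handoff" (hypothetical set)
    target: str = ""
    text: str = ""


@dataclass
class Step:
    screenshot: str
    action: Action


def run_agent(
    task: str,
    perceive: Callable[[], str],
    reason: Callable[[str, str, List[Step]], Action],
    act: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Step]:
    """Loop: look at the screen, pick an action, apply it, repeat."""
    history: List[Step] = []
    for _ in range(max_steps):
        screenshot = perceive()                      # perception
        action = reason(task, screenshot, history)   # reasoning
        history.append(Step(screenshot, action))
        if action.kind in ("done", "handoff"):
            break                                    # finished, or user takes over
        act(action)                                  # action
    return history


class FakeBrowser:
    """Stand-in for a real browser; screens are strings, not pixels."""

    def __init__(self) -> None:
        self.screen = "search page"
        self.log: List[str] = []

    def screenshot(self) -> str:
        return self.screen

    def apply(self, action: Action) -> None:
        self.log.append(action.kind)
        if action.kind == "type":
            self.screen = "results page"
        elif action.kind == "click":
            self.screen = "login page with password field"


def toy_policy(task: str, screenshot: str, history: List[Step]) -> Action:
    """Hand-written rules standing in for the model's reasoning step."""
    if "password" in screenshot:
        # Assumed guardrail: never type credentials, return control to the user.
        return Action("handoff", target="password field")
    if screenshot == "search page":
        return Action("type", target="search box", text=task)
    if screenshot == "results page":
        return Action("click", target="first result")
    return Action("done")


if __name__ == "__main__":
    browser = FakeBrowser()
    steps = run_agent("book a table", browser.screenshot, toy_policy, browser.apply)
    for step in steps:
        print(step.screenshot, "->", step.action.kind)
```

In this toy run the agent types a query, clicks a result, and then stops with a hand-off as soon as the login screen appears, so the password step is never executed by the agent itself.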
In late 2024, InfoQ covered Anthropic's release of the Computer Use feature, which allows its Claude model to interact with a computer by interpreting the images on the screen, moving the mouse pointer, clicking buttons, and entering text via a virtual keyboard. Claude set records on several OS and web use benchmarks, but Operator outperforms it on WebArena, WebVoyager, and OSWorld. However, Operator still falls short of human performance on these tasks: for example, it scores 38.1% on OSWorld vs. over 70% for humans.
CUA Benchmark Scores. Image Source: OpenAI's CUA Report
Because Operator can take actions on websites, OpenAI added several safety measures beyond those already built into GPT-4o. Particularly important are the safeguards against adversarial attacks by malicious websites, including prompt injection and phishing. OpenAI used red teams to test the safeguards and claims that its mitigation against prompt injection worked in "all but one case."
AI researcher and entrepreneur Andrej Karpathy wrote about Operator on X:
Projects like OpenAI’s Operator are to the digital world as humanoid robots are to the physical world. One general setting (monitor keyboard and mouse, or human body) that can in principle gradually perform arbitrarily general tasks, via an I/O interface originally designed for humans. In both cases, it leads to a gradually mixed-autonomy world, where humans become high-level supervisors of low-level automation. A bit like a driver monitoring the Autopilot. This will happen faster in the digital world than in the physical world because flipping bits is somewhere around 1000X less expensive than moving atoms. Though the market size and opportunity feels a lot bigger in the physical world.
Operator is only available via the web for ChatGPT Pro users. OpenAI intends to expand this to other paid ChatGPT plans "once we are confident in its safety and usability at scale," and to make the underlying CUA model available via API.