Since Anthropic released the “Computer Use” feature for Claude in October, there has been a lot of excitement about what AI agents can do when given the power to imitate human interactions. A new study by Show Lab at the National University of Singapore provides an overview of what we can expect from the current generation of graphical user interface (GUI) agents.
Claude is the first frontier model that can interact as a GUI agent with a device through the same interfaces humans use. The model sees only desktop screenshots and interacts by triggering keyboard and mouse actions. The feature promises to let users automate tasks through simple instructions, without needing API access to applications.
The researchers tested Claude on a variety of tasks including web search, workflow completion, office productivity and video games. Web search tasks involve navigating and interacting with websites, such as searching for and purchasing items or subscribing to news services. Workflow tasks involve multi-application interactions, such as extracting information from a website and inserting it into a spreadsheet. Office productivity tasks test the agent’s ability to perform common operations such as formatting documents, sending emails and creating presentations. The video game tasks evaluate the agent’s ability to perform multi-step tasks that require understanding the logic of the game and planning actions.
Each task tests the model’s ability across three dimensions: planning, action and critic. First, the model must come up with a coherent plan to accomplish the task. It must then be able to carry out the plan by translating each step into an action, such as opening a browser, clicking on elements and typing text. Finally, the critic element determines whether the model can evaluate its progress and success in accomplishing the task. The model should be able to recognize when it has made errors along the way and correct course. And if the task is not possible, it should give a logical explanation. The researchers built a framework around these three components, and human reviewers examined and rated the results of every test.
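To make the three-part review concrete, here is a minimal Python sketch of how a human reviewer's ratings might be recorded and tallied. It is an illustration only: the names `TaskReview`, `DIMENSIONS` and `summarize`, the pass/fail scheme and the placeholder entries are all assumptions, not the researchers' actual framework or data.

```python
from dataclasses import dataclass

# The three dimensions named in the study; the tuple itself is a hypothetical helper.
DIMENSIONS = ("planning", "action", "critic")


@dataclass
class TaskReview:
    """One human reviewer's verdict on a single task (hypothetical schema)."""
    task: str
    category: str
    planning: bool  # did the model produce a coherent plan?
    action: bool    # did it carry out each step correctly?
    critic: bool    # did it judge its own progress and outcome correctly?

    def passed(self) -> bool:
        # Assumption: a task counts as a pass only if all three dimensions pass.
        return all(getattr(self, d) for d in DIMENSIONS)


def summarize(reviews):
    """Count, per dimension, how many reviewed tasks were rated acceptable."""
    totals = {d: 0 for d in DIMENSIONS}
    for review in reviews:
        for d in DIMENSIONS:
            totals[d] += int(getattr(review, d))
    return totals


# Placeholder entries for illustration; these are not results from the study.
reviews = [
    TaskReview("placeholder task A", "web search", True, True, True),
    TaskReview("placeholder task B", "office productivity", True, False, False),
]

print(summarize(reviews))
print([r.passed() for r in reviews])
```

Keeping the three ratings separate, rather than recording a single pass or fail, shows where a failed task went wrong: in the plan, in an individual action, or in the model's assessment of its own work.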
In general, Claude did a great job of carrying out complex tasks. It was able …