ByteDance’s UI-TARS can take over your computer, outperforms GPT-4o and Claude

Jan 22, 2025 | Technology

A new AI agent has emerged from the parent company of TikTok to take control of your computer and perform complex workflows.

Much like Anthropic’s Computer Use, ByteDance’s new UI-TARS understands graphical user interfaces (GUIs), applies reasoning and takes autonomous, step-by-step action. 

Trained on roughly 50B tokens and offered in 7B and 72B parameter versions, the PC/macOS agent achieves state-of-the-art (SOTA) performance on more than 10 GUI benchmarks spanning perception, grounding and overall agent capabilities, consistently beating out OpenAI’s GPT-4o, Anthropic’s Claude and Google’s Gemini.

“Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention,” researchers from ByteDance and Tsinghua University write in a new research paper. 

How UI-TARS explains its thinking

UI-TARS works across desktop, mobile and web applications, using multimodal inputs (text, images, interactions) to understand visual environments.
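
Neither the article nor the demos show how the model is invoked, but agents of this kind are commonly driven by pairing a screenshot with a text instruction. Below is a minimal sketch of that pattern using the Hugging Face transformers library; the model ID, prompt format and output handling are illustrative assumptions, not ByteDance’s documented interface.

```python
# A minimal sketch of feeding a multimodal GUI agent a screenshot plus an
# instruction via Hugging Face transformers. The model ID and prompt format
# are illustrative assumptions, not ByteDance's documented interface.
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "bytedance/ui-tars-7b"  # hypothetical model ID

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, device_map="auto")

screenshot = Image.open("desktop.png")  # current visual state of the GUI
instruction = "Open the Extensions view in VS Code."

# Combine the visual observation with the task instruction.
inputs = processor(images=screenshot, text=instruction, return_tensors="pt")
inputs = inputs.to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# The decoded text would carry the model's reasoning and its proposed next
# action, which a separate executor then performs on the real desktop.
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```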

Its UI features two tabs — one to the left showing its step-by-step “thinking,” and a larger one to the right where it pulls up files, websites and apps and automatically takes action. 

For example, in a demo video released today, the model is prompted to “Find round trip flights from SEA to NYC on the 5th and return on the 10th next month and filter by price in ascending order.”

In response, UI-TARS navigates to the Delta Air Lines website, fills in the “from” and “to” fields, clicks on the relevant dates, and sorts and filters the results by price, explaining each step in its thinking box before taking action. 
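
Seen abstractly, a run like that reduces to an ordered sequence of thought/action pairs. The snippet below is a hand-written illustration of that structure; the field names and action vocabulary are assumptions, not UI-TARS’s actual output schema.

```python
# Hand-written illustration of a thought/action trace for the flight demo.
# Field names and the action vocabulary are assumptions, not UI-TARS's schema.
flight_trace = [
    {"thought": "Open the airline's site to search for round-trip flights.",
     "action": {"type": "goto", "url": "https://www.delta.com"}},
    {"thought": "Fill in the origin and destination fields.",
     "action": {"type": "type", "target": "from_field", "text": "SEA"}},
    {"thought": "Select the outbound date on the calendar.",
     "action": {"type": "click", "target": "depart_date"}},
    {"thought": "Sort the results so the cheapest fares come first.",
     "action": {"type": "click", "target": "sort_price_ascending"}},
]
```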

In another scenario, it is instructed to install the autoDocstring extension in VS Code. Here is its step-by-step thinking as it completes that task (a sketch of the underlying loop follows the walkthrough): 

It first reports that “to proceed with installing the autoDocstring extension in VS Code, I first need to open the VS Code application.” 

Recognizing that the VS Code window has not fully loaded, the model notes that “it is necessary to wait for the application to finish initializing before proceeding with the installation of the autoDocstring extension. This ensures that all required functionalities are available for the next steps.” 

It then explains that it must access the Extensions view in VS Code, which can be done by clicking on a tab in the left sidebar. 

Running into a small glitch, it reasons: “The previous attempt to click on the ‘Extensions’ tab may not have been precise enough, so I will try clicking on it again to ensure the Extensions view opens successfully.”

Once in the Extensions view, it types in ‘autoDocstring’ before concluding: “Since the installation process has already been initiated, there is no further action required at this moment. I will wait for the installation to complete and confirm its success once the process is finished.”
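
The walkthrough above amounts to an observe-think-act loop: capture the screen, ask the model for its next step, execute it, and repeat, including explicit waits while applications load. Here is a minimal sketch of such a loop; query_agent is a hypothetical wrapper around the model, and the action format is an assumption rather than UI-TARS’s real interface.

```python
# A minimal sketch of the observe-think-act loop the walkthrough describes:
# screenshot the desktop, ask the model for the next step, execute it, repeat.
# query_agent and the action format are assumptions, not UI-TARS's interface.
import time

import pyautogui  # pip install pyautogui


def query_agent(instruction, screenshot):
    """Hypothetical model call (see the earlier transformers sketch). Expected
    to return the agent's reasoning string and a structured next action."""
    raise NotImplementedError  # placeholder: wire this to the model


def run_task(instruction: str, max_steps: int = 20) -> None:
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()  # observe the current GUI state
        thought, action = query_agent(instruction, screenshot)
        print("thinking:", thought)  # mirrors the left-hand "thinking" pane

        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])  # e.g. the Extensions tab
        elif action["type"] == "type":
            pyautogui.typewrite(action["text"])  # e.g. "autoDocstring"
        elif action["type"] == "wait":
            time.sleep(2)  # e.g. while VS Code finishes initializing
        elif action["type"] == "finished":
            break  # the model reports the task is complete
```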

Outperforming its rivals

Across a variety of benchmarks, researchers report that UI-TARS consistently outperformed OpenAI’s GPT-4o; Anthropic’s Claude-3.5-Sonnet; Google’s Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and numerous academic models.

For instance, on VisualWebBench, which measures a model’s ability to understand and ground web elements across tasks including webpage question answering (QA) and optical character recognition, UI-TARS 72B scored 82.8%, outperforming GPT-4o (78.5%) and Claude 3.5 (78.2%).  …
