Agent TARS & UI-TARS Desktop: ByteDance's Open-Source Multimodal AI Agent Stack

Introduction

ByteDance has open-sourced a multimodal AI agent stack, hosted at bytedance/UI-TARS-desktop on GitHub, where it has passed 36.8K stars. The repository delivers two complementary products: Agent TARS (a general-purpose AI agent with a CLI and a Web UI) and UI-TARS Desktop (a native desktop GUI agent). Unlike frameworks that wrap a general-purpose model such as GPT-4o in system prompts, UI-TARS is an end-to-end vision-language model that perceives and acts on screenshots directly. That distinction is the main reason the TARS stack deserves a close look.

What Is the TARS Stack?

The TARS stack consists of two separate but related projects, both built on top of the UI-TARS Vision-Language Model (VLM) detailed in the paper UI-TARS: Pioneering Automated GUI Interaction with Native Agents (arXiv 2501.12326).

Agent TARS is a general-purpose multimodal agent that runs from a CLI or a Web UI. It targets broader browser automation and tool-use tasks: browsing the web, extracting data, filling in forms, and orchestrating multiple tools from a single entry point.

UI-TARS Desktop, by contrast, is a native desktop application that acts as a GUI agent for your local computer. It takes screenshots of your screen, interprets them visually, and then controls your mouse and keyboard to perform actions. It is a genuine computer-use agent, similar in spirit to what projects like Claude Computer Use attempt, but built from the ground up as a vision-native model.

The core technology underpinning both products is the UI-TARS model itself — a pure vision-language model that achieves state-of-the-art results across 10+ GUI agent benchmarks. The key philosophical difference from wrapper-based approaches is that UI-TARS trains the model to understand GUI screenshots end-to-end, rather than relying on extracted DOM trees or accessibility APIs layered on top of a general-purpose LLM.

Agent TARS — The Multimodal Agent CLI

Agent TARS is the general agent component of the stack. You can get started with a single command:

npx @agent-tars/cli@latest

Or install it globally:

npm install @agent-tars/cli@latest -g

Once installed, you can run it with your preferred model provider:

agent-tars --provider volcengine --model doubao-1-5-thinking-vision-pro-250428 --apiKey your-api-key

Agent TARS introduces a Hybrid Browser Agent architecture that supports three strategies for interacting with web pages:

GUI Agent (vision-based grounding) — the model literally looks at screenshots of the page and decides where to click or what to type.
DOM Agent — the agent operates on the browser’s Document Object Model directly, parsing HTML structure.
Hybrid — a combination of both, using vision when it makes sense and falling back to DOM parsing when precision is needed.

This flexibility is important because real-world web automation is messy. Some tasks benefit from pure vision (e.g., clicking on a specific visual element without a reliable CSS selector), while others need DOM-level precision (e.g., form field interactions).
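
One way to picture the Hybrid strategy is as a per-action routing decision. The sketch below is illustrative only: the `Strategy` enum, the `Target` shape, and the rule "use the DOM when a trustworthy selector is known, otherwise look at the screenshot" are assumptions made for this example, not Agent TARS's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strategy(Enum):
    GUI = "gui"        # vision-based grounding on screenshots
    DOM = "dom"        # direct access to the Document Object Model
    HYBRID = "hybrid"  # choose per action


@dataclass
class Target:
    """A page element the agent wants to act on (hypothetical shape)."""
    description: str                     # e.g. "blue Submit button"
    css_selector: Optional[str] = None   # set only when a reliable selector is known


def pick_backend(strategy: Strategy, target: Target) -> str:
    """Return "dom" or "gui" for a single action."""
    if strategy is Strategy.DOM:
        return "dom"
    if strategy is Strategy.GUI:
        return "gui"
    # Hybrid (assumed heuristic): prefer the DOM when a selector exists,
    # otherwise ground the action visually.
    return "dom" if target.css_selector else "gui"


print(pick_backend(Strategy.HYBRID, Target("email field", "#email")))   # dom
print(pick_backend(Strategy.HYBRID, Target("close icon on the banner")))  # gui
```

A real agent would likely weigh more signals than the presence of a selector, but the shape of the decision is the same: one task, two interchangeable ways to reach an element.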

The CLI is powered by an Event Stream protocol that drives the Context Engineering layer and the Agent UI. This event-driven architecture allows the agent to stream intermediate results, tool calls, and reasoning steps in real time — giving you full visibility into what the agent is doing at every stage.
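
To make the event-driven idea concrete, here is a minimal sketch of an agent that yields events as it works, with a consumer that renders each one on arrival. The event type names (`thinking`, `tool_call`, `tool_result`, `final_answer`) and the `browser_navigate` tool name are hypothetical placeholders, not the actual Event Stream schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Event:
    type: str                                   # hypothetical event type name
    payload: Dict[str, Any] = field(default_factory=dict)


def run_agent(task: str) -> Iterator[Event]:
    """Stub agent: yields events step by step instead of returning once at the end."""
    yield Event("thinking", {"text": f"Plan steps for: {task}"})
    yield Event("tool_call", {"tool": "browser_navigate", "url": "https://example.com"})
    yield Event("tool_result", {"tool": "browser_navigate", "ok": True})
    yield Event("final_answer", {"text": "Done"})


def render(events: Iterator[Event]) -> List[str]:
    """A consumer (CLI, Web UI, or debugger) handles each event as soon as it arrives."""
    lines = []
    for event in events:
        lines.append(f"[{event.type}] {event.payload}")
    return lines


for line in render(run_agent("open the example page")):
    print(line)
```

Because the producer is a generator, any number of consumers can be written against the same stream without the agent knowing how its output is displayed.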

Version v0.3.0 (released November 2025) introduced several significant improvements:

Streaming support for multiple tools executed concurrently
Runtime settings with timing statistics for performance debugging
Event Stream Viewer for real-time debugging of agent behavior
Exclusive AIO Sandbox support for safe execution environments

Agent TARS was first announced as a public beta on June 25, 2025, and has been iterating rapidly ever since. It is written in TypeScript and is licensed under Apache 2.0.

UI-TARS Desktop — Native GUI Agent

While Agent TARS handles general-purpose browser automation, UI-TARS Desktop is the native desktop application that acts as a full GUI agent for your local computer. It runs natively on Windows and macOS, and it can also operate in a browser environment.

The desktop agent works by:

Taking periodic screenshots of your screen
Running visual recognition on those screenshots using the UI-TARS VLM
Translating the model’s output into precise mouse clicks, keystrokes, and gestures
Executing those actions directly on your operating system

This means you can ask UI-TARS Desktop to perform complex multi-step tasks on your computer — navigating through menus, filling in desktop application forms, dragging and dropping files — all through visual understanding alone.
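
The screenshot, predict, act cycle described above can be sketched as a simple loop. Everything here is a stand-in: `capture`, `predict`, and `execute` are stubs for the screen grabber, the UI-TARS model, and the input driver, and the assumption that the model emits coordinates normalized to [0, 1] is made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[int, int]


@dataclass
class Action:
    kind: str                                 # "click", "type", or "finished" in this sketch
    point: Tuple[float, float] = (0.0, 0.0)   # ASSUMPTION: normalized to [0, 1]
    text: str = ""


def to_pixels(point: Tuple[float, float], screen: Point) -> Point:
    """Map a normalized point to a pixel position, clamped to the screen."""
    width, height = screen
    x = min(max(int(point[0] * width), 0), width - 1)
    y = min(max(int(point[1] * height), 0), height - 1)
    return (x, y)


def run_task(
    instruction: str,
    capture: Callable[[], bytes],
    predict: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action, Point], None],
    screen: Point,
    max_steps: int = 20,
) -> List[Action]:
    """Repeat capture -> predict -> execute until the model says it is finished."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()
        action = predict(instruction, screenshot, history)
        if action.kind == "finished":
            break
        execute(action, to_pixels(action.point, screen))
        history.append(action)
    return history


# Stubs standing in for the real screen grabber, model, and input driver.
script = iter([Action("click", (0.5, 0.5)), Action("type", text="hello"), Action("finished")])
performed: List[Tuple[str, Point]] = []

run_task(
    "say hello",
    capture=lambda: b"fake-png-bytes",
    predict=lambda instruction, shot, history: next(script),
    execute=lambda action, pixel: performed.append((action.kind, pixel)),
    screen=(1920, 1080),
)
print(performed)  # [('click', (960, 540)), ('type', (0, 0))]
```

The `max_steps` cap is a deliberate safety choice in this sketch: an agent that controls a real mouse and keyboard should not be able to loop indefinitely.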

Version v0.2.0 (June 2025) was a major release that introduced the Remote Computer Operator and Remote Browser Operator. These features allow you to control remote machines or browsers without any configuration — and they are completely free.

Version v0.1.0 (April 2025) shipped the redesigned Agent UI, the initial browser operation features, and support for the UI-TARS-1.5 model.

The Desktop app uses both local and remote operators. Local operators interact with your own screen and peripherals directly, while remote operators connect to other machines via a lightweight protocol, enabling scenarios like managing servers, testing across environments, or providing remote assistance.

The Breakthrough: GUI Agent with Vision

The UI-TARS paper (arXiv 2501.12326) is where the technical depth lies. The model is an end-to-end vision-language model that perceives nothing but screenshots. This is a radical departure from most existing GUI agents that rely on accessibility trees, DOM snapshots, or HTML source code extracted from the page.

The results speak for themselves:

OSWorld benchmark: UI-TARS scores 24.6 (in the 50-step setting), outperforming Claude at 22.0
AndroidWorld benchmark: UI-TARS scores 46.6, beating GPT-4o at 34.5

The AndroidWorld gap is the striking one: 46.6 versus 34.5 is a 12.1-point lead, roughly 35% in relative terms, achieved with a purely vision-based approach. The OSWorld margin is narrower, at 2.6 points (about 12% relative).
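
The relative figures follow directly from the scores quoted above. A short check, using only those four numbers:

```python
def relative_gain(score: float, baseline: float) -> float:
    """Percentage improvement of score over baseline."""
    return (score - baseline) / baseline * 100.0


osworld = relative_gain(24.6, 22.0)
androidworld = relative_gain(46.6, 34.5)

print(f"OSWorld:      +{24.6 - 22.0:.1f} points, {osworld:.1f}% relative")       # +2.6 points, 11.8% relative
print(f"AndroidWorld: +{46.6 - 34.5:.1f} points, {androidworld:.1f}% relative")  # +12.1 points, 35.1% relative
```

Quoting both the absolute and the relative gap avoids overstating a result: the same 35% can describe a large or a small point difference depending on the baseline.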

The paper identifies four key innovations that drive this performance:

Enhanced Perception: The model was trained on a large-scale GUI screenshot dataset that includes diverse interfaces — from mobile apps to desktop software to complex web dashboards.
Unified Action Modeling: Instead of separate output heads for different action types, UI-TARS uses a unified action space that covers all GUI interactions.
System-2 Reasoning: The model incorporates deliberate reasoning — task decomposition, reflection, and error recovery — bringing a "think before you act" layer on top of visual perception.
Iterative Training with Reflective Online Traces: An iterative loop where the model performs actions in real environments, records successes and failures, and uses those traces to improve.
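
The idea behind a unified action space can be illustrated with a single parser that handles every action the same way, whatever platform it targets. The textual format and the action vocabulary below are hypothetical and chosen for readability; they are not claimed to match the exact output format the UI-TARS model uses.

```python
import re
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical shared vocabulary: one set of actions for web, desktop, and mobile.
ACTION_SPACE = {"click", "type", "scroll", "drag", "hotkey", "wait", "finished"}

_CALL = re.compile(r"^\s*(\w+)\((.*)\)\s*$", re.DOTALL)
_ARG = re.compile(r"(\w+)\s*=\s*'((?:[^'\\]|\\.)*)'")


@dataclass
class Action:
    name: str
    args: Dict[str, str] = field(default_factory=dict)


def parse_action(text: str) -> Action:
    """Parse strings such as click(point='(120,340)') into one Action type."""
    match = _CALL.match(text)
    if match is None:
        raise ValueError(f"not an action call: {text!r}")
    name, arg_text = match.groups()
    if name not in ACTION_SPACE:
        raise ValueError(f"unknown action: {name!r}")
    return Action(name, dict(_ARG.findall(arg_text)))


print(parse_action("click(point='(120,340)')"))
print(parse_action("type(content='hello world')"))
print(parse_action("finished()"))
```

With one vocabulary and one parser, adding a new platform means writing a new executor for the same `Action` values rather than teaching the model a new output format.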

Remote Computer & Browser Operator

One of the most practical features in the TARS stack is the Remote Operator capability introduced in v0.2.0. You can control any remote computer or browser — with no configuration required and at no cost.

The remote operator works by establishing a lightweight connection between the local agent and a remote machine. The remote machine’s screen is captured and sent to the local agent for processing, and the resulting actions are executed on the remote machine.

For enterprise deployments, UI-TARS can also be deployed on Volcano Engine, ByteDance’s cloud platform, giving teams the ability to run GUI agents at scale in the cloud.

MCP Integration — Connecting to Real Tools

The TARS stack is built on the Model Context Protocol (MCP), an open protocol that standardizes how AI agents interact with external tools and services. The kernel of both products is MCP-native, which means:

Built-in MCP agents — the stack ships with several ready-to-use MCP-compatible agents
MCP server filtering — you can select which MCP servers to expose to the agent
Tools filter — granular control over which specific tools the agent can invoke
Workspace config support — MCP configurations can be managed per workspace
str_replace_editor support — surgical text modifications through file editing
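
Server and tool filtering can be thought of as two include/exclude passes over the available names. The sketch below assumes glob-style patterns and an `include`/`exclude` config shape; both are illustrative choices, and the server and tool names are invented for the example rather than taken from the project's configuration schema.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Sequence


def passes(name: str, include: Sequence[str], exclude: Sequence[str]) -> bool:
    """True if name matches an include pattern (or include is empty) and no exclude pattern."""
    included = not include or any(fnmatchcase(name, p) for p in include)
    excluded = any(fnmatchcase(name, p) for p in exclude)
    return included and not excluded


def visible_tools(
    servers: Dict[str, List[str]],
    server_filter: Dict[str, List[str]],
    tool_filter: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Apply the server filter first, then the tool filter within each kept server."""
    result: Dict[str, List[str]] = {}
    for server, tools in servers.items():
        if not passes(server, server_filter.get("include", []), server_filter.get("exclude", [])):
            continue
        kept = [t for t in tools
                if passes(t, tool_filter.get("include", []), tool_filter.get("exclude", []))]
        if kept:
            result[server] = kept
    return result


# Hypothetical servers and tools.
servers = {
    "browser": ["browser_click", "browser_navigate"],
    "filesystem": ["read_file", "write_file"],
    "commands": ["run_command"],
}
exposed = visible_tools(
    servers,
    server_filter={"include": ["browser", "filesystem"]},
    tool_filter={"exclude": ["write_*"]},
)
print(exposed)  # {'browser': ['browser_click', 'browser_navigate'], 'filesystem': ['read_file']}
```

Filtering at both levels keeps the agent's tool list short, which reduces the chance of it invoking something it should not have access to.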

Model Flexibility — Choose Your Provider

The TARS stack is provider-agnostic. You can switch between model backends at runtime. Currently supported providers include Volcengine (Doubao-1.5, Seed series), Anthropic Claude (Claude 3.7 Sonnet+), and OpenAI (GPT-4o).

agent-tars --provider anthropic --model claude-3-7-sonnet-latest --apiKey your-api-key
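
When switching providers often, it can help to assemble the command programmatically and keep the API key out of scripts. The sketch below uses only the `--provider`, `--model`, and `--apiKey` flags shown in this article; the `AGENT_TARS_API_KEY` environment variable name and the `DEFAULT_MODELS` table are assumptions for this example, not something the CLI defines.

```python
import os
from typing import List, Mapping

# Provider and model identifiers taken from the commands shown in this article.
DEFAULT_MODELS = {
    "volcengine": "doubao-1-5-thinking-vision-pro-250428",
    "anthropic": "claude-3-7-sonnet-latest",
}


def build_command(
    provider: str,
    env: Mapping[str, str] = os.environ,
    key_var: str = "AGENT_TARS_API_KEY",   # hypothetical variable name
) -> List[str]:
    """Return the argv list for launching agent-tars with the given provider."""
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"no default model configured for provider {provider!r}")
    api_key = env.get(key_var, "")
    if not api_key:
        raise RuntimeError(f"set {key_var} before launching the agent")
    return [
        "agent-tars",
        "--provider", provider,
        "--model", DEFAULT_MODELS[provider],
        "--apiKey", api_key,
    ]


# Demonstration with a fake environment; nothing is executed.
print(build_command("anthropic", env={"AGENT_TARS_API_KEY": "your-api-key"}))
```

The returned list can be passed to `subprocess.run(..., check=True)` to launch the agent. Note that the key still appears in the process arguments, so this keeps it out of source files and shell history but not out of process listings.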

Getting Started

Agent TARS: You need Node.js 22 or later. Run it with npx for a one-shot session, or install it globally:

# One-shot
npx @agent-tars/cli@latest

# Global install
npm install @agent-tars/cli@latest -g
agent-tars
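
If the CLI fails to start, the Node.js version is the first thing to check. A minimal pre-flight sketch, assuming only that `node --version` prints a string such as `v22.3.0`; the helper names are made up for this example.

```python
import shutil
import subprocess


def node_major(version_text: str) -> int:
    """Extract the major version from output like 'v22.3.0'."""
    return int(version_text.strip().lstrip("v").split(".")[0])


def node_is_supported(minimum: int = 22) -> bool:
    """Return True if a Node.js binary on PATH meets the minimum major version."""
    node = shutil.which("node")
    if node is None:
        return False
    output = subprocess.run(
        [node, "--version"], capture_output=True, text=True, check=False
    ).stdout
    try:
        return node_major(output) >= minimum
    except ValueError:
        return False


print("Node.js 22+ available:", node_is_supported())
```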

UI-TARS Desktop: The desktop app has its own native installer for macOS and Windows, downloadable from the GitHub releases page — no Node.js setup needed.

Both projects are open source under the Apache 2.0 license.

Conclusion

ByteDance’s TARS stack represents a genuinely different approach to AI agents. Instead of bolting vision onto a text-only LLM, the team built a model that is natively visual — it sees screenshots the way we see computer screens, and it acts on that visual understanding directly.

The results on the OSWorld and AndroidWorld benchmarks support this approach, but the real test is day-to-day use. Early interest from the developer community, reflected in 36.8K GitHub stars and counting, suggests the stack is finding an audience.

Whether you are a developer looking for robust browser automation, a QA engineer wanting to automate desktop testing, or a researcher interested in vision-language models for GUI interaction, the TARS stack is worth your time. It is open source, backed by solid research, and actively developed by one of the most ambitious AI labs in the world.