Back to ExploreModel Detail

UI-TARS 1.5 7B

Chat

ByteDance

bytedance/ui-tars-1.5-7b

UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, including desktop interfaces, web browsers, mobile systems, and games. Built by ByteDance, it builds upon the UI-TARS framework with reinforcement learning-based reasoning, enabling robust action planning and execution across virtual interfaces. This model achieves state-of-the-art results on a range of interactive and grounding benchmarks, including OSworld, WebVoyager, AndroidWorld, and ScreenSpot. It also demonstrates perfect task completion across diverse Poki games and outperforms prior models in Minecraft agent tasks. UI-TARS-1.5 supports thought decomposition during inference and shows strong scaling across variants, with the 1.5 version notably exceeding the performance of earlier 72B and 7B checkpoints.

4

credits / gen

Try this model
VisionFile SupportReasoning128K ContextVision (OR)

About this model

UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, including desktop interfaces, web browsers, mobile systems, and games. Built by ByteDance, it builds upon the UI-TARS framework with reinforcement learning-based reasoning, enabling robust action planning and execution across virtual interfaces. This model achieves state-of-the-art results on a range of interactive and grounding benchmarks, including OSworld, WebVoyager, AndroidWorld, and ScreenSpot. It also demonstrates perfect task completion across diverse Poki games and outperforms prior models in Minecraft agent tasks. UI-TARS-1.5 supports thought decomposition during inference and shows strong scaling across variants, with the 1.5 version notably exceeding the performance of earlier 72B and 7B checkpoints.

Technical Specifications

Provider

ByteDance

Type

Chat

Context Window

128,000 tokens

Pricing

4 credits

Knowledge Cutoff

2025-01-31

Supported Languages

en

Capabilities

Vision

Can process and understand images

File Support

Can read PDF, DOCX, XLSX & more

Reasoning

Chain-of-thought reasoning exposed

128K Context

Large context window for long documents

Vision (OR)

OpenRouter reports vision support