
GroundCUA: Grounding Computer Use Agents on Human Demonstrations

  🌐 Website | 📑 Paper | 🤗 Dataset | 🤖 Models

GroundCUA Overview

Authors

Aarash Feizi1,2,4*, Shravan Nayak1,3*,
Xiangru Jian5, Kevin Qinghong Lin6, Kaixin Li6, Rabiul Awal1,3,4, Xing Han Lù1,2, Johan Obando-Ceron1,3, Juan A. Rodriguez1,8, Nicolas Chapados4, David Vazquez4, Adriana Romero-Soriano1,2, Reihaneh Rabbany1,2,
Perouz Taslakian4, Christopher Pal4, Spandana Gella4, Sai Rajeswar4,1,3

1Mila - Quebec AI Institute, 2McGill University, 3Université de Montréal,
4ServiceNow Research, 5University of Waterloo, 6National University of Singapore,
7Polytechnique Montréal, 8École de Technologie Supérieure, 9CIFAR AI Chair

*Equal contribution

Introduction

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. We address this gap through:

  • GroundCUA Dataset: A large-scale, human-annotated desktop grounding dataset of 56K screenshots drawn from over 10,000 real-world human tasks across 87 applications, with 3.56M+ human-verified annotations
  • GroundNext Models: Vision-language models at 3B and 7B scales achieving state-of-the-art results across five benchmarks
  • Efficient Training: SOTA performance using one-tenth the training data of prior work

Key Features

🎯 High-Quality Desktop Dataset

  • Expert-annotated screenshots with maximum annotation density
  • Coverage of almost every visible element, including small icons and controls
  • Fine-grained category information (menus, sidebars, etc.) for 50% of UI elements, fully open-source

⚡ Efficient Model Training

  • State-of-the-art performance with 700K datapoints vs 9M+ in prior work
  • Two-stage training: supervised fine-tuning + reinforcement learning with fully open-source code
  • Models at 3B and 7B scales for efficiency and accuracy

🌐 Cross-Platform Generalization

  • Comprehensive evaluation on five challenging benchmarks
  • Robust generalization across desktop, mobile, and web environments despite training only on desktop data

🚀 Quick Start

Installation & Setup

To install from PyPI (recommended):

# Create and activate environment
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip  # optional

# Install PyTorch (adjust for CUDA version) and Flash Attention (for faster inference)
pip install torch torchvision
pip install flash-attn --no-build-isolation

# Install GroundCUA package for utilities
pip install groundcua            # basic dependencies
pip install "groundcua[all]"     # full dependencies (optional)

Alternative: Install from Source

# Create and activate environment
conda create -n groundcua python=3.10 -y
conda activate groundcua

pip install --upgrade pip

# Clone repository
git clone https://github.com/ServiceNow/GroundCUA.git
cd GroundCUA

# Install PyTorch (adjust for your CUDA version)
pip install torch torchvision

# Install Flash Attention (recommended for faster inference)
pip install flash-attn --no-build-isolation

# Install project dependencies
pip install -r requirements.txt
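
Either installation route can be sanity-checked with a quick import. This one-liner (a minimal check, not a documented command of the package) only verifies that PyTorch and the groundcua package are importable:

# Verify the environment
python -c "import torch, groundcua; print(torch.__version__, groundcua.__file__)"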

Quick GroundNext Model Inference

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from PIL import Image
import groundcua
import io
from urllib.request import urlopen

model_name = "ServiceNow/GroundNext-7B-V0"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Configure generation
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE
model.generation_config.do_sample = False
model.generation_config.use_cache = True

# Load and prepare image
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png"
image = Image.open(io.BytesIO(urlopen(url).read()))
image, (width, height) = groundcua.prepare_image(image)

# Create messages and generate
instruction = "Click on the 'File' button"
messages = groundcua.create_messages(instruction, image, width, height)

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
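
The response is a <tool_call> string in the format shown in the comment above. The following parsing sketch is ours (the helper name and regular expression are not part of the groundcua package); it recovers the action and pixel coordinate from that string:

import json
import re

def parse_tool_call(response: str):
    """Extract the action name and (x, y) coordinate from a <tool_call> response."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    if match is None:
        raise ValueError(f"No <tool_call> block found in: {response!r}")
    payload = json.loads(match.group(1))  # {"name": "computer_use", "arguments": {...}}
    arguments = payload["arguments"]
    return arguments["action"], tuple(arguments["coordinate"])

action, (x, y) = parse_tool_call(response)
print(action, x, y)  # e.g. left_click 123 456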

Updates

Performance

Desktop Grounding Benchmarks

| Model | ScreenSpot-Pro | OSWorld-G | UI-Vision | Avg |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 29.7 | 42.7 | 16.5 | 29.6 |
| UI-TARS-72B | 38.1 | 57.1 | 25.5 | 40.2 |
| GroundNext-3B | 49.8 | 64.2 | 62.1 | 58.7 |
| GroundNext-7B | 52.9 | 67.7 | 60.3 | 60.3 |

Cross-Platform Generalization

| Model | MMBench-GUI | ScreenSpot-v2 | Avg |
|---|---|---|---|
| Qwen2.5-VL-7B | 33.9 | 88.8 | 61.4 |
| UI-TARS-72B | 74.3 | 90.3 | 82.3 |
| GroundNext-3B | 77.1 | 88.5 | 82.8 |
| GroundNext-7B | 81.1 | 90.4 | 85.8 |

Performance numbers demonstrate strong cross-domain (desktop, mobile and web) generalization despite training only on desktop data.

Agentic Performance on OSWorld

GroundNext models also demonstrate strong agentic capabilities when integrated with reasoning models. When combined with OpenAI o3, GroundNext-3B achieves competitive performance on OSWorld, matching or exceeding much larger models.

| Model | OS | Office | Daily | Pro | Workflow | Overall |
|---|---|---|---|---|---|---|
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-7B | 41.7 | 22.5 | 35.4 | 46.3 | 9.8 | 26.5 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | 61.9 | 75.5 | 35.3 | 51.0 |
| GroundNext-3B w/ o3 (ours) | 62.5 | 47.0 | 55.0 | 73.5 | 36.5 | 50.6 |

Task categories: OS (operating system tasks), Office (productivity applications), Daily (common user tasks), Pro (professional software), Workflow (multi-step workflows).

Key Results

  • Data Efficiency: Achieves SOTA with only 700K training examples vs 9M+ in prior work
  • Cross-Domain Excellence: Strong performance across desktop, mobile, and web despite desktop-only training
  • Fine-Grained Grounding: Superior performance on small UI elements and complex workflows

🎓 Training

🚧 Coming Soon: We are currently refining the training documentation and code. Complete training instructions, including supervised fine-tuning and reinforcement learning recipes, will be released in the training/ folder soon. Stay tuned!

Dataset

GroundCUA Dataset Overview

GroundCUA is a large-scale, human-annotated desktop grounding dataset with dense supervision:

  • 📊 Scale: 56K annotated screenshots, 3.56M element annotations
  • 🎯 Density: Maximum annotation density covering almost every visible UI element
  • ✅ Quality: Human-verified annotations from trained experts
  • 🖥️ Coverage: 87 desktop applications across 12 categories
  • 📏 Resolution: High-resolution images (500K to 7M pixels)
  • 🏷️ Categories: Fine-grained category information for 50% of elements

Dataset Access

Download the GroundCUA dataset:

pip install -U huggingface_hub
huggingface-cli download ServiceNow/GroundCUA --repo-type dataset --local-dir ./GroundCUA
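
The same download can also be scripted from Python through huggingface_hub; a minimal sketch (the local directory path is arbitrary):

from huggingface_hub import snapshot_download

# Fetch the full dataset repository (images and annotations) into a local folder
snapshot_download(
    repo_id="ServiceNow/GroundCUA",
    repo_type="dataset",
    local_dir="./GroundCUA",
)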

📊 Evaluation

Our evaluation framework builds upon InfiGUI-G1 and provides comprehensive evaluation across multiple benchmarks.

Supported Benchmarks

  • ScreenSpot-Pro: Desktop element grounding
  • ScreenSpot-v2: Web and mobile interface grounding
  • MMBench-GUI: GUI understanding tasks
  • OSWorld-G: Operating system grounding
  • UI-Vision: Diverse desktop application grounding

Running Evaluations

cd eval/

# Evaluate on specific benchmark
python eval.py \
    --model_type qwen25vl \
    --model_name_or_path /path/to/trained/model \
    --benchmark screenspot \
    --data_path /path/to/benchmark/data \
    --output_dir results/

# Evaluate on all benchmarks
python eval.py \
    --model_type qwen25vl \
    --model_name_or_path /path/to/trained/model \
    --benchmark all \
    --task all \
    --language en

Evaluation Metrics

  • Accuracy: Precision of GUI element localization (a point-in-box check, sketched after this list)
  • Success Rate: Percentage of correctly grounded elements
  • Cross-Domain Performance: Generalization to unseen platforms
  • Fine-Grained Performance: Accuracy on small UI elements
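
Both Accuracy and Success Rate reduce to checking whether a predicted click point lands inside the target element's bounding box. A minimal sketch of that check, assuming an [x1, y1, x2, y2] box format (the function names are illustrative, not the evaluation framework's API):

def point_in_box(point, box):
    """True if a predicted (x, y) click lands inside box = [x1, y1, x2, y2]."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(predicted_points, target_boxes):
    """Fraction of predictions that hit their ground-truth element box."""
    hits = sum(point_in_box(p, b) for p, b in zip(predicted_points, target_boxes))
    return hits / len(predicted_points) if predicted_points else 0.0

# Example: the first click lands inside its box, the second misses
print(grounding_accuracy([(10, 10), (300, 40)], [[0, 0, 20, 20], [0, 0, 100, 100]]))  # 0.5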

Project Structure

GroundCUA/
├── README.md                   # This file
├── pyproject.toml              # Package configuration
├── PUBLISHING.md               # Guide for publishing to PyPI
├── assets/                     # Images and resources
├── groundcua/                  # Main package (pip installable)
│   ├── __init__.py             # Package initialization and utilities
│   └── version.py              # Version information
├── eval/                       # Evaluation framework
│   ├── eval.py                 # Main evaluation script
│   ├── data.py                 # Data loading utilities
│   ├── prompts.py              # Prompt processing
│   └── models/                 # Model implementations
└── training/                   # Training pipeline (documentation coming soon)

Acknowledgements

We thank the following projects and teams for their contributions to the open-source community:

  • InfiGUI-G1 for the evaluation framework foundation
  • LLaMA-Factory for the excellent SFT training framework
  • verl for the robust RL infrastructure
  • Qwen-2.5-VL for the foundation vision-language models
  • OpenCUA for repository design inspiration
  • The computer use and GUI automation research community

Research Use and Disclaimer

GroundCUA is intended for research and educational purposes only.

Prohibited Uses

  • The model, dataset, and code may not be used for any purpose that violates applicable laws or regulations
  • Use for illegal, unethical, or harmful activities is strictly prohibited

Disclaimer

  • The authors and contributors are not responsible for any illegal, unethical, or harmful use
  • Users are solely responsible for ensuring compliance with applicable laws and regulations

Citation

If you use GroundCUA in your research, please cite our work:

@misc{feizi2025groundingcomputeruseagents,
      title={Grounding Computer Use Agents on Human Demonstrations}, 
      author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
      year={2025},
      eprint={2511.07332},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.07332}, 
}
