🌐 Website  |  📄 Paper  |  🤗 Dataset  |  🤗 Models
Aarash Feizi1,2,4*, Shravan Nayak1,3*,
Xiangru Jian5, Kevin Qinghong Lin6, Kaixin Li6,
Rabiul Awal1,3,4, Xing Han Lù1,2, Johan Obando-Ceron1,3, Juan A. Rodriguez1,8,
Nicolas Chapados4, David Vazquez4, Adriana Romero-Soriano1,2, Reihaneh Rabbany1,2,
Perouz Taslakian4, Christopher Pal4, Spandana Gella4, Sai Rajeswar4,1,3
1Mila - Quebec AI Institute, 2McGill University, 3Université de Montréal,
4ServiceNow Research, 5University of Waterloo, 6National University of Singapore,
7Polytechnique Montréal, 8École de Technologie Supérieure, 9CIFAR AI Chair
*Equal contribution
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. We address this gap through:
- GroundCUA Dataset: A large-scale, human-annotated desktop grounding dataset with 56K screenshots from over 10,000 real-world human tasks across 87 applications and 3.56M+ human-verified annotations
- GroundNext Models: Vision-language models at 3B and 7B scales achieving state-of-the-art results across five benchmarks
- Efficient Training: SOTA performance using one-tenth the training data of prior work
🎯 High-Quality Desktop Dataset
- Densely annotated screenshots, labeled and verified by trained experts
- Coverage of almost every visible element, including small icons and controls
- Fine-grained category information (menus, sidebars, etc.) for 50% of UI elements, fully open-source!
⚡ Efficient Model Training
- State-of-the-art performance with 700K datapoints vs 9M+ in prior work
- Two-stage training: supervised fine-tuning + reinforcement learning with fully open-source code
- Models at 3B and 7B scales for efficiency and accuracy
🌍 Cross-Platform Generalization
- Comprehensive evaluation on five challenging benchmarks
- Robust generalization across desktop, mobile, and web environments despite training only on desktop data
To install from PyPI (recommended):
# Create and activate environment
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip # optional
# Install PyTorch (adjust for CUDA version) and Flash Attention (for faster inference)
pip install torch torchvision
pip install flash-attn --no-build-isolation
# Install GroundCUA package for utilities
pip install groundcua # basic dependencies
pip install groundcua[all] # full dependencies (optional)
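A quick way to sanity-check the installation is a short import test. This is a minimal sketch; the __version__ attribute is an assumption based on the version.py module in the repository layout:

import torch
import groundcua  # should import without errors after installation

# __version__ is assumed from groundcua/version.py; fall back gracefully if absent
print("groundcua version:", getattr(groundcua, "__version__", "unknown"))
print("CUDA available:", torch.cuda.is_available())  # required for flash-attn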
Alternative: Install from Source
# Create and activate environment
conda create -n groundcua python=3.10 -y
conda activate groundcua
pip install --upgrade pip
# Clone repository
git clone https://github.com/ServiceNow/GroundCUA.git
cd GroundCUA
# Install PyTorch (adjust for your CUDA version)
pip install torch torchvision
# Install Flash Attention (recommended for faster inference)
pip install flash-attn --no-build-isolation
# Install dependencies
pip install -r requirements.txt
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from PIL import Image
import groundcua
import io
from urllib.request import urlopen
model_name = "ServiceNow/GroundNext-7B-V0"
# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
trust_remote_code=True
).eval()
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Configure generation (greedy decoding; temperature is ignored when do_sample=False)
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE
model.generation_config.do_sample = False
model.generation_config.use_cache = True
# Load and prepare image
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png"
image = Image.open(io.BytesIO(urlopen(url).read()))
image, (width, height) = groundcua.prepare_image(image)
# Create messages and generate
instruction = "Click on the 'File' button"
messages = groundcua.create_messages(instruction, image, width, height)
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
- [Nov 11, 2025] 🎉 We released our project webpage, the GroundCUA dataset, and the GroundNext-7B model!
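To turn the quickstart's printed <tool_call> string into an executable action, it can be parsed back into coordinates. A minimal sketch, assuming the single-call format shown in the expected output above:

import json
import re

# Pull the JSON payload out of the <tool_call>...</tool_call> wrapper
match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
if match:
    call = json.loads(match.group(1))
    if call["name"] == "computer_use" and call["arguments"]["action"] == "left_click":
        x, y = call["arguments"]["coordinate"]
        print(f"Predicted click target: ({x}, {y})")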
Desktop grounding benchmarks:
| Model | ScreenSpot-Pro | OSWorld-G | UI-Vision | Avg |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 29.7 | 42.7 | 16.5 | 29.6 |
| UI-TARS-72B | 38.1 | 57.1 | 25.5 | 40.2 |
| GroundNext-3B | 49.8 | 64.2 | 62.1 | 58.7 |
| GroundNext-7B | 52.9 | 67.7 | 60.3 | 60.3 |
Cross-platform GUI benchmarks:
| Model | MMBench-GUI | ScreenSpot-v2 | Avg |
|---|---|---|---|
| Qwen2.5-VL-7B | 33.9 | 88.8 | 61.4 |
| UI-TARS-72B | 74.3 | 90.3 | 82.3 |
| GroundNext-3B | 77.1 | 88.5 | 82.8 |
| GroundNext-7B | 81.1 | 90.4 | 85.8 |
Performance numbers demonstrate strong cross-domain (desktop, mobile and web) generalization despite training only on desktop data.
GroundNext models also demonstrate strong agentic capabilities when integrated with reasoning models. When combined with OpenAI o3, GroundNext-3B achieves competitive performance on OSWorld, matching or exceeding much larger models.
OSWorld success rate (%) by task category:
| Model | OS | Office | Daily | Pro | Workflow | Overall |
|---|---|---|---|---|---|---|
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-7B | 41.7 | 22.5 | 35.4 | 46.3 | 9.8 | 26.5 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | 61.9 | 75.5 | 35.3 | 51.0 |
| GroundNext-3B w/ o3 (ours) | 62.5 | 47.0 | 55.0 | 73.5 | 36.5 | 50.6 |
Task categories: OS (operating system tasks), Office (productivity applications), Daily (common user tasks), Pro (professional software), Workflow (multi-step workflows).
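Conceptually, the integration is a planner-grounder loop: the reasoning model decides what to do next in natural language, and GroundNext resolves where on screen to do it. A minimal sketch; screenshot_fn, plan_next_step, ground_fn, and execute_click are hypothetical stand-ins for the screen capture, the o3 planner, GroundNext inference, and the OS automation layer:

# Hypothetical planner-grounder loop (all four callables are placeholders,
# not part of the GroundCUA package)
def run_episode(task, screenshot_fn, plan_next_step, ground_fn, execute_click, max_steps=20):
    for _ in range(max_steps):
        screenshot = screenshot_fn()
        instruction = plan_next_step(task, screenshot)  # e.g., "Click the 'File' menu"
        if instruction is None:  # planner signals task completion
            return True
        x, y = ground_fn(instruction, screenshot)  # GroundNext: instruction -> coordinates
        execute_click(x, y)
    return False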
- Data Efficiency: Achieves SOTA with only 700K training examples vs 9M+ in prior work
- Cross-Domain Excellence: Strong performance across desktop, mobile, and web despite desktop-only training
- Fine-Grained Grounding: Superior performance on small UI elements and complex workflows
We will release the full training pipeline in the training/ folder soon. Stay tuned!
GroundCUA is a large-scale, human-annotated desktop grounding dataset with dense supervision:
- 📊 Scale: 56K annotated screenshots, 3.56M element annotations
- 🎯 Density: Maximum annotation density covering almost every visible UI element
- ✅ Quality: Human-verified annotations from trained experts
- 🖥️ Coverage: 87 desktop applications across 12 categories
- 📏 Resolution: High-resolution images (500K to 7M pixels)
- 🏷️ Categories: Fine-grained category information for 50% of elements
Download the GroundCUA dataset:
pip install -U huggingface_hub
huggingface-cli download ServiceNow/GroundCUA --repo-type dataset --local-dir ./GroundCUA
Evaluation covers five benchmarks:
- ScreenSpot-Pro: Desktop element grounding
- ScreenSpot-v2: Web and mobile interface grounding
- MMBench-GUI: GUI understanding tasks
- OSWorld-G: Operating system grounding
- UI-Vision: Diverse desktop application grounding
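As an alternative to the CLI download above, the dataset can likely be loaded directly with the datasets library. A sketch only; the split name and record schema are assumptions:

from datasets import load_dataset

# Streaming avoids materializing all 56K screenshots locally
ds = load_dataset("ServiceNow/GroundCUA", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # inspect the actual schema before relying on field names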
To run the evaluation suite:
cd eval/
# Evaluate on specific benchmark
python eval.py \
--model_type qwen25vl \
--model_name_or_path /path/to/trained/model \
--benchmark screenspot \
--data_path /path/to/benchmark/data \
--output_dir results/
# Evaluate on all benchmarks
python eval.py \
--model_type qwen25vl \
--model_name_or_path /path/to/trained/model \
--benchmark all \
--task all \
    --language en
Reported metrics:
- Accuracy: Precision of GUI element localization
- Success Rate: Percentage of correctly grounded elements
- Cross-Domain Performance: Generalization to unseen platforms
- Fine-Grained Performance: Accuracy on small UI elements
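Concretely, a grounding prediction is typically scored as correct when the predicted click point lands inside the target element's bounding box. A minimal sketch of that check; the function and box format are illustrative, not the eval script's API:

def click_in_bbox(x, y, bbox):
    """True if the predicted click (x, y) falls inside bbox = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

# Example: a prediction at (105, 48) for an element spanning (90, 40)-(140, 60)
print(click_in_bbox(105, 48, (90, 40, 140, 60)))  # True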
GroundCUA/
├── README.md            # This file
├── pyproject.toml       # Package configuration
├── PUBLISHING.md        # Guide for publishing to PyPI
├── assets/              # Images and resources
├── groundcua/           # Main package (pip installable)
│   ├── __init__.py      # Package initialization and utilities
│   └── version.py       # Version information
├── eval/                # Evaluation framework
│   ├── eval.py          # Main evaluation script
│   ├── data.py          # Data loading utilities
│   ├── prompts.py       # Prompt processing
│   └── models/          # Model implementations
└── training/            # Training pipeline (documentation coming soon)
We thank the following projects and teams for their contributions to the open-source community:
- InfiGUI-G1 for the evaluation framework foundation
- LLaMA-Factory for the excellent SFT training framework
- verl for the robust RL infrastructure
- Qwen-2.5-VL for the foundation vision-language models
- OpenCUA for inspiring the repository design
- The computer use and GUI automation research community
GroundCUA is intended for research and educational purposes only.
- The model, dataset, and code may not be used for any purpose that violates applicable laws or regulations
- Use for illegal, unethical, or harmful activities is strictly prohibited
- The authors and contributors are not responsible for any illegal, unethical, or harmful use
- Users are solely responsible for ensuring compliance with applicable laws and regulations
If you use GroundCUA in your research, please cite our work:
@misc{feizi2025groundingcomputeruseagents,
title={Grounding Computer Use Agents on Human Demonstrations},
author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han LΓΉ and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
year={2025},
eprint={2511.07332},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.07332},
}

