
GroundCUA: Grounding Computer Use Agents on Human Demonstrations

  🌐 Website | 📑 Paper | 🤗 Dataset | 🤖 Models

GroundCUA Overview

Authors

Aarash Feizi1,2,4*, Shravan Nayak1,3*,
Xiangru Jian5, Kevin Qinghong Lin6, Kaixin Li6, Rabiul Awal1,3,4, Xing Han Lù1,2, Johan Obando-Ceron1,3, Juan A. Rodriguez1,8, Nicolas Chapados4, David Vazquez4, Adriana Romero-Soriano1,2, Reihaneh Rabbany1,2,
Perouz Taslakian4, Christopher Pal4, Spandana Gella4, Sai Rajeswar4,1,3

1Mila - Quebec AI Institute, 2McGill University, 3Université de Montréal,
4ServiceNow Research, 5University of Waterloo, 6National University of Singapore,
7Polytechnique Montréal, 8École de Technologie Supérieure, 9CIFAR AI Chair

*Equal contribution

Introduction

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. We address this gap through:

  • GroundCUA Dataset: A large-scale, human-annotated desktop grounding dataset of 56K screenshots drawn from over 10,000 real-world human tasks across 87 applications, with 3.56M+ human-verified annotations
  • GroundNext Models: Vision-language models at 3B and 7B scales achieving state-of-the-art results across five benchmarks
  • Efficient Training: SOTA performance using one-tenth the training data of prior work

Key Features

🎯 High-Quality Desktop Dataset

  • Expert-annotated screenshots with maximum annotation density
  • Coverage of almost every visible element, including small icons and controls
  • Fine-grained category information (menus, sidebars, etc.) for 50% of UI elements, fully open-source

⚡ Efficient Model Training

  • State-of-the-art performance with 700K datapoints vs 9M+ in prior work
  • Two-stage training: supervised fine-tuning + reinforcement learning with fully open-source code
  • Models at 3B and 7B scales for efficiency and accuracy

🌐 Cross-Platform Generalization

  • Comprehensive evaluation on five challenging benchmarks
  • Robust generalization across desktop, mobile, and web environments despite training only on desktop data

🚀 Quick Start

Installation & Setup

To install from PyPI (recommended):

# Create and activate environment
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip  # optional

# Install PyTorch (adjust for CUDA version) and Flash Attention (for faster inference)
pip install torch torchvision
pip install flash-attn --no-build-isolation

# Install GroundCUA package for utilities
pip install groundcua            # basic dependencies
pip install "groundcua[all]"     # full dependencies (optional)

Alternative: Install from Source

# Create and activate environment
conda create -n groundcua python=3.10 -y
conda activate groundcua

pip install --upgrade pip

# Clone repository
git clone https://github.com/ServiceNow/GroundCUA.git
cd GroundCUA

# Install PyTorch (adjust for your CUDA version)
pip install torch torchvision

# Install Flash Attention (recommended for faster inference)
pip install flash-attn --no-build-isolation

# Install project dependencies
pip install -r requirements.txt
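
Either installation route can be sanity-checked with a quick import. This one-liner (a minimal check, not a documented command of the package) only verifies that PyTorch and the groundcua package are importable:

# Verify the environment
python -c "import torch, groundcua; print(torch.__version__, groundcua.__file__)"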

Quick GroundNext Model Inference

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from PIL import Image
import groundcua
import io
from urllib.request import urlopen

model_name = "ServiceNow/GroundNext-7B-V0"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Configure generation
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE
model.generation_config.do_sample = False
model.generation_config.use_cache = True

# Load and prepare image
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png"
image = Image.open(io.BytesIO(urlopen(url).read()))
image, (width, height) = groundcua.prepare_image(image)

# Create messages and generate
instruction = "Click on the 'File' button"
messages = groundcua.create_messages(instruction, image, width, height)

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
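
The response is a <tool_call> string in the format shown in the comment above. The following parsing sketch is ours (the helper name and regular expression are not part of the groundcua package); it recovers the action and pixel coordinate from that string:

import json
import re

def parse_tool_call(response: str):
    """Extract the action name and (x, y) coordinate from a <tool_call> response."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    if match is None:
        raise ValueError(f"No <tool_call> block found in: {response!r}")
    payload = json.loads(match.group(1))  # {"name": "computer_use", "arguments": {...}}
    arguments = payload["arguments"]
    return arguments["action"], tuple(arguments["coordinate"])

action, (x, y) = parse_tool_call(response)
print(action, x, y)  # e.g. left_click 123 456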

Updates

Performance

Desktop Grounding Benchmarks

| Model | ScreenSpot-Pro | OSWorld-G | UI-Vision | Avg |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 29.7 | 42.7 | 16.5 | 29.6 |
| UI-TARS-72B | 38.1 | 57.1 | 25.5 | 40.2 |
| GroundNext-3B | 49.8 | 64.2 | 62.1 | 58.7 |
| GroundNext-7B | 52.9 | 67.7 | 60.3 | 60.3 |

Cross-Platform Generalization

| Model | MMBench-GUI | ScreenSpot-v2 | Avg |
|---|---|---|---|
| Qwen2.5-VL-7B | 33.9 | 88.8 | 61.4 |
| UI-TARS-72B | 74.3 | 90.3 | 82.3 |
| GroundNext-3B | 77.1 | 88.5 | 82.8 |
| GroundNext-7B | 81.1 | 90.4 | 85.8 |

Performance numbers demonstrate strong cross-domain (desktop, mobile and web) generalization despite training only on desktop data.

Agentic Performance on OSWorld

GroundNext models also demonstrate strong agentic capabilities when integrated with reasoning models. When combined with OpenAI o3, GroundNext-3B achieves competitive performance on OSWorld, matching or exceeding much larger models.

| Model | OS | Office | Daily | Pro | Workflow | Overall |
|---|---|---|---|---|---|---|
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-7B | 41.7 | 22.5 | 35.4 | 46.3 | 9.8 | 26.5 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | 61.9 | 75.5 | 35.3 | 51.0 |
| GroundNext-3B w/ o3 (ours) | 62.5 | 47.0 | 55.0 | 73.5 | 36.5 | 50.6 |

Task categories: OS (operating system tasks), Office (productivity applications), Daily (common user tasks), Pro (professional software), Workflow (multi-step workflows).

Key Results

  • Data Efficiency: Achieves SOTA with only 700K training examples vs 9M+ in prior work
  • Cross-Domain Excellence: Strong performance across desktop, mobile, and web despite desktop-only training
  • Fine-Grained Grounding: Superior performance on small UI elements and complex workflows

🎓 Training

🚧 Coming Soon: We are currently refining the training documentation and code. Complete training instructions, including supervised fine-tuning and reinforcement learning recipes, will be released in the training/ folder soon. Stay tuned!

Dataset

GroundCUA Dataset Overview

GroundCUA is a large-scale, human-annotated desktop grounding dataset with dense supervision:

  • 📊 Scale: 56K annotated screenshots, 3.56M element annotations
  • 🎯 Density: Maximum annotation density covering almost every visible UI element
  • ✅ Quality: Human-verified annotations from trained experts
  • 🖥️ Coverage: 87 desktop applications across 12 categories
  • 📏 Resolution: High-resolution images (500K to 7M pixels)
  • 🏷️ Categories: Fine-grained category information for 50% of elements

Dataset Access

Download the GroundCUA dataset:

pip install -U huggingface_hub
huggingface-cli download ServiceNow/GroundCUA --repo-type dataset --local-dir ./GroundCUA
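
The same download can also be scripted from Python through huggingface_hub; a minimal sketch (the local directory path is arbitrary):

from huggingface_hub import snapshot_download

# Fetch the full dataset repository (images and annotations) into a local folder
snapshot_download(
    repo_id="ServiceNow/GroundCUA",
    repo_type="dataset",
    local_dir="./GroundCUA",
)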

📊 Evaluation

Our evaluation framework builds upon InfiGUI-G1 and provides comprehensive evaluation across multiple benchmarks.

Supported Benchmarks

  • ScreenSpot-Pro: Desktop element grounding
  • ScreenSpot-v2: Web and mobile interface grounding
  • MMBench-GUI: GUI understanding tasks
  • OSWorld-G: Operating system grounding
  • UI-Vision: Diverse desktop application grounding

Running Evaluations

cd eval/

# Evaluate on specific benchmark
python eval.py \
    --model_type qwen25vl \
    --model_name_or_path /path/to/trained/model \
    --benchmark screenspot \
    --data_path /path/to/benchmark/data \
    --output_dir results/

# Evaluate on all benchmarks
python eval.py \
    --model_type qwen25vl \
    --model_name_or_path /path/to/trained/model \
    --benchmark all \
    --task all \
    --language en

Evaluation Metrics

  • Accuracy: Precision of GUI element localization (a point-in-box check, sketched after this list)
  • Success Rate: Percentage of correctly grounded elements
  • Cross-Domain Performance: Generalization to unseen platforms
  • Fine-Grained Performance: Accuracy on small UI elements
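
Both Accuracy and Success Rate reduce to checking whether a predicted click point lands inside the target element's bounding box. A minimal sketch of that check, assuming an [x1, y1, x2, y2] box format (the function names are illustrative, not the evaluation framework's API):

def point_in_box(point, box):
    """True if a predicted (x, y) click lands inside box = [x1, y1, x2, y2]."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(predicted_points, target_boxes):
    """Fraction of predictions that hit their ground-truth element box."""
    hits = sum(point_in_box(p, b) for p, b in zip(predicted_points, target_boxes))
    return hits / len(predicted_points) if predicted_points else 0.0

# Example: the first click lands inside its box, the second misses
print(grounding_accuracy([(10, 10), (300, 40)], [[0, 0, 20, 20], [0, 0, 100, 100]]))  # 0.5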

Project Structure

GroundCUA/
├── README.md                   # This file
├── pyproject.toml              # Package configuration
├── PUBLISHING.md               # Guide for publishing to PyPI
├── assets/                     # Images and resources
├── groundcua/                  # Main package (pip installable)
│   ├── __init__.py             # Package initialization and utilities
│   └── version.py              # Version information
├── eval/                       # Evaluation framework
│   ├── eval.py                 # Main evaluation script
│   ├── data.py                 # Data loading utilities
│   ├── prompts.py              # Prompt processing
│   └── models/                 # Model implementations
└── training/                   # Training pipeline (documentation coming soon)

Acknowledgements

We thank the following projects and teams for their contributions to the open-source community:

  • InfiGUI-G1 for the evaluation framework foundation
  • LLaMA-Factory for the excellent SFT training framework
  • verl for the robust RL infrastructure
  • Qwen-2.5-VL for the foundation vision-language models
  • OpenCUA for repository design inspiration
  • The computer use and GUI automation research community

Research Use and Disclaimer

GroundCUA is intended for research and educational purposes only.

Prohibited Uses

  • The model, dataset, and code may not be used for any purpose that violates applicable laws or regulations
  • Use for illegal, unethical, or harmful activities is strictly prohibited

Disclaimer

  • The authors and contributors are not responsible for any illegal, unethical, or harmful use
  • Users are solely responsible for ensuring compliance with applicable laws and regulations

Citation

If you use GroundCUA in your research, please cite our work:

@misc{feizi2025groundingcomputeruseagents,
      title={Grounding Computer Use Agents on Human Demonstrations}, 
      author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
      year={2025},
      eprint={2511.07332},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.07332}, 
}
