📑 Paper | 🌐 Project Page | 💾 AGUVIS Data Collection
AGUVIS is a unified, pure vision-based framework for autonomous GUI agents that operates across platforms (web, desktop, and mobile). Unlike previous approaches that rely on textual representations, AGUVIS pairs purely vision-based observations with a consistent action space, enabling better generalization across platforms.
- 🔍 Pure Vision Framework: First fully autonomous pure vision GUI agent that performs tasks on its own, without relying on closed-source models
- 🔄 Cross-Platform Unification: Unified action space and plugin system that works consistently across different GUI environments
- 📊 Comprehensive Dataset: Large-scale dataset of GUI agent trajectories with multimodal grounding and reasoning
- 🧠 Two-Stage Training: Novel training pipeline focusing on GUI grounding followed by planning and reasoning
- 💭 Inner Monologue: Explicit planning and reasoning capabilities integrated into the model training
Our framework demonstrates state-of-the-art performance in both offline and real-world online scenarios, offering a more efficient and generalizable approach to GUI automation.
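To make the unified action space concrete, the sketch below shows the idea of expressing every GUI interaction as a pyautogui-style command grounded in screen coordinates, as described in the paper. It is a minimal illustration of that idea, not the repository's actual code; the `Action` class and `to_pyautogui` helper are hypothetical names.

```python
# Minimal sketch of a unified, coordinate-based action space (hypothetical
# names; see the repository for the actual implementation).
from dataclasses import dataclass


@dataclass
class Action:
    """A platform-agnostic GUI action grounded in screen coordinates."""
    kind: str    # e.g. "click", "write", "hotkey"
    args: tuple  # positional arguments for the action


def to_pyautogui(action: Action) -> str:
    """Render an Action as the pyautogui-style command string the model emits."""
    rendered = ", ".join(repr(a) for a in action.args)
    return f"pyautogui.{action.kind}({rendered})"


# The same click works on web, desktop, or mobile screenshots, because it is
# addressed by screen coordinates rather than DOM or accessibility-tree IDs.
print(to_pyautogui(Action("click", (0.62, 0.31))))  # -> pyautogui.click(0.62, 0.31)
print(to_pyautogui(Action("write", ("hello",))))    # -> pyautogui.write('hello')
```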
Demo videos: `overview.mp4` (framework overview), `androidworld.mp4` (AndroidWorld), `mind2web-live.mp4` (Mind2Web-Live), `osworld.mp4` (OSWorld).
- Clone the repository:

```bash
git clone git@github.com:xlang-ai/aguvis.git
cd aguvis
```
- Create and activate a conda environment:

```bash
conda create -n aguvis python=3.10
conda activate aguvis
```
- Install PyTorch and project dependencies:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
pip install -e .
```
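Before moving on to data preparation, it can help to confirm that PyTorch sees your GPU. This is a generic PyTorch sanity check, not a script shipped with this repository:

```python
# Quick sanity check for the conda environment (generic PyTorch check,
# not part of the aguvis codebase).
import torch

print(f"PyTorch {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
```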
- Stage 1: Grounding
  - Download the dataset from aguvis-stage1
  - Place the data according to the structure defined in `data/stage1.yaml`
- Stage 2: Planning and Reasoning
  - Download the dataset from aguvis-stage2
  - Place the data according to the structure defined in `data/stage2.yaml` (see the snippet below for a quick way to inspect these configs)
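The stage YAML files define the dataset mixture for each training stage. As a quick, schema-agnostic way to verify your data placement, the snippet below (not part of the repository) loads a config and prints its top-level structure; it assumes only that the file is valid YAML:

```python
# Print the top-level structure of a stage config to verify data placement.
# Assumes only that the file is valid YAML; data/stage1.yaml itself is the
# authoritative reference for the expected layout.
import yaml

with open("data/stage1.yaml") as f:
    config = yaml.safe_load(f)

if isinstance(config, dict):
    for key, value in config.items():
        print(f"{key}: {type(value).__name__}")
else:
    print(type(config).__name__, config)
```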
- Configure your training settings:
  - Open `scripts/train.sh`
  - Set the `SFT_TASK` variable to specify your training stage
- Start training:

```bash
bash scripts/train.sh
```
- Data
  - ✅ Stage 1: Grounding Dataset
  - ✅ Stage 2: Planning and Reasoning Trajectories
- Code
  - ✅ Training Pipeline
  - 🚧 Model Weights and Configurations
  - 🚧 Inference Scripts
  - 🚧 Evaluation Toolkit
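While the official inference scripts and weights are still marked in progress above, a rough sketch of what inference could look like is below. It assumes an eventual checkpoint compatible with the Qwen2-VL interface in Hugging Face Transformers (the backbone reported in the paper); the model path is a placeholder, not a real release.

```python
# Hypothetical inference sketch. Assumes a checkpoint that follows the
# Qwen2-VL interface; MODEL_PATH is a placeholder, since official weights
# are still in progress.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_PATH = "path/to/aguvis-checkpoint"  # placeholder, not a real hub ID

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

# A single screenshot observation plus a natural-language instruction.
screenshot = Image.open("screenshot.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Click the search button."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

# The model would emit a pyautogui-style action string for the next step.
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```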
If you find this work helpful, please cite it as:

```bibtex
@article{xu2024aguvis,
  title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction},
  author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie and Amrita Saha and Doyen Sahoo and Tao Yu and Caiming Xiong},
  journal={arXiv preprint arXiv:2412.04454},
  year={2024},
  url={https://arxiv.org/abs/2412.04454}
}
```