Zezhou Wang1, Ziyun Zhang2, Xiaoyi Zhang3, Zhuzhong Qian1, Yan Lu3
1Nanjing University, 2Peking University, 3Microsoft Research Asia
🌐 Website | 📑 arXiv (coming soon) | 🤖 Model | 🤗 Dataset (coming soon)
🏆 #1 Open-Source End-to-End Model on OSWorld (15 steps): Achieves 32.13% success rate, surpassing all open-source end-to-end models.
📊 Extreme Data Efficiency: Matches GUI-OWL-7B performance using only 128 training tasks.
- 2026-01: We release the webpage and model BEPA-7B-S2. Check it out!
We propose BEPA (Bi-Level Expert-to-Policy Assimilation), a framework that turns static expert traces into dynamic, policy-aligned guidance for GUI agents. BEPA improves UITARS1.5-7B from 22.87% to 32.13% on OSWorld-Verified (+9.26 points).
Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. Top-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified.
We ask: How can reinforcement learning from verifiable rewards (RLVR) best exploit a small pool of expert trajectories to train end-to-end policies?
Naively mixing these off-policy traces into on-policy RLVR is brittle due to:
- Structural Mismatch: Framework traces interleave multiple roles (planning, execution, grounding) that end-to-end policies cannot directly imitate.
- Distribution Gap: Even after format conversion, trajectories remain far from the base-policy manifold.
BEPA operates in two complementary stages:
Level 1 transforms structurally alien expert traces into policy-compatible trajectories: we abstract each expert trajectory into a compact natural-language plan, then let the base policy act in the environment conditioned on that plan. The resulting trajectories lie much closer to the policy's own manifold.
Level 2 dynamically maintains a per-task cache and injects guided trajectories into GRPO updates only upon total on-policy failure. The cache is continuously refreshed with the policy's own successful executions, so the off-policy signal evolves alongside the agent.
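Below is a minimal Python sketch of how the two levels could compose in a GRPO-style training step. All names here (`summarize_plan`, `policy.rollout`, `grpo_update`, the `cache` dict) are hypothetical placeholders for illustration, not the released Verl-GUI API.

```python
# Hypothetical sketch of BEPA's bi-level assimilation; names are illustrative only.
from dataclasses import dataclass

@dataclass
class Trajectory:
    actions: list
    reward: float  # verifiable task reward (e.g. 1.0 = success, 0.0 = failure)

def level1_assimilate(expert_trace, policy, env, summarize_plan):
    """Level 1: convert an off-policy expert trace into a policy-compatible trajectory."""
    plan = summarize_plan(expert_trace)      # compact natural-language plan of the expert trace
    return policy.rollout(env, plan=plan)    # base policy acts in the environment, plan-conditioned

def level2_grpo_step(task, policy, env, cache, grpo_update, group_size=8):
    """Level 2: inject the cached guided trajectory only when every on-policy rollout fails."""
    group = [policy.rollout(env) for _ in range(group_size)]
    successes = [t for t in group if t.reward > 0]
    if successes:
        cache[task] = successes[0]           # refresh cache with the policy's own success
    elif task in cache:
        group[0] = cache[task]               # total on-policy failure: fall back to guidance
    grpo_update(group)                       # standard group-relative advantage update
```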
BEPA achieves 32.13% success on OSWorld-Verified, improving over UITARS1.5-7B (22.87%) by +9.26 points (+40.5% relative) and over GRPO (23.60%) by +8.53 points (+36.1% relative).
| Method | D_expert-only (%) | D_train (%) | D_held-out (%) | Overall (%) |
|---|---|---|---|---|
| UITARS1.5-7B | 18.52 | 55.12 | 5.74 | 22.87 |
| GRPO | 11.11 | 58.02 | 5.32 | 23.60 |
| Trace Replacement | 18.52 | 66.50 | 1.29 | 23.91 |
| LUFFY | 19.01 | 65.44 | 2.16 | 24.11 |
| LEVEL-1 | 25.93 | 69.20 | 5.05 | 27.30 |
| LEVEL-2 | 29.18 | 71.65 | 7.48 | 29.74 |
| BEPA (ours) | 35.19 | 73.23 | 10.30 | 32.13 |
As part of this work, we release Verl-GUI, a highly scalable distributed training framework for long-horizon, multi-turn vision-language GUI agent training built upon veRL.
- Heterogeneous Cluster Architecture: Completely separates trainer and rollout into independent Ray clusters, enabling deployment across heterogeneous compute resources (IB/NVLink nodes for training, PCIe nodes for rollout).
- Multiple Storage Backends: Supports Azure Blob Storage, NAS, and local filesystems through a unified abstraction layer (see the interface sketch after this list).
- Async Task Queue: Dynamically maintains a task queue for the rollout cluster to consume, enabling decoupled and non-blocking task processing.
- K-round Rollout Processing: Splits a batch across multiple rollout rounds when the trainer's global batch size exceeds the rollout cluster's capacity (see the batch-splitting example after this list).
- Scalable Parallel Environments: The number of concurrent environments scales with the rollout cluster's compute capacity, with Ray-based orchestration and automatic Docker cleanup.
- Service-oriented Orchestration: Modular components including CheckpointManager, EnvWorkerPool, RolloutService, and ValidationAdapter.
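A minimal sketch of what the unified storage abstraction could look like. The `StorageBackend` protocol and the `LocalFSBackend` class below are assumptions for illustration, not the actual Verl-GUI interface.

```python
import os
import shutil
from typing import Protocol

class StorageBackend(Protocol):
    """Unified interface that Azure Blob, NAS, and local-filesystem backends would implement."""
    def save(self, local_path: str, remote_key: str) -> None: ...
    def load(self, remote_key: str, local_path: str) -> None: ...

class LocalFSBackend:
    """Local-filesystem backend: 'remote' keys are simply paths under a root directory."""
    def __init__(self, root: str):
        self.root = root

    def save(self, local_path: str, remote_key: str) -> None:
        dst = os.path.join(self.root, remote_key)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy(local_path, dst)

    def load(self, remote_key: str, local_path: str) -> None:
        shutil.copy(os.path.join(self.root, remote_key), local_path)
```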
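The K-round behavior reduces to simple batch-splitting arithmetic. `split_into_rounds` is a hypothetical helper shown only to illustrate the idea, not a Verl-GUI function name.

```python
import math

def split_into_rounds(tasks, rollout_capacity):
    """Split a global batch into K rounds when it exceeds the rollout cluster's capacity."""
    k = math.ceil(len(tasks) / rollout_capacity)
    return [tasks[i * rollout_capacity:(i + 1) * rollout_capacity] for i in range(k)]

# Example: a global batch of 256 tasks on a rollout cluster that can host 96 concurrent
# environments is processed in ceil(256 / 96) = 3 rounds (96 + 96 + 64 tasks).
rounds = split_into_rounds(list(range(256)), rollout_capacity=96)
assert [len(r) for r in rounds] == [96, 96, 64]
```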
@misc{wang2026offpolicyonpolicyenhancinggui,
title={From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation},
author={Zezhou Wang and Ziyun Zhang and Xiaoyi Zhang and Zhuzhong Qian and Yan Lu},
year={2026},
eprint={2601.05787},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2601.05787},
}

This project is released under the MIT License.
We thank the following open-source projects for making this work possible: