
A toolbox for benchmarking the trustworthiness of multimodal LLM agents across the truthfulness, controllability, safety, and privacy dimensions through 34 interactive tasks

thu-ml/MLA-Trust


MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments

Truthfulness Safety Controllability Privacy

🛡️ MLA-Trust is a comprehensive and unified framework that evaluates the trustworthiness of multimodal LLM agents (MLAs) across four principled dimensions: truthfulness, controllability, safety, and privacy. The framework includes 34 high-risk interactive tasks designed to expose new trustworthiness challenges in GUI environments.

Framework

  • Truthfulness captures whether the agent correctly interprets visual or DOM-based elements on the GUI, and whether it produces factual outputs based on those perceptions.

  • Controllability assesses whether the agent introduces unnecessary steps, drifts from the intended goal, or triggers side effects not specified by the user.

  • Safety examines whether the agent's actions are free from harmful or irreversible consequences, encompassing the prevention of behaviors that cause financial loss, data corruption, or system failures.

  • Privacy evaluates whether the agent respects the confidentiality of sensitive information, since MLAs routinely capture screenshots, handle form data, and interact with files.

🎯 Main Findings

🚨 Severe vulnerabilities in GUI environments: Both proprietary and open-source MLAs that interact with GUIs exhibit more severe trustworthiness risks compared to traditional MLLMs, particularly in high-stakes scenarios such as financial transactions.
🔄 Multi-step dynamic interactions amplify vulnerabilities: The transformation of MLLMs into GUI-based MLAs significantly compromises their trustworthiness. In multi-step interactive settings, these agents can execute harmful content that standalone MLLMs would typically reject.
⚠️ Emergence of derived risks from iterative autonomy: Multi-step execution enhances adaptability but introduces latent and nonlinear risk accumulation across decision cycles, leading to unpredictable derived risks.
📈 Trustworthiness correlation: Open-source models employing structured fine-tuning strategies (e.g., SFT and RLHF) demonstrate improved controllability and safety. Larger models generally exhibit higher trustworthiness across multiple sub-aspects.

💻 Installation

  1. Install uv by following the official installation guide. Ensure the PATH environment variable is configured as prompted.
  2. Install dependencies:
    # Install the base dependencies
    uv sync
    # Optionally add the flash-attn extra for faster attention kernels
    uv sync --extra flash-attn
📱 Mobile Setup

A. ADB Setup and Configuration

Reference: Mobile-Agent-E Repository

  1. Install Android Debug Bridge (ADB)

  2. Enable Developer Options

    • Go to Settings → About phone
    • Tap "MIUI version" repeatedly until developer options are enabled (Xiaomi, for example; other vendors label this entry "Build number")
    • Navigate to Settings → Additional Settings → Developer options
  3. Enable USB Debugging

    • Enable "USB debugging" in Developer options
    • Connect phone via USB cable
    • Select "File Transfer" mode when prompted
  4. Verify ADB Connection

    # Check connected devices
    adb devices
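The `adb devices` output can also be checked programmatically; below is a minimal sketch (the `parse_adb_devices` helper is our illustration, not part of this repo) that parses the tab-separated device list:

```python
def parse_adb_devices(output: str) -> list[tuple[str, str]]:
    """Parse `adb devices` output into (serial, state) pairs.

    The output is a header line ("List of devices attached") followed
    by one "serial<TAB>state" line per device.
    """
    pairs = []
    for line in output.splitlines()[1:]:        # skip the header line
        parts = line.strip().split("\t")
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))  # (serial, state)
    return pairs

sample = "List of devices attached\nemulator-5554\tdevice\nabc123\tunauthorized\n"
print(parse_adb_devices(sample))  # [('emulator-5554', 'device'), ('abc123', 'unauthorized')]
```

On a real machine, feed it `subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout`; every serial should report the state `device`, not `unauthorized` or `offline`.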

B. Task Preconditions

  1. Modify the scripts/mobile/adb.sh script for device setup
    • Script functions: (a) unlock the device; (b) return to the home screen
    • Must be executed before each task
    • Customize it according to your device specifications
  2. Update ANDROID_SERIAL in scripts/mobile/run_task.sh to match your device
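A minimal sketch of what scripts/mobile/adb.sh might contain (the key events are standard Android input commands, but the swipe coordinates are placeholders and assume swipe-to-unlock with no PIN; `DRY_RUN` and `adb_cmd` are our illustrative additions that print each command instead of executing it, so the sequence can be verified without a connected device):

```shell
#!/usr/bin/env bash
# Sketch of scripts/mobile/adb.sh -- adjust for your device.
# With DRY_RUN=1 (the default here), commands are printed, not executed.
set -eu
DRY_RUN="${DRY_RUN:-1}"

adb_cmd() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "adb -s ${ANDROID_SERIAL:-<serial>} $*"
  else
    adb -s "$ANDROID_SERIAL" "$@"
  fi
}

adb_cmd shell input keyevent KEYCODE_WAKEUP     # wake the screen
adb_cmd shell input swipe 500 1600 500 400 300  # swipe up to unlock (no PIN)
adb_cmd shell input keyevent KEYCODE_HOME       # return to the home screen
```

Set `DRY_RUN=0` (with `ANDROID_SERIAL` exported) once the printed sequence matches what your device needs.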

Our experimental device and operating system were as follows: (a) Device: Redmi Note 13 Pro; (b) Operating System: Xiaomi HyperOS 2.0.6.0.

🌐 Website Setup

A. Task Preconditions

Since many tasks require a login to function properly, we provide cookie-loading functionality so the agent can work correctly. Run the following command on a machine with a graphical web interface, perform your login in the browser window that opens, then close it to save the cookies.

python src/scene/web/load_cookies.py

Then save the generated *.json files to src/scene/web/cookies.
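The saved files can be inspected or filtered before use; here is a minimal sketch (`load_valid_cookies` is our illustrative helper, and the `expiry`/`name` fields assume the common browser cookie-dump format, not necessarily the exact schema load_cookies.py writes):

```python
import json
import time
from pathlib import Path

def load_valid_cookies(cookie_dir: str) -> list[dict]:
    """Load cookies from every *.json file in cookie_dir, dropping
    entries whose 'expiry' timestamp is already in the past."""
    now = time.time()
    cookies = []
    for path in sorted(Path(cookie_dir).glob("*.json")):
        for cookie in json.loads(path.read_text()):
            # Cookies without an 'expiry' field are session cookies; keep them.
            if cookie.get("expiry", now + 1) > now:
                cookies.append(cookie)
    return cookies

# Demo with a throwaway directory (a real run would point at src/scene/web/cookies)
import tempfile
with tempfile.TemporaryDirectory() as d:
    Path(d, "demo.json").write_text(json.dumps([
        {"name": "sessionid", "value": "abc", "expiry": time.time() + 3600},
        {"name": "stale", "value": "x", "expiry": 1},
    ]))
    print([c["name"] for c in load_valid_cookies(d)])  # ['sessionid']
```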

🌟 Quick Start

  1. Configure environment variables
    cp .env.template .env
  2. Activate the virtual environment
    source .venv/bin/activate
  3. Execute the main tasks
    bash scripts/mobile/run_task.sh
    bash scripts/web/run_task.sh
  4. Run the evaluation
    bash scripts/mobile/eval.sh
    bash scripts/web/eval.sh

🚀 Supported Models

The following models are supported:

  • gpt-4o-2024-11-20
  • gpt-4-turbo
  • gemini-2.0-flash
  • gemini-2.0-pro-exp-02-05
  • claude-3-7-sonnet-20250219
  • llava-hf/llava-v1.6-mistral-7b-hf
  • lmms-lab/llava-onevision-qwen2-72b-ov-sft
  • lmms-lab/llava-onevision-qwen2-72b-ov-chat
  • microsoft/Magma-8B
  • Qwen/Qwen2.5-VL-7B-Instruct
  • deepseek-ai/deepseek-vl2
  • openbmb/MiniCPM-o-2_6
  • mistral-community/pixtral-12b
  • microsoft/Phi-4-multimodal-instruct
  • OpenGVLab/InternVL2-8B
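The list mixes hosted API models with Hugging Face checkpoints (the org/name entries); a simple heuristic (purely our illustration, not the repo's actual dispatch code) tells them apart:

```python
def backend_for(model_id: str) -> str:
    """Guess how a model id is served: Hugging Face checkpoints use
    'org/name' paths; the remaining entries are hosted API models."""
    return "huggingface" if "/" in model_id else "api"

print(backend_for("gpt-4o-2024-11-20"))            # api
print(backend_for("Qwen/Qwen2.5-VL-7B-Instruct"))  # huggingface
```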

📋 Task Overview

Task List

Our comprehensive task suite covers 34 high-risk interactive scenarios across multiple domains.

🏆 Results


Performance ranking of different MLAs across trustworthiness dimensions


🤝 Acknowledgement

We acknowledge and thank the projects Mobile-Agent-E and SeeAct, whose foundational work has supported the development of this project.

📞 Contact

For questions, suggestions, or collaboration opportunities, please contact us at jankinfmail@gmail.com, 52285904015@stu.ecnu.edu.cn, or yangxiao19@tsinghua.org.cn.

🌟 Citation

If you find this work useful, please consider citing our paper:

@article{yang2025mla,
  title={MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments},
  author={Yang, Xiao and Chen, Jiawei and Luo, Jun and Fang, Zhengwei and Dong, Yinpeng and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2506.01616},
  year={2025}
}
