🛡️ MLA-Trust is a comprehensive, unified framework that evaluates the trustworthiness of multimodal LLM agents (MLAs) across four principled dimensions: truthfulness, controllability, safety, and privacy. The framework includes 34 high-risk interactive tasks designed to expose new trustworthiness challenges in GUI environments.
- Truthfulness captures whether the agent correctly interprets visual or DOM-based elements on the GUI, and whether it produces factual outputs based on those perceptions.
- Controllability assesses whether the agent introduces unnecessary steps, drifts from the intended goal, or triggers side effects not specified by the user.
- Safety assesses whether the agent's actions are free from harmful or irreversible consequences, including behaviors that cause financial loss, data corruption, or system failures.
- Privacy evaluates whether the agent respects the confidentiality of sensitive information, which matters because MLAs often capture screenshots, handle form data, and interact with files.
| 🚨 | Severe vulnerabilities in GUI environments: Both proprietary and open-source MLAs that interact with GUIs exhibit more severe trustworthiness risks compared to traditional MLLMs, particularly in high-stakes scenarios such as financial transactions. |
| 🔄 | Multi-step dynamic interactions amplify vulnerabilities: The transformation of MLLMs into GUI-based MLAs significantly compromises their trustworthiness. In multi-step interactive settings, these agents can act on harmful content that standalone MLLMs would typically reject. |
| ⚡ | Emergence of derived risks from iterative autonomy: Multi-step execution enhances adaptability but introduces latent and nonlinear risk accumulation across decision cycles, leading to unpredictable derived risks. |
| 📈 | Trustworthiness correlation: Open-source models employing structured fine-tuning strategies (e.g., SFT and RLHF) demonstrate improved controllability and safety. Larger models generally exhibit higher trustworthiness across multiple sub-aspects. |
- Install uv by following the official installation guide. Ensure the PATH environment variable is configured as prompted.
- Install dependencies:
    uv sync
    uv sync --extra flash-attn
📱 Mobile Setup
Reference: Mobile-Agent-E Repository
- Install Android Debug Bridge (ADB)
  - Windows: Download from Android Developer Platform Tools
  - macOS: brew install android-platform-tools
  - Linux: sudo apt-get install android-tools-adb
- Enable Developer Options
  - Go to Settings → About phone
  - Tap "MIUI version" repeatedly until developer options are enabled (this is the label on Xiaomi devices; other vendors may name it differently)
  - Navigate to Settings → Additional Settings → Developer options
- Enable USB Debugging
  - Enable "USB debugging" in Developer options
  - Connect the phone via USB cable
  - Select "File Transfer" mode when prompted
- Verify ADB Connection
    # Check connected devices
    adb devices
- Modify the scripts/mobile/adb.sh script for device setup
  - Script functions: (a) unlock device; (b) return to home screen
  - Must execute before each task
  - Customize according to your device specifications
- Update ANDROID_SERIAL in scripts/mobile/run_task.sh to match your device
Our experimental equipment and operating system versions are as follows: (a) Device: Redmi Note 13 Pro; (b) Operating System: Xiaomi HyperOS 2.0.6.0
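The connection check and the ANDROID_SERIAL setting described above can be cross-checked with a small script. The following is a minimal sketch and not part of this repository: the names `parse_adb_devices`, `serial_is_ready`, and `connected_devices` are hypothetical, and the parser assumes the usual `adb devices` output (a "List of devices attached" header followed by one serial/state pair per line).

```python
import subprocess


def parse_adb_devices(output: str) -> dict[str, str]:
    """Map each serial to its state from the text printed by `adb devices`."""
    devices = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the header line, and adb daemon notices ("* daemon ...").
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


def serial_is_ready(serial: str, output: str) -> bool:
    """True only if the serial is listed with state 'device' (online and authorized)."""
    return parse_adb_devices(output).get(serial) == "device"


def connected_devices() -> dict[str, str]:
    """Run `adb devices` (adb must be on PATH) and parse its output."""
    result = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    )
    return parse_adb_devices(result.stdout)
```

If the serial you intend to use shows up as `unauthorized` instead of `device`, accept the USB debugging prompt on the phone and run the check again.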
🌐 Website Setup
Because many tasks require a logged-in session, we provide a cookie-loading script so the agent can work correctly. Run the following command on a machine with a graphical browser, log in to the site, and then close the browser window that popped up to save the cookies.
    python src/scene/web/load_cookies.py
Then save the generated *.json files to src/scene/web/cookies
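Before running web tasks, it can help to confirm that the saved cookie files are readable. The sketch below is an illustration under stated assumptions, not repository code: `check_cookie_files` is a hypothetical helper, and it assumes only that each saved file is valid JSON. It makes no assumption about the cookie schema that `load_cookies.py` writes.

```python
import json
from pathlib import Path


def check_cookie_files(cookie_dir: str) -> dict[str, bool]:
    """Return {file name: True if the file parses as non-empty JSON} for each *.json file."""
    results = {}
    for path in sorted(Path(cookie_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            # An empty list or object usually means no cookies were captured.
            results[path.name] = bool(data)
        except (ValueError, OSError):
            # ValueError covers malformed JSON and undecodable bytes.
            results[path.name] = False
    return results
```

Calling `check_cookie_files("src/scene/web/cookies")` from the repository root lists each cookie file with a flag; a `False` entry suggests repeating the login step for that site.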
- Configure environment variables
    cp .env.template .env
- Activate virtual environment
    source .venv/bin/activate
- Execute main task
    bash scripts/mobile/run_task.sh
    bash scripts/web/run_task.sh
- Run evaluation
    bash scripts/mobile/eval.sh
    bash scripts/web/eval.sh

The following models are supported:
- gpt-4o-2024-11-20
- gpt-4-turbo
- gemini-2.0-flash
- gemini-2.0-pro-exp-02-05
- claude-3-7-sonnet-20250219
- llava-hf/llava-v1.6-mistral-7b-hf
- lmms-lab/llava-onevision-qwen2-72b-ov-sft
- lmms-lab/llava-onevision-qwen2-72b-ov-chat
- microsoft/Magma-8B
- Qwen/Qwen2.5-VL-7B-Instruct
- deepseek-ai/deepseek-vl2
- openbmb/MiniCPM-o-2_6
- mistral-community/pixtral-12b
- microsoft/Phi-4-multimodal-instruct
- OpenGVLab/InternVL2-8B
We acknowledge and thank the projects Mobile-Agent-E and SeeAct, whose foundational work has supported the development of this project.
For questions, suggestions or collaboration opportunities, please contact us at jankinfmail@gmail.com, 52285904015@stu.ecnu.edu.cn, yangxiao19@tsinghua.org.cn
If you find this work useful, please consider citing our paper:
@article{yang2025mla,
title={MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments},
author={Yang, Xiao and Chen, Jiawei and Luo, Jun and Fang, Zhengwei and Dong, Yinpeng and Su, Hang and Zhu, Jun},
journal={arXiv preprint arXiv:2506.01616},
year={2025}
}

