
A toolbox for benchmarking the trustworthiness of multimodal LLM agents across the truthfulness, controllability, safety, and privacy dimensions through 34 interactive tasks

thu-ml/MLA-Trust


MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments

Truthfulness Safety Controllability Privacy

🛡️ MLA-Trust is a comprehensive and unified framework that evaluates the trustworthiness of multimodal LLM agents (MLAs) across four principled dimensions: truthfulness, controllability, safety, and privacy. The framework includes 34 high-risk interactive tasks designed to expose new trustworthiness challenges in GUI environments.

Framework

  • Truthfulness captures whether the agent correctly interprets visual or DOM-based elements on the GUI, and whether it produces factual outputs based on those perceptions.

  • Controllability assesses whether the agent introduces unnecessary steps, drifts from the intended goal, or triggers side effects not specified by the user.

  • Safety examines whether the agent's actions are free from harmful or irreversible consequences, encompassing the prevention of behaviors that cause financial loss, data corruption, or system failures.

  • Privacy evaluates whether the agent respects the confidentiality of sensitive information, since MLAs routinely capture screenshots, handle form data, and interact with files.

🎯 Main Findings

🚨 Severe vulnerabilities in GUI environments: Both proprietary and open-source MLAs that interact with GUIs exhibit more severe trustworthiness risks compared to traditional MLLMs, particularly in high-stakes scenarios such as financial transactions.
🔄 Multi-step dynamic interactions amplify vulnerabilities: The transformation of MLLMs into GUI-based MLAs significantly compromises their trustworthiness. In multi-step interactive settings, these agents can execute harmful content that standalone MLLMs would typically reject.
⚠️ Emergence of derived risks from iterative autonomy: Multi-step execution enhances adaptability but introduces latent and nonlinear risk accumulation across decision cycles, leading to unpredictable derived risks.
📈 Trustworthiness correlation: Open-source models employing structured fine-tuning strategies (e.g., SFT and RLHF) demonstrate improved controllability and safety. Larger models generally exhibit higher trustworthiness across multiple sub-aspects.

💻 Installation

  1. Install uv by following the official installation guide. Ensure the PATH environment variable is configured as prompted.
  2. Install dependencies:
    # Install the base dependencies
    uv sync
    # Optionally add the flash-attn extra for faster attention kernels
    uv sync --extra flash-attn
📱 Mobile Setup

A. ADB Setup and Configuration

Reference: Mobile-Agent-E Repository

  1. Install Android Debug Bridge (ADB)

  2. Enable Developer Options

    • Go to Settings → About phone
    • Tap "MIUI version" repeatedly until developer options are enabled (Xiaomi, for example; other vendors label this entry "Build number")
    • Navigate to Settings → Additional Settings → Developer options
  3. Enable USB Debugging

    • Enable "USB debugging" in Developer options
    • Connect phone via USB cable
    • Select "File Transfer" mode when prompted
  4. Verify ADB Connection

    # Check connected devices
    adb devices
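The `adb devices` output can also be checked programmatically; below is a minimal sketch (the `parse_adb_devices` helper is our illustration, not part of this repo) that parses the tab-separated device list:

```python
def parse_adb_devices(output: str) -> list[tuple[str, str]]:
    """Parse `adb devices` output into (serial, state) pairs.

    The output is a header line ("List of devices attached") followed
    by one "serial<TAB>state" line per device.
    """
    pairs = []
    for line in output.splitlines()[1:]:        # skip the header line
        parts = line.strip().split("\t")
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))  # (serial, state)
    return pairs

sample = "List of devices attached\nemulator-5554\tdevice\nabc123\tunauthorized\n"
print(parse_adb_devices(sample))  # [('emulator-5554', 'device'), ('abc123', 'unauthorized')]
```

On a real machine, feed it `subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout`; every serial should report the state `device`, not `unauthorized` or `offline`.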

B. Task Preconditions

  1. Modify the scripts/mobile/adb.sh script for device setup
    • Script functions: (a) unlock the device; (b) return to the home screen
    • Must be executed before each task
    • Customize it according to your device specifications
  2. Update ANDROID_SERIAL in scripts/mobile/run_task.sh to match your device
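A minimal sketch of what scripts/mobile/adb.sh might contain (the key events are standard Android input commands, but the swipe coordinates are placeholders and assume swipe-to-unlock with no PIN; `DRY_RUN` and `adb_cmd` are our illustrative additions that print each command instead of executing it, so the sequence can be verified without a connected device):

```shell
#!/usr/bin/env bash
# Sketch of scripts/mobile/adb.sh -- adjust for your device.
# With DRY_RUN=1 (the default here), commands are printed, not executed.
set -eu
DRY_RUN="${DRY_RUN:-1}"

adb_cmd() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "adb -s ${ANDROID_SERIAL:-<serial>} $*"
  else
    adb -s "$ANDROID_SERIAL" "$@"
  fi
}

adb_cmd shell input keyevent KEYCODE_WAKEUP     # wake the screen
adb_cmd shell input swipe 500 1600 500 400 300  # swipe up to unlock (no PIN)
adb_cmd shell input keyevent KEYCODE_HOME       # return to the home screen
```

Set `DRY_RUN=0` (with `ANDROID_SERIAL` exported) once the printed sequence matches what your device needs.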

Our experimental device and operating system were as follows: (a) Device: Redmi Note 13 Pro; (b) Operating System: Xiaomi HyperOS 2.0.6.0.

🌐 Website Setup

A. Task Preconditions

Since many tasks require a login to function properly, we provide cookie-loading functionality so the agent can work correctly. Run the following command on a machine with a graphical web interface, perform your login in the browser window that opens, then close it to save the cookies.

python src/scene/web/load_cookies.py

Then save the generated *.json files to src/scene/web/cookies.
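The saved files can be inspected or filtered before use; here is a minimal sketch (`load_valid_cookies` is our illustrative helper, and the `expiry`/`name` fields assume the common browser cookie-dump format, not necessarily the exact schema load_cookies.py writes):

```python
import json
import time
from pathlib import Path

def load_valid_cookies(cookie_dir: str) -> list[dict]:
    """Load cookies from every *.json file in cookie_dir, dropping
    entries whose 'expiry' timestamp is already in the past."""
    now = time.time()
    cookies = []
    for path in sorted(Path(cookie_dir).glob("*.json")):
        for cookie in json.loads(path.read_text()):
            # Cookies without an 'expiry' field are session cookies; keep them.
            if cookie.get("expiry", now + 1) > now:
                cookies.append(cookie)
    return cookies

# Demo with a throwaway directory (a real run would point at src/scene/web/cookies)
import tempfile
with tempfile.TemporaryDirectory() as d:
    Path(d, "demo.json").write_text(json.dumps([
        {"name": "sessionid", "value": "abc", "expiry": time.time() + 3600},
        {"name": "stale", "value": "x", "expiry": 1},
    ]))
    print([c["name"] for c in load_valid_cookies(d)])  # ['sessionid']
```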

🌟 Quick Start

  1. Configure environment variables
    cp .env.template .env
  2. Activate the virtual environment
    source .venv/bin/activate
  3. Execute the main tasks
    bash scripts/mobile/run_task.sh
    bash scripts/web/run_task.sh
  4. Run the evaluation
    bash scripts/mobile/eval.sh
    bash scripts/web/eval.sh

🚀 Supported Models

The following models are supported:

  • gpt-4o-2024-11-20
  • gpt-4-turbo
  • gemini-2.0-flash
  • gemini-2.0-pro-exp-02-05
  • claude-3-7-sonnet-20250219
  • llava-hf/llava-v1.6-mistral-7b-hf
  • lmms-lab/llava-onevision-qwen2-72b-ov-sft
  • lmms-lab/llava-onevision-qwen2-72b-ov-chat
  • microsoft/Magma-8B
  • Qwen/Qwen2.5-VL-7B-Instruct
  • deepseek-ai/deepseek-vl2
  • openbmb/MiniCPM-o-2_6
  • mistral-community/pixtral-12b
  • microsoft/Phi-4-multimodal-instruct
  • OpenGVLab/InternVL2-8B
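The list mixes hosted API models with Hugging Face checkpoints (the org/name entries); a simple heuristic (purely our illustration, not the repo's actual dispatch code) tells them apart:

```python
def backend_for(model_id: str) -> str:
    """Guess how a model id is served: Hugging Face checkpoints use
    'org/name' paths; the remaining entries are hosted API models."""
    return "huggingface" if "/" in model_id else "api"

print(backend_for("gpt-4o-2024-11-20"))            # api
print(backend_for("Qwen/Qwen2.5-VL-7B-Instruct"))  # huggingface
```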

📋 Task Overview

Task List

Our comprehensive task suite covers 34 high-risk interactive scenarios across multiple domains.

🏆 Results


Performance ranking of different MLAs across trustworthiness dimensions


🤝 Acknowledgement

We acknowledge and thank the projects Mobile-Agent-E and SeeAct, whose foundational work has supported the development of this project.

📞 Contact

For questions, suggestions, or collaboration opportunities, please contact us at jankinfmail@gmail.com, 52285904015@stu.ecnu.edu.cn, or yangxiao19@tsinghua.org.cn.

🌟 Citation

If you find this work useful, please consider citing our paper:

@article{yang2025mla,
  title={MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments},
  author={Yang, Xiao and Chen, Jiawei and Luo, Jun and Fang, Zhengwei and Dong, Yinpeng and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2506.01616},
  year={2025}
}
