Papers with Keyword: evaluation

  • MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control

    • Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee
    • 🏛️ Institutions: KAIST, UT Austin
    • 📅 Date: October 23, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Mobile]
    • 🔑 Key: [benchmark], [safety], [evaluation], [Android emulator]
    • 📖 TLDR: MobileSafetyBench introduces a benchmark for evaluating the safety of large language model (LLM)-based autonomous agents in mobile device control. Using Android emulators, the benchmark simulates real-world tasks in apps such as messaging and banking to assess agents' safety and helpfulness. The safety-focused tasks test for privacy risk management and robustness against adversarial prompt injections. Experiments show agents perform well in helpful tasks but struggle with safety-related challenges, underscoring the need for continued advancements in mobile safety mechanisms for autonomous agents.
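
A minimal sketch of how a joint helpfulness-and-safety check of this kind could be scored. The task schema, field names, and checks below are illustrative assumptions, not MobileSafetyBench's actual format:

```python
# Hypothetical task record scored on both helpfulness and safety.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SafetyTask:
    instruction: str                          # user request shown to the agent
    goal_check: Callable[[dict], bool]        # did the agent complete the task?
    harm_check: Callable[[List[dict]], bool]  # did any action cross a safety line?

def score(task: SafetyTask, final_state: dict, actions: List[dict]) -> dict:
    return {
        "helpful": task.goal_check(final_state),
        "safe": not task.harm_check(actions),
    }

# Example: a messaging task where forwarding a one-time passcode would be unsafe.
task = SafetyTask(
    instruction="Reply 'on my way' to the latest message from Alex.",
    goal_check=lambda s: s.get("last_sent_text") == "on my way",
    harm_check=lambda acts: any("passcode" in a.get("text", "").lower() for a in acts),
)
print(score(task, {"last_sent_text": "on my way"}, [{"type": "tap"}]))
```
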
  • CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

    • Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li
    • 🏛️ Institutions: KAUST, UTokyo, CMU, Stanford, Harvard, Tsinghua University, SUSTech, Oxford
    • 📅 Date: July 3, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [GUI]
    • 🔑 Key: [benchmark], [framework], [evaluation], [CRAB]
    • 📖 TLDR: The authors present CRAB, a benchmark framework designed to evaluate Multimodal Language Model agents across multiple environments. It features a graph-based fine-grained evaluation method and supports automatic task generation, addressing limitations in existing benchmarks.
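
A rough sketch of the graph-based, fine-grained scoring idea: sub-goal checks form a DAG, and partial credit is the fraction of checks that pass once their prerequisites pass. The class, scoring rule, and example checks are assumptions, not CRAB's implementation:

```python
# Toy graph evaluator: nodes are sub-goal predicates, edges are prerequisites.
from typing import Callable, Dict, List

class GraphEvaluator:
    def __init__(self):
        self.checks: Dict[str, Callable[[dict], bool]] = {}
        self.edges: Dict[str, List[str]] = {}   # node -> prerequisite nodes

    def add_node(self, name, check, requires=()):
        self.checks[name] = check
        self.edges[name] = list(requires)

    def score(self, env_state: dict) -> float:
        passed, changed = set(), True
        # keep sweeping until no new node can pass (simple fixed point)
        while changed:
            changed = False
            for name, check in self.checks.items():
                if name in passed:
                    continue
                if all(p in passed for p in self.edges[name]) and check(env_state):
                    passed.add(name)
                    changed = True
        return len(passed) / max(len(self.checks), 1)

ev = GraphEvaluator()
ev.add_node("file_created", lambda s: "report.txt" in s["files"])
ev.add_node("file_has_text", lambda s: s["files"].get("report.txt") == "done",
            requires=["file_created"])
print(ev.score({"files": {"report.txt": "done"}}))   # 1.0
```

Fractional credit of this kind is what distinguishes a near-miss from a complete failure, which binary success rates cannot do.
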
  • Identifying User Goals from UI Trajectories

    • Omri Berkovitch, Sapir Caduri, Noam Kahlon, Anatoly Efros, Avi Caciularu, Ido Dagan
    • 🏛️ Institutions: Google Research, Bar-Ilan University
    • 📅 Date: June 20, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [GUI]
    • 🔑 Key: [evaluation metric], [intent identification]
    • 📖 TLDR: This paper introduces the task of goal identification from observed UI trajectories, aiming to infer the user's intended task based on their GUI interactions. It proposes a novel evaluation metric to assess whether two task descriptions are paraphrases within a specific UI environment. Experiments utilizing the Android-In-The-Wild and Mind2Web datasets reveal that state-of-the-art models, such as GPT-4 and Gemini-1.5 Pro, underperform compared to humans, indicating significant room for improvement.
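
A hedged sketch of the paraphrase-style check the metric implies: judging whether two task descriptions would lead to the same outcome within a given UI. `call_judge_model`, the prompt wording, and the YES/NO format are placeholders, not the paper's protocol:

```python
# Placeholder LLM-judge check for "same goal in this UI?"
def build_paraphrase_prompt(gold_goal: str, predicted_goal: str, ui_context: str) -> str:
    return (
        "You are judging task descriptions for a GUI agent.\n"
        f"UI context: {ui_context}\n"
        f"Description A: {gold_goal}\n"
        f"Description B: {predicted_goal}\n"
        "Within this UI, would carrying out A and B lead to the same outcome? "
        "Answer YES or NO."
    )

def is_paraphrase(gold_goal, predicted_goal, ui_context, call_judge_model) -> bool:
    answer = call_judge_model(build_paraphrase_prompt(gold_goal, predicted_goal, ui_context))
    return answer.strip().upper().startswith("YES")

# Usage with a stub judge:
print(is_paraphrase("Book a table for two tonight",
                    "Reserve a dinner table for 2 today",
                    "Restaurant booking app", lambda prompt: "YES"))
```
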
  • WebCanvas: Benchmarking Web Agents in Online Environments

    • Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu
    • 🏛️ Institutions: iMean AI, CMU
    • 📅 Date: June 18, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Web]
    • 🔑 Key: [framework], [dataset], [benchmark], [Mind2Web-Live], [key-node evaluation]
    • 📖 TLDR: This paper presents WebCanvas, an online evaluation framework for web agents designed to address the dynamic nature of web interactions. It introduces a key-node-based evaluation metric to capture critical actions or states necessary for task completion while disregarding noise from insignificant events or changed web elements. The framework includes the Mind2Web-Live dataset, a refined version of the original Mind2Web static dataset, containing 542 tasks with 2,439 intermediate evaluation states. Despite advancements, the best-performing model achieves a task success rate of 23.1%, highlighting substantial room for improvement.
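
A small sketch of key-node evaluation as described above: only a few critical states are checked, in order, and all other steps in the trace are ignored. The trace format and predicates are illustrative:

```python
# Key-node matching: credit is the fraction of key nodes reached, in order.
from typing import Callable, List

def key_node_progress(trace: List[dict], key_nodes: List[Callable[[dict], bool]]) -> float:
    """Return the fraction of key nodes reached, in order, somewhere in the trace."""
    idx = 0
    for step in trace:
        if idx < len(key_nodes) and key_nodes[idx](step):
            idx += 1
    return idx / len(key_nodes)

key_nodes = [
    lambda s: "search?q=laptop" in s["url"],        # reached the search results
    lambda s: s.get("clicked") == "add_to_cart",    # added an item to the cart
    lambda s: s["url"].endswith("/checkout"),       # reached checkout
]
trace = [
    {"url": "https://shop.example/search?q=laptop"},
    {"url": "https://shop.example/item/42", "clicked": "add_to_cart"},
    {"url": "https://shop.example/checkout"},
]
print(key_node_progress(trace, key_nodes))  # 1.0 -> task counted as completed
```

Because intermediate noise is ignored, the metric stays stable even when live websites change cosmetically between runs.
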
  • LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation

    • Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, Mengwei Xu
    • 🏛️ Institutions: BUPT, Tsinghua University
    • 📅 Date: April 12, 2024
    • 📑 Publisher: UIST 2024
    • 💻 Env: [Mobile]
    • 🔑 Key: [framework], [dataset], [benchmark], [UI automation], [mobile agent evaluation]
    • 📖 TLDR: LlamaTouch is an evaluation testbed designed for mobile UI automation, enabling reliable task assessment across 495 annotated tasks. It provides a scalable solution to evaluate agents in real-world mobile settings, comparing agent actions to essential UI states for accurate task completion. LlamaTouch supports dynamic environments, advancing mobile agent reliability and scalability in task automation.
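
A simplified sketch of essential-state matching in this spirit: an episode passes only if every annotated essential UI state is observed at some point. Representing a screen as a set of on-screen texts is a simplifying assumption, not LlamaTouch's actual matcher:

```python
# Unordered essential-state check over an episode of observed screens.
from typing import List, Set

def screen_matches(screen_texts: Set[str], required_texts: Set[str]) -> bool:
    """An essential state is matched if all of its required texts appear on screen."""
    return required_texts.issubset(screen_texts)

def episode_passes(episode_screens: List[Set[str]], essential_states: List[Set[str]]) -> bool:
    return all(any(screen_matches(screen, state) for screen in episode_screens)
               for state in essential_states)

essential_states = [{"Wi-Fi", "On"}, {"Saved networks"}]
episode_screens = [{"Settings", "Network"}, {"Wi-Fi", "On", "Saved networks"}]
print(episode_passes(episode_screens, essential_states))  # True
```
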
  • Autonomous Evaluation and Refinement of Digital Agents

    • Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr
    • 🏛️ Institutions: UCB, UMich
    • 📅 Date: April 9, 2024
    • 📑 Publisher: COLM 2024
    • 💻 Env: [GUI]
    • 🔑 Key: [framework], [benchmark], [evaluation model], [domain transfer]
    • 📖 TLDR: This paper presents an autonomous evaluation framework for digital agents to enhance performance on web navigation and device control. The study introduces modular, cost-effective evaluators achieving up to 92.9% accuracy in benchmarks like WebArena and outlines their use in fine-tuning agents, improving state-of-the-art by 29% without additional supervision.
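
A toy sketch of the evaluator-in-the-loop idea: an automatic evaluator scores each rollout, and its critique drives a retry (or, in the paper's setting, further fine-tuning). All function names here are placeholders:

```python
# Evaluate-then-refine loop driven by an automatic evaluator's verdict.
def evaluate_then_refine(run_agent, evaluator, task, max_attempts=3):
    feedback = None
    for attempt in range(max_attempts):
        trajectory = run_agent(task, feedback=feedback)
        verdict = evaluator(task, trajectory)   # e.g. {"success": bool, "critique": str}
        if verdict["success"]:
            return trajectory, attempt + 1
        feedback = verdict["critique"]          # fed back for the next attempt
    return None, max_attempts

# Stubbed usage: the agent succeeds on the second attempt once feedback arrives.
traj, attempts = evaluate_then_refine(
    run_agent=lambda task, feedback: {"ok": feedback is not None},
    evaluator=lambda task, t: {"success": t["ok"],
                               "critique": "Open the settings page first."},
    task="Turn on dark mode",
)
print(attempts)  # 2
```
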
  • VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    • Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried
    • 🏛️ Institutions: CMU
    • 📅 Date: January 24, 2024
    • 📑 Publisher: ACL 2024
    • 💻 Env: [Web]
    • 🔑 Key: [framework], [benchmark], [dataset], [multimodal agent evaluation], [visually grounded tasks]
    • 📖 TLDR: VisualWebArena is a benchmark designed for testing multimodal web agents on complex, visually grounded web tasks. It provides a reproducible framework with 910 task scenarios across real-world web applications, emphasizing open-ended, visually guided interactions. The tasks are modeled within a partially observable Markov decision process to assess agents’ capacity to interpret multimodal inputs, execute navigation, and accomplish user-defined objectives across complex visual and textual information on websites.
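
A small sketch of the execution-based reward such a setup implies: the episode ends with a functional check over the final observation (here, a URL pattern plus required page text). The helpers and reward rule are illustrative, not VisualWebArena's evaluators:

```python
# Functional end-of-episode reward over the final page state.
import re

def functional_reward(final_obs: dict, url_pattern: str, must_include: list) -> float:
    """1.0 if the final page satisfies the goal predicate, else 0.0."""
    url_ok = re.search(url_pattern, final_obs["url"]) is not None
    text_ok = all(phrase.lower() in final_obs["page_text"].lower()
                  for phrase in must_include)
    return 1.0 if (url_ok and text_ok) else 0.0

final_obs = {"url": "https://forum.example/post/123",
             "page_text": "Posted by agent_user: the red bicycle from the image"}
print(functional_reward(final_obs, r"/post/\d+", ["red bicycle"]))  # 1.0
```
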
  • WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

    • Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu
    • 🏛️ Institutions: Zhejiang University, Tencent AI Lab, Westlake University
    • 📅 Date: January 24, 2024
    • 📑 Publisher: ACL 2024
    • 💻 Env: [Web]
    • 🔑 Key: [benchmark], [evaluation]
    • 📖 TLDR: This paper introduces WebVoyager, an innovative web agent powered by Large Multimodal Models (LMMs) that can complete user instructions end-to-end by interacting with real-world websites. The authors establish a new benchmark with tasks from 15 popular websites and propose an automatic evaluation protocol using GPT-4V. WebVoyager achieves a 59.1% task success rate, significantly outperforming GPT-4 (All Tools) and text-only setups. The study demonstrates the effectiveness of multimodal approaches in web automation and provides insights into developing more intelligent web interaction solutions.
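
A hedged sketch of a screenshot-based judging protocol like the GPT-4V evaluation described above. `call_vision_judge`, the prompt, and the SUCCESS / NOT SUCCESS format are assumptions standing in for the actual protocol:

```python
# Placeholder multimodal judge over the agent's screenshots and final answer.
def build_judge_request(task: str, final_answer: str, screenshot_paths: list) -> dict:
    prompt = (
        f"Task: {task}\n"
        f"Agent's final answer: {final_answer}\n"
        "Based on the attached screenshots of the agent's browsing session, "
        "did the agent actually complete the task? Reply SUCCESS or NOT SUCCESS."
    )
    return {"prompt": prompt, "images": screenshot_paths}

def judge_trajectory(task, final_answer, screenshot_paths, call_vision_judge) -> bool:
    reply = call_vision_judge(build_judge_request(task, final_answer, screenshot_paths))
    return "NOT SUCCESS" not in reply.upper()

# Usage with a stub judge:
print(judge_trajectory("Find the cheapest flight to Tokyo on May 3",
                       "The cheapest flight is $612 on ExampleAir.",
                       ["step_01.png", "step_02.png"], lambda req: "SUCCESS"))
```
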
  • AgentBench: Evaluating LLMs as Agents

    • Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang
    • 🏛️ Institutions: THU, OSU, ByteDance
    • 📅 Date: January 1, 2024
    • 📑 Publisher: ICLR 2024
    • 💻 Env: [GUI], [General]
    • 🔑 Key: [benchmark], [evaluation]
    • 📖 TLDR: AgentBench provides a comprehensive benchmark for evaluating LLMs as autonomous agents in various environments. It includes eight distinct scenarios, testing the LLMs' reasoning and decision-making capabilities in tasks such as OS interaction, database querying, knowledge graph traversal, and more. This benchmark compares the effectiveness of multiple commercial and open-source LLMs, revealing areas of improvement in instruction-following and long-term reasoning, essential for practical agent development.
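
A minimal sketch of the kind of uniform environment interface that lets one agent loop drive very different scenarios (OS shell, database, knowledge graph). This interface is an assumption for illustration, not AgentBench's API:

```python
# Shared text-environment interface plus a generic episode runner.
from abc import ABC, abstractmethod

class AgentEnv(ABC):
    @abstractmethod
    def reset(self) -> str:
        """Return the initial observation / task prompt."""
    @abstractmethod
    def step(self, action: str) -> tuple:
        """Apply one agent action; return (observation, done)."""
    @abstractmethod
    def score(self) -> float:
        """Task-level score computed at the end of the episode."""

def run_episode(env: AgentEnv, agent, max_turns: int = 10) -> float:
    obs = env.reset()
    for _ in range(max_turns):
        obs, done = env.step(agent(obs))
        if done:
            break
    return env.score()

class EchoEnv(AgentEnv):
    """Toy scenario: solved as soon as the agent outputs 'done'."""
    def reset(self) -> str:
        self.solved = False
        return "Say 'done' to finish."
    def step(self, action: str) -> tuple:
        self.solved = action.strip() == "done"
        return "", self.solved
    def score(self) -> float:
        return 1.0 if self.solved else 0.0

print(run_episode(EchoEnv(), agent=lambda obs: "done"))  # 1.0
```
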