-
Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
- Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Erran Li
- 🏛️ Institutions: UCB, UIUC, Amazon
- 📅 Date: December 17, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [reinforcement learning], [skill discovery], [PAE]
- 📖 TLDR: This paper introduces the Proposer-Agent-Evaluator (PAE) system, enabling foundation model agents to autonomously discover and practice skills in real-world web environments. PAE comprises a context-aware task proposer, an agent policy for task execution, and a vision-language model-based success evaluator. Validated on vision-based web navigation tasks, PAE significantly enhances zero-shot generalization capabilities of vision-language model Internet agents, achieving over 30% relative improvement on unseen tasks and websites, and surpassing state-of-the-art open-source agents by more than 10%.
-
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining
- Zhiqi Ge, Juncheng Li, Xinglei Pang, Minghe Gao, Kaihang Pan, Wang Lin, Hao Fei, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
- 🏛️ Institutions: Zhejiang University, NUS
- 📅 Date: December 13, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [Information-Sensitive Cropping], [Self-Refining Dual Learning], [visual grounding], [model]
- 📖 TLDR: This paper introduces Iris, a visual agent designed to enhance GUI automation by addressing challenges in high-resolution, complex digital environments. It employs two key innovations: Information-Sensitive Cropping (ISC), which dynamically identifies and prioritizes visually dense regions using an edge detection algorithm for efficient processing, and Self-Refining Dual Learning (SRDL), which enhances the agent's ability to handle complex tasks through a dual-learning loop that iteratively refines its performance without requiring additional annotated data. Empirical evaluations demonstrate that Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations, outperforming methods using ten times more training data.
-
AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations
- Gaurav Verma, Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Tucker Balch, Manuela Veloso
- 🏛️ Institutions: J.P. Morgan AI Research
- 📅 Date: November 24, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [few-shot learning], [meta-learning], [AdaptAgent]
- 📖 TLDR: This paper introduces AdaptAgent, a framework that enables multimodal web agents to adapt to new websites and domains using few human demonstrations (up to 2). The approach enhances agents' adaptability beyond large-scale pre-training and fine-tuning by leveraging in-context learning and meta-learning techniques. Experiments on benchmarks like Mind2Web and VisualWebArena show that incorporating minimal human demonstrations boosts task success rates significantly, highlighting the effectiveness of multimodal demonstrations over text-only ones and the impact of data selection strategies during meta-learning on agent generalization.
-
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
- Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, Yuxiao Dong
- 🏛️ Institutions: Tsinghua University, BAAI
- 📅 Date: November 4, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [reinforcement learning], [self-evolving curriculum], [WebRL], [outcome-supervised reward model]
- 📖 TLDR: This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open large language models (LLMs). WebRL addresses challenges such as the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. It incorporates a self-evolving curriculum that generates new tasks from unsuccessful attempts, a robust outcome-supervised reward model (ORM), and adaptive reinforcement learning strategies to ensure consistent improvements. Applied to Llama-3.1 and GLM-4 models, WebRL significantly enhances their performance on web-based tasks, surpassing existing state-of-the-art web agents.
-
- Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Hongming Zhang, Tianqing Fang, Zhenzhong Lan, Dong Yu
- 🏛️ Institutions: Zhejiang University, Tencent AI Lab, Westlake University
- 📅 Date: October 25, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [learning], [imitation learning], [exploration], [AI feedback]
- 📖 TLDR: The paper presents OpenWebVoyager, an open-source framework for training web agents that explore real-world online environments autonomously. The framework employs a cycle of exploration, feedback, and optimization, enhancing agent capabilities through multimodal perception and iterative learning. Initial skills are acquired through imitation learning, followed by real-world exploration, where the agent’s performance is evaluated and refined through feedback loops.
-
AutoGLM: Autonomous Foundation Agents for GUIs
- Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, Jie Tang
- 🏛️ Institutions: Zhipu AI, Tsinghua University
- 📅 Date: October 25, 2024
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [framework], [model], [learning], [AutoGLM]
- 📖 TLDR: This paper introduces AutoGLM, a new series in the ChatGLM family, designed as foundation agents for autonomous control of digital devices through GUIs. It addresses the challenges foundation models face in decision-making within dynamic environments by developing agents capable of learning through autonomous interactions. Focusing on web browsers and Android devices, AutoGLM integrates various techniques to create deployable agent systems. Key insights include the importance of designing an appropriate "intermediate interface" for GUI control and a novel progressive training framework for self-evolving online curriculum reinforcement learning. Evaluations demonstrate AutoGLM's effectiveness across multiple domains, achieving notable success rates in web browsing and Android device control tasks.
-
AutoWebGLM: A Large Language Model-based Web Navigating Agent
- Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang
- 🏛️ Institutions: THU, OSU
- 📅 Date: October 12, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [benchmark], [reinforcement learning]
- 📖 TLDR: AutoWebGLM introduces a web navigation agent based on ChatGLM3-6B, designed to autonomously navigate and interact with webpages for complex tasks. The paper highlights a two-phase data construction approach using a hybrid human-AI methodology for diverse, curriculum-based web task training. It also presents AutoWebBench, a benchmark for evaluating agent performance in web tasks, and uses reinforcement learning to fine-tune operations, addressing complex webpage interaction and grounding.
-
ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning
- Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu
- 🏛️ Institutions: Columbia Univ., MSR
- 📅 Date: Oct 2, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [learning], [R-MCTS], [Exploratory Learning], [VisualWebArena]
- 📖 TLDR: This paper introduces ExACT, an approach that combines Reflective Monte Carlo Tree Search (R-MCTS) and Exploratory Learning to enhance AI agents' exploration and decision-making capabilities in complex web environments. R-MCTS incorporates contrastive reflection and multi-agent debate for improved search efficiency and reliable state evaluation. Evaluated on the VisualWebArena benchmark, the GPT-4o-based R-MCTS agent demonstrates significant performance improvements over previous state-of-the-art methods. Additionally, knowledge gained from test-time search is effectively transferred back to GPT-4o through fine-tuning, enabling the model to explore, evaluate, and backtrack without external search algorithms, achieving 87% of R-MCTS's performance with reduced computational resources.
-
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
- 🏛️ Institutions: MultiOn, Stanford
- 📅 Date: August 13, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [MCTS], [Tree Search], [DPO], [Reinforcement Learning]
- 📖 TLDR: TBD
-
VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought
- Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
- 🏛️ Institutions: CMU, Google DeepMind
- 📅 Date: June 20, 2024
- 📑 Publisher: NeurIPS 2024
- 💻 Env: [GUI]
- 🔑 Key: [framework], [memory], [in-context learning], [ICAL]
- 📖 TLDR: This paper introduces In-Context Abstraction Learning (ICAL), a method enabling Vision-Language Models (VLMs) to generate their own examples from sub-optimal demonstrations and human feedback. By abstracting trajectories into generalized programs of thought, ICAL enhances decision-making in retrieval-augmented LLM and VLM agents, reducing reliance on manual prompt engineering and improving performance across various tasks.
-
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
- Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar
- 🏛️ Institutions: UC Berkeley, UIUC, Google DeepMind
- 📅 Date: June 14, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [reinforcement learning], [DigiRL]
- 📖 TLDR: The authors present DigiRL, an autonomous reinforcement learning approach for training device-control agents. By fine-tuning a pre-trained vision-language model in two stages—offline and offline-to-online RL—DigiRL achieves a significant improvement in success rates on the Android-in-the-Wild dataset, establishing a new state-of-the-art for digital agents in device control.
-
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
- Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva
- 🏛️ Institutions: Google DeepMind, Google
- 📅 Date: May 23, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [benchmark], [Android-based agents], [task diversity], [reinforcement learning], [dynamic environment]
- 📖 TLDR: AndroidWorld introduces a dynamic Android environment for benchmarking autonomous agents across 116 tasks spanning 20 Android apps. These tasks vary through parameterized and natural language prompts, fostering a realistic testing ground for agents designed to operate in complex mobile environments. The benchmark supports millions of task variations, allowing agents to respond to the Android system's changing states and improving real-world applicability.
-
- Lucas-Andreï Thil, Mirela Popa, Gerasimos Spanakis
- 🏛️ Institutions: Maastricht University the Netherlands
- 📅 Date: May 1, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [large language models], [reinforcement learning]
- 📖 TLDR: This paper proposes a novel approach combining supervised learning (SL) and reinforcement learning (RL) techniques to train web navigation agents using large language models. The authors address limitations in previous models' understanding of HTML content and introduce methods to enhance true comprehension. Their approach, evaluated on the MiniWoB benchmark, outperforms previous SL methods on certain tasks using less data and narrows the performance gap with RL models. The study achieves 43.58% average accuracy in SL and 36.69% when combined with a multimodal RL approach, setting a new direction for future web navigation research.
-
- Moghis Fereidouni, A.B. Siddique
- 🏛️ Institutions: University of Kentucky
- 📅 Date: April 16, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Web]
- 🔑 Key: [framework], [reinforcement learning], [grounded language agent], [Flan-T5], [unsupervised domain adaptation]
- 📖 TLDR: This paper introduces GLAINTEL, a grounded language agent framework designed to enhance web interaction using instruction-finetuned language models, particularly Flan-T5, with reinforcement learning (PPO) to tackle interactive web navigation challenges. The study explores unsupervised and supervised training methods, evaluating the effects of human demonstration on agent performance. Results indicate that combining human feedback with reinforcement learning yields effective outcomes, rivaling larger models like GPT-4 on web navigation tasks.
-
AgentStudio: A Toolkit for Building General Virtual Agents
- Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan
- 🏛️ Institutions: NTU, Skywork AI, ETH Zurich
- 📅 Date: March 26, 2024
- 📑 Publisher: arXiv
- 💻 Env: [Desktop]
- 🔑 Key: [framework], [dataset], [general virtual agents], [open-ended learning], [tool creation], [GroundUI], [benchmark]
- 📖 TLDR: AgentStudio is a robust toolkit for developing virtual agents with versatile actions, such as GUI automation and code execution. It unifies real-world human-computer interactions across OS platforms and includes diverse observation and action spaces, facilitating comprehensive training and benchmarking in complex settings. The toolkit's flexibility promotes agent generalization across varied tasks, supporting tool creation and a multimodal interaction interface to advance agent adaptability and learning.
-
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
- Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, Lingpeng Kong
- 🏛️ Institutions: Shanghai AI Lab, East China Normal University, Princeton University, University of Hong Kong
- 📅 Date: February 12, 2024
- 📑 Publisher: ICLR 2024 Workshop LLMAgents
- 💻 Env: [Desktop]
- 🔑 Key: [framework], [self-directed learning], [GAIA], [FRIDAY], [OS-Copilot]
- 📖 TLDR: The OS-Copilot framework supports building generalist agents capable of performing diverse tasks across an operating system (OS). This work introduces FRIDAY, an embodied agent using OS-Copilot to self-improve by learning from task outcomes. It operates with a memory-based architecture to tackle OS-level tasks across applications like terminals, web browsers, and third-party tools. Tested on the GAIA benchmark, FRIDAY achieved 35% higher performance than prior methods, proving effective in adapting to unfamiliar applications and refining its capabilities with minimal guidance.
-
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API
- Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, Yan Lu
- 🏛️ Institutions: MSRA
- 📅 Date: October 7, 2023
- 📑 Publisher: arXiv
- 💻 Env: [GUI]
- 🔑 Key: [model], [framework], [reinforcement learning], [UI task automation], [instruction grounding]
- 📖 TLDR: This paper introduces a multimodal model, termed RUIG (Reinforced UI Instruction Grounding), for automating UI tasks through natural language instructions. By leveraging a pixel-to-sequence approach, the model directly decodes UI element locations from screenshots based on user commands, removing the need for metadata like element coordinates. The framework uses a transformer-based encoder-decoder setup optimized through reinforcement learning to improve spatial accuracy. This novel approach outperforms prior methods, offering a generalized solution for UI task automation.
-
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
- Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust
- 🏛️ Institutions: Google DeepMind, The University of Tokyo
- 📅 Date: July 2023
- 📑 Publisher: ICLR 2024
- 💻 Env: [Web]
- 🔑 Key: [framework], [program synthesis], [HTML comprehension], [web automation], [self-supervised learning]
- 📖 TLDR: WebAgent leverages two LLMs—HTML-T5 for HTML comprehension and Flan-U-PaLM for program synthesis—to complete web automation tasks. It combines planning, HTML summarization, and code generation to navigate and interact with real-world web environments, improving success rates on HTML-based tasks and achieving state-of-the-art performance in benchmarks like MiniWoB and Mind2Web. The modular architecture adapts well to open-domain tasks, using local-global attention mechanisms to manage long HTML contexts.
-
Augmenting Autotelic Agents with Large Language Models
- Cédric Colas, Laetitia Teodorescu, Pierre-Yves Oudeyer, Xingdi Yuan, Marc-Alexandre Côté
- 🏛️ Institutions: MIT, Inria, Microsoft
- 📅 Date: May 22, 2023
- 📑 Publisher: CoLLAs 2023
- 💻 Env: [GUI]
- 🔑 Key: [framework], [reinforcement learning], [goal generation], [large language models], [autotelic learning]
- 📖 TLDR: This study introduces the Language Model-Augmented Autotelic Agent (LMA3), a framework leveraging large language models to help agents autonomously generate, represent, and learn diverse goals in a task-agnostic, text-based environment. LMA3 integrates pretrained language models to emulate human cultural knowledge, aiming to dynamically relabel goals, generate new goals, and create goal-driven reward functions without manual inputs. This approach supports skill development by autonomously expanding goal repertoires in ways that resemble human open-ended learning, showcasing potential for achieving complex, self-directed learning in AI.
-
Reflexion: Language Agents with Verbal Reinforcement Learning
- Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao
- 🏛️ Institutions: Northeastern University, MIT, Princeton University
- 📅 Date: March 20, 2023
- 📑 Publisher: NeurIPS 2023
- 💻 Env: [Misc]
- 🔑 Key: [framework], [learning], [verbal reinforcement learning], [Reflexion]
- 📖 TLDR: This paper introduces Reflexion, a framework that enhances language agents by enabling them to reflect on task feedback linguistically, storing these reflections in an episodic memory to improve decision-making in future trials. Reflexion allows agents to learn from various feedback types without traditional weight updates, achieving significant performance improvements across tasks like decision-making, coding, and reasoning. For instance, Reflexion attains a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4's 80%.
-
A Data-Driven Approach for Learning to Control Computers
- Peter C. Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Alex Goldin, Adam Santoro, Timothy Lillicrap
- 🏛️ Institutions: DeepMind
- 📅 Date: February 16, 2022
- 📑 Publisher: ICML 2022
- 💻 Env: [Desktop]
- 🔑 Key: [dataset], [framework], [computer control], [reinforcement learning], [multimodal transformer]
- 📖 TLDR: This study presents a reinforcement learning-based approach to train agents for computer control tasks, using keyboard and mouse interactions guided by natural language. By leveraging human demonstration data, agents trained in this environment achieved strong cross-task generalization across the MiniWob++ benchmark. This framework demonstrates how agents can control computers as humans would, enabling enhanced performance in complex computer tasks with high transferability.
-
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
- Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, Yang Li
- 🏛️ Institutions: University of Toronto
- 📅 Date: August 6, 2021
- 📑 Publisher: UIST 2021
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [dataset], [mobile UI summarization], [multimodal learning], [Transformer model]
- 📖 TLDR: The paper introduces Screen2Words, an approach that utilizes multimodal learning to generate descriptive language summaries for mobile UI screens, combining textual, visual, and structural data from screens. The study created a large-scale dataset with 112,085 annotated screen summaries for 22,417 unique UIs, aiming to support model training for mobile UI understanding. The dataset facilitates a Transformer-based model trained to summarize screens by highlighting main functionalities, and the approach is validated with benchmarks in the mobile environment.
-
UIBert: Learning Generic Multimodal Representations for UI Understanding
- Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, Blaise Agüera y Arcas
- 🏛️ Institutions: Google Research
- 📅 Date: July 29, 2021
- 📑 Publisher: IJCAI 2021
- 💻 Env: [Mobile]
- 🔑 Key: [framework], [model], [dataset], [multimodal representation learning], [UI understanding]
- 📖 TLDR: This paper presents UIBert, a multimodal model aimed at understanding user interfaces (UIs) by combining visual, textual, and structural metadata. UIBert is designed for tasks such as component retrieval and expression resolution, using a transformer-based joint image-text model. The authors introduce five novel pre-training tasks to leverage UI-specific features, enhancing accessibility and task completion in mobile applications. UIBert demonstrates superior performance on nine downstream UI tasks, highlighting the potential of multimodal pre-training in UI understanding.
-
AndroidEnv: A Reinforcement Learning Platform for Android
- Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, Doina Precup
- 🏛️ Institutions: DeepMind
- 📅 Date: May 27, 2021
- 📑 Publisher: arXiv
- 💻 Env: [Mobile]
- 🔑 Key: [reinforcement learning], [Android interface], [RL environment], [task flexibility], [touchscreen action space]
- 📖 TLDR: AndroidEnv provides a reinforcement learning (RL) platform for Android that lets RL agents interact with a realistic Android simulation via touchscreen events. The platform supports diverse applications, enabling agents to interact with over 100 predefined tasks across a variety of apps. With hybrid continuous and discrete action spaces, AndroidEnv is well-suited for training agents in complex, real-world Android scenarios where actions must be contextually sequenced, such as in UI navigation, gaming, and productivity apps. This environment encourages further RL research by offering task flexibility and realistic Android emulation.
-
Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration
- Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, Percy Liang
- 🏛️ Institutions: Stanford
- 📅 Date: February 24, 2018
- 📑 Publisher: ICLR 2018
- 💻 Env: [Web]
- 🔑 Key: [framework], [benchmark], [reinforcement learning], [web tasks], [workflow-guided exploration]
- 📖 TLDR: This paper presents a novel RL approach using workflow-guided exploration to efficiently train agents on web-based tasks, where actions are restricted based on demonstrated workflows to streamline learning. Evaluated on MiniWoB and MiniWoB++ benchmarks, the method significantly outperforms traditional RL techniques in sparse reward settings by structuring exploration according to high-level action constraints.
-
World of Bits: An Open-Domain Platform for Web-Based Agents
- Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, Percy Liang
- 🏛️ Institutions: Stanford, OpenAI
- 📅 Date: August 2017
- 📑 Publisher: ICML 2017
- 💻 Env: [Web]
- 🔑 Key: [framework], [dataset], [reinforcement learning], [open-domain]
- 📖 TLDR: This paper introduces World of Bits (WoB), a platform enabling agents to perform complex web-based tasks using low-level keyboard and mouse actions, addressing the lack of open-domain realism in existing reinforcement learning environments. WoB leverages a novel framework where crowdworkers create tasks with structured rewards and reproducibility by caching web interactions, forming a stable training environment. The authors validate WoB by training agents via behavioral cloning and reinforcement learning to accomplish various real-world tasks, showcasing its potential as an effective platform for reinforcement learning on web tasks.