awesome-interface-agents

List of AI tools that can interact with user interfaces. PRs welcome.

Segmenters

Web browser

These tools are still mostly text-based.

Open Source

Closed Source

Models

  • Claude 3.5 Computer Use (Oct 2024): Version of the Claude 3.5 model with computer-use support, accepting structured text and image tool inputs and producing actionable text outputs.
  • Llama 3.2 (Sep 2024): The two largest models in the Llama 3.2 collection, 11B and 90B, support image reasoning use cases such as document-level understanding (including charts and graphs), image captioning, and visual grounding tasks like pinpointing objects in images from natural-language descriptions.
  • Molmo (Sep 2024): Open vision-language model (VLM) that matches GPT-4V performance and can point to locations in images.
  • CogAgent (Dec 2023): Open-source visual language model that can identify UI regions and points to interact with.
  • Florence 2 (Nov 2023): Vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks including producing bounding boxes.
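Grounding models such as Molmo and Florence 2 emit coordinates as text rather than acting directly, so an agent must map that output back to screen pixels before it can click. The following is a minimal sketch that assumes two output conventions; verify both against each model's own documentation before relying on them. The first is a Molmo-style `<point x=".." y="..">` tag with x and y given as percentages of the image size. The second is Florence-2-style `<loc_N>` tokens quantized into 1000 bins per axis. The helpers `point_to_pixels` and `box_center_to_pixels` are hypothetical and are not part of either model's API.

```python
import re
from typing import Optional, Tuple

# Assumed Molmo-style point tag, e.g. '<point x="61.5" y="40.6" alt="OK">OK</point>',
# where x and y are percentages (0-100) of the image width and height.
POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"')

# Assumed Florence-2-style location tokens: four <loc_N> tokens (x1, y1, x2, y2),
# each quantized into `bins` bins across the image dimension.
LOC_RE = re.compile(r"<loc_(\d+)>")


def point_to_pixels(text: str, width: int, height: int) -> Optional[Tuple[int, int]]:
    """Convert the first percentage-based point in `text` to pixel coordinates."""
    m = POINT_RE.search(text)
    if m is None:
        return None
    x_pct, y_pct = float(m.group(1)), float(m.group(2))
    return round(x_pct * width / 100), round(y_pct * height / 100)


def box_center_to_pixels(
    text: str, width: int, height: int, bins: int = 1000
) -> Optional[Tuple[int, int]]:
    """Convert the first four location tokens (a box) to the pixel center of that box."""
    locs = [int(v) for v in LOC_RE.findall(text)[:4]]
    if len(locs) < 4:
        return None
    x1, y1, x2, y2 = locs
    # Average the two corner bins, then shift by half a bin so that each
    # bin maps to its midpoint (an assumption about the dequantization step).
    cx = ((x1 + x2) / 2 + 0.5) * width / bins
    cy = ((y1 + y2) / 2 + 0.5) * height / bins
    return round(cx), round(cy)


if __name__ == "__main__":
    print(point_to_pixels('<point x="50.0" y="25.0" alt="Submit">Submit</point>', 1920, 1080))
    print(box_center_to_pixels("<loc_100><loc_200><loc_300><loc_400>", 2000, 2000))
```

On a 1920×1080 screenshot, the point example maps to pixel (960, 270). An agent would then pass that coordinate to whatever input layer it uses, such as a browser driver or an OS automation library.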

Complete solutions

Open Source

  • OpenAdapt.AI: AI-first process automation with large language (LLM), action (LAM), multimodal (LMM), and visual language (VLM) models.
  • Skyvern: Browser automation software.
  • ScreenAgent
  • Mobile-Agent
  • UI-ACT: An AI agent for interacting with a computer using the graphical user interface
  • OpenInterpreter: Writes and runs code to interact with the operating system.
  • AIOS: Can interact with the operating system as a backend.

Closed Source

  • Adept: Company working to automate user interface interaction through machine learning.

Papers