Visual Agent — Coordinate-Based Delegation #15962

@gsquared94

Summary

Implement a visual sub-loop that uses the Gemini Computer Use model for screenshot-based interaction.

Description

When the semantic agent cannot accomplish a task through the accessibility (AX) tree, it delegates to the visual agent:

delegate_to_visual_agent({ instruction: 'Click the blue submit button' });

The visual sub-loop:

  1. Capture screenshot via Playwright
  2. Send to gemini-2.5-computer-use-preview model
  3. Execute visual tools: click_at(x, y), type_text_at, drag_and_drop, scroll_document
  4. Capture new screenshot for feedback
  5. Repeat until complete (max 5 steps)
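
The loop above could be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: `VisualAction`, `VisualDeps`, `VisualResult`, and `runVisualSubLoop` are hypothetical names, and the model call is hidden behind an injected `nextAction` function because the issue does not specify the client API. Only two of the four visual tools are modelled, to keep the sketch short.

```typescript
// Hypothetical action shape returned by the model for one step.
type VisualAction =
  | { kind: "click_at"; x: number; y: number }
  | { kind: "type_text_at"; x: number; y: number; text: string }
  | { kind: "done"; summary: string };

// Injected dependencies (assumed): keeps the loop independent of Playwright
// and of the model client, and keeps the model conversation separate.
interface VisualDeps {
  captureScreenshot(): Promise<Uint8Array>; // e.g. a wrapper around page.screenshot()
  nextAction(
    instruction: string,
    screenshot: Uint8Array,
    history: VisualAction[],
  ): Promise<VisualAction>;
  execute(action: VisualAction): Promise<void>;
}

interface VisualResult {
  completed: boolean;
  steps: number; // number of actions actually executed
  summary: string;
}

const MAX_STEPS = 5; // step limit taken from the issue text

async function runVisualSubLoop(
  instruction: string,
  deps: VisualDeps,
  maxSteps: number = MAX_STEPS,
): Promise<VisualResult> {
  const history: VisualAction[] = [];
  for (let step = 0; step < maxSteps; step++) {
    // A fresh screenshot each iteration doubles as feedback for the previous action.
    const screenshot = await deps.captureScreenshot();
    const action = await deps.nextAction(instruction, screenshot, history);
    if (action.kind === "done") {
      return { completed: true, steps: step, summary: action.summary };
    }
    await deps.execute(action);
    history.push(action);
  }
  return { completed: false, steps: maxSteps, summary: "step limit reached" };
}
```

Returning `completed: false` when the limit is hit, rather than throwing, lets the semantic agent decide how to recover; that choice is an assumption of this sketch.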

Visual tools use Playwright's page.mouse and page.keyboard APIs directly, not MCP.
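
As a rough sketch of how the four tools might map onto those APIs: the `PageLike` interface below is a hypothetical structural subset of Playwright's `Page` (so the example compiles without Playwright installed), and every parameter list except `click_at(x, y)` is an assumption, since the issue does not spell out the other signatures.

```typescript
// Hypothetical minimal subset of Playwright's Page; a real Page object
// satisfies this structurally (mouse.click, move, down, up, wheel; keyboard.type).
interface PageLike {
  mouse: {
    click(x: number, y: number): Promise<void>;
    move(x: number, y: number): Promise<void>;
    down(): Promise<void>;
    up(): Promise<void>;
    wheel(deltaX: number, deltaY: number): Promise<void>;
  };
  keyboard: {
    type(text: string): Promise<void>;
  };
}

// click_at(x, y): click at viewport pixel coordinates.
async function clickAt(page: PageLike, x: number, y: number): Promise<void> {
  await page.mouse.click(x, y);
}

// type_text_at: focus the target by clicking it, then type (assumed behavior).
async function typeTextAt(
  page: PageLike,
  x: number,
  y: number,
  text: string,
): Promise<void> {
  await page.mouse.click(x, y);
  await page.keyboard.type(text);
}

// drag_and_drop: press at the source point, release at the destination point.
async function dragAndDrop(
  page: PageLike,
  fromX: number,
  fromY: number,
  toX: number,
  toY: number,
): Promise<void> {
  await page.mouse.move(fromX, fromY);
  await page.mouse.down();
  await page.mouse.move(toX, toY);
  await page.mouse.up();
}

// scroll_document: vertical wheel scroll by deltaY pixels (positive scrolls down).
async function scrollDocument(page: PageLike, deltaY: number): Promise<void> {
  await page.mouse.wheel(0, deltaY);
}
```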

Acceptance Criteria

  • Visual sub-loop with separate model conversation
  • Screenshot capture with consistent coordinate system
  • Visual tool execution via Playwright
  • MCP cache invalidation after visual actions (UIDs become stale)
  • Max steps limit
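
Two of these criteria, the consistent coordinate system and cache invalidation, might look something like the sketch below. It assumes the model reports coordinates on a normalized grid that must be scaled to the viewport; the grid size of 1000 is an assumption to verify against the model documentation. `SnapshotCache` and `withInvalidation` are hypothetical stand-ins for whatever holds the MCP snapshot and its UIDs.

```typescript
interface Viewport {
  width: number;
  height: number;
}

// Assumed normalized grid size; confirm against the model's documentation.
const ASSUMED_GRID = 1000;

// Scale a normalized model coordinate to viewport pixels, clamped so the
// result always lands inside the viewport.
function toViewportPixels(
  nx: number,
  ny: number,
  viewport: Viewport,
  grid: number = ASSUMED_GRID,
): { x: number; y: number } {
  const clamp = (v: number, size: number) => Math.min(Math.max(v, 0), size - 1);
  return {
    x: clamp(Math.round((nx / grid) * viewport.width), viewport.width),
    y: clamp(Math.round((ny / grid) * viewport.height), viewport.height),
  };
}

// Hypothetical cache for the semantic agent's last snapshot.
class SnapshotCache {
  private snapshot: string | null = null;

  get(): string | null {
    return this.snapshot;
  }

  set(snapshot: string): void {
    this.snapshot = snapshot;
  }

  invalidate(): void {
    this.snapshot = null;
  }
}

// Run a visual action and always invalidate the cache afterwards, even if
// the action throws, because the page may have changed and UIDs may be stale.
async function withInvalidation<T>(
  cache: SnapshotCache,
  action: () => Promise<T>,
): Promise<T> {
  try {
    return await action();
  } finally {
    cache.invalidate();
  }
}
```

Capturing screenshots at the same viewport size that is passed to `toViewportPixels` is what keeps the coordinate system consistent between what the model sees and where Playwright clicks.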

Metadata

Assignees

No one assigned

    Labels

    area/agent: Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality
    kind/enhancement
    priority/p2: Important but can be addressed in a future release.
    status/bot-triaged
