Open
Labels
area/agent (Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality), kind/enhancement, priority/p2 (Important but can be addressed in a future release.), status/bot-triaged
Description
Visual Agent — Coordinate-Based Delegation
Summary
Implement a visual sub-loop that uses the Gemini Computer Use model for screenshot-based interaction.
Description
When the semantic agent can't accomplish a task through the AX Tree, it delegates:
delegate_to_visual_agent({ instruction: 'Click the blue submit button' });
The visual sub-loop:
- Capture screenshot via Playwright
- Send to gemini-2.5-computer-use-preview model
- Execute visual tools: click_at(x, y), type_text_at, drag_and_drop, scroll_document
- Capture new screenshot for feedback
- Repeat until complete (max 5 steps)
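The steps above could be sketched roughly as follows. This is only an illustration, not the proposed implementation: the VisualLoopDeps interface, the ModelTurn shape, the action argument names, and runVisualLoop itself are all hypothetical. Only the tool names and the 5-step limit come from this issue. The model call and tool execution are injected so the loop can be exercised without a browser.

```typescript
// Hypothetical action shapes; only the tool names come from the issue.
type VisualAction =
  | { tool: 'click_at'; x: number; y: number }
  | { tool: 'type_text_at'; x: number; y: number; text: string }
  | { tool: 'drag_and_drop'; x: number; y: number; toX: number; toY: number }
  | { tool: 'scroll_document'; direction: 'up' | 'down' };

// One model turn: either another action to run, or a signal that the task is done.
type ModelTurn =
  | { kind: 'action'; action: VisualAction }
  | { kind: 'done'; summary: string };

// Injected dependencies (assumed names) so the loop stays independent of Playwright.
interface VisualLoopDeps {
  captureScreenshot(): Promise<Uint8Array>; // e.g. backed by page.screenshot()
  askModel(
    instruction: string,
    screenshot: Uint8Array,
    history: VisualAction[],
  ): Promise<ModelTurn>;
  execute(action: VisualAction): Promise<void>;
}

interface VisualLoopResult {
  completed: boolean;
  steps: number; // number of actions executed
  summary: string;
}

const MAX_VISUAL_STEPS = 5; // limit stated in the issue

async function runVisualLoop(
  instruction: string,
  deps: VisualLoopDeps,
  maxSteps: number = MAX_VISUAL_STEPS,
): Promise<VisualLoopResult> {
  const history: VisualAction[] = [];
  for (let step = 0; step < maxSteps; step++) {
    // A fresh screenshot each iteration is the feedback for the previous action.
    const screenshot = await deps.captureScreenshot();
    const turn = await deps.askModel(instruction, screenshot, history);
    if (turn.kind === 'done') {
      return { completed: true, steps: step, summary: turn.summary };
    }
    await deps.execute(turn.action);
    history.push(turn.action);
  }
  return {
    completed: false,
    steps: maxSteps,
    summary: `Stopped after ${maxSteps} steps without completion`,
  };
}
```

Returning a completed flag rather than throwing when the limit is reached lets the semantic agent decide how to report or retry.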
Visual tools use Playwright's page.mouse and page.keyboard APIs directly, not MCP.
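As a rough sketch of that mapping, a dispatcher could translate each visual tool into mouse and keyboard calls. The interfaces below are a structural subset of Playwright's page.mouse and page.keyboard (click, move, down, up, wheel, type), declared locally so the example stands alone. The argument names (toX, toY, direction), the scroll distance, and executeVisualTool are assumptions, not part of the issue.

```typescript
// Structural subset of Playwright's Mouse and Keyboard; a real Page satisfies it.
interface MouseLike {
  click(x: number, y: number): Promise<void>;
  move(x: number, y: number): Promise<void>;
  down(): Promise<void>;
  up(): Promise<void>;
  wheel(deltaX: number, deltaY: number): Promise<void>;
}
interface KeyboardLike {
  type(text: string): Promise<void>;
}
interface PageLike {
  mouse: MouseLike;
  keyboard: KeyboardLike;
}

// Hypothetical argument shapes; only the tool names come from the issue.
type VisualToolCall =
  | { tool: 'click_at'; x: number; y: number }
  | { tool: 'type_text_at'; x: number; y: number; text: string }
  | { tool: 'drag_and_drop'; x: number; y: number; toX: number; toY: number }
  | { tool: 'scroll_document'; direction: 'up' | 'down' };

const SCROLL_STEP_PX = 600; // assumed scroll distance per call

async function executeVisualTool(page: PageLike, call: VisualToolCall): Promise<void> {
  switch (call.tool) {
    case 'click_at':
      await page.mouse.click(call.x, call.y);
      return;
    case 'type_text_at':
      // Focus the target by clicking it, then type into the focused element.
      await page.mouse.click(call.x, call.y);
      await page.keyboard.type(call.text);
      return;
    case 'drag_and_drop':
      // Press at the source, move to the destination, release.
      await page.mouse.move(call.x, call.y);
      await page.mouse.down();
      await page.mouse.move(call.toX, call.toY);
      await page.mouse.up();
      return;
    case 'scroll_document':
      // Positive deltaY scrolls down, negative scrolls up.
      await page.mouse.wheel(0, call.direction === 'down' ? SCROLL_STEP_PX : -SCROLL_STEP_PX);
      return;
  }
}
```

With Playwright, a real page object could be passed directly as the first argument, since the calls used here are a subset of its mouse and keyboard methods.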
Acceptance Criteria
- Visual sub-loop with separate model conversation
- Screenshot capture with consistent coordinate system
- Visual tool execution via Playwright
- MCP cache invalidation after visual actions (UIDs become stale)
- Max steps limit
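Two of the criteria above, the consistent coordinate system and the cache invalidation, might look roughly like the sketch below. It assumes the model reports coordinates on a normalized 1000-unit grid that must be scaled to the viewport; that grid size should be confirmed against the model documentation. UidSnapshotCache, toViewportPoint, and runVisualAction are hypothetical names, and the real MCP cache may look quite different.

```typescript
interface Viewport { width: number; height: number; }
interface Point { x: number; y: number; }

const ASSUMED_GRID = 1000; // assumption: model coordinates are normalized to this grid

// Scale a model-space point to viewport pixels, clamped to the visible area.
function toViewportPoint(p: Point, viewport: Viewport, grid: number = ASSUMED_GRID): Point {
  const scale = (v: number, size: number): number =>
    Math.min(size - 1, Math.max(0, Math.round((v / grid) * size)));
  return { x: scale(p.x, viewport.width), y: scale(p.y, viewport.height) };
}

// Hypothetical stand-in for the cached AX Tree snapshot keyed by element UID.
class UidSnapshotCache {
  private snapshot: Map<string, string> | null = null;
  store(snapshot: Map<string, string>): void { this.snapshot = snapshot; }
  get(): Map<string, string> | null { return this.snapshot; }
  invalidate(): void { this.snapshot = null; }
}

// Run a visual action and always drop the cached snapshot afterwards,
// because any click, drag, or scroll may leave previously issued UIDs stale.
async function runVisualAction(
  cache: UidSnapshotCache,
  action: () => Promise<void>,
): Promise<void> {
  try {
    await action();
  } finally {
    cache.invalidate();
  }
}
```

Invalidating in a finally block is a deliberate choice in this sketch: an action that throws halfway through may still have changed the page.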