
Add behavioral evaluations for planning tools and workflow #17169

@jerop

Description


Add behavioral evals to verify that the agent adheres to its read-only tool restrictions. These tests validate that the model consistently refuses file modifications while in PLAN mode. Also include evals for the EnterPlanMode and ExitPlanMode tools.

This depends on the refactor described in #17168.

Evals for the AskUser tool are tracked in #17956.
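
To make the intended check concrete, here is a minimal sketch of how a plan-mode eval might score a recorded tool-call trace. Everything in it is an assumption for illustration: the `ToolCall` shape, the tool name strings (`enter_plan_mode`, `exit_plan_mode`, `write_file`, `replace`), and the `findPlanModeViolations` helper are hypothetical placeholders, not the project's actual eval harness or tool registry.

```typescript
// Hypothetical sketch: these types and names are illustrative only and do not
// come from the project's eval harness.

interface ToolCall {
  name: string;
}

// Assumed tool names; the real identifiers may differ.
const ENTER_PLAN_MODE = "enter_plan_mode";
const EXIT_PLAN_MODE = "exit_plan_mode";

// Assumed set of tools that modify files.
const WRITE_TOOLS: ReadonlySet<string> = new Set(["write_file", "replace"]);

// Walk the trace in order, tracking whether the agent is in plan mode, and
// collect the name of every write tool that was called while in plan mode.
function findPlanModeViolations(
  trace: ToolCall[],
  startInPlanMode = false,
): string[] {
  let inPlanMode = startInPlanMode;
  const violations: string[] = [];
  for (const call of trace) {
    if (call.name === ENTER_PLAN_MODE) {
      inPlanMode = true;
    } else if (call.name === EXIT_PLAN_MODE) {
      inPlanMode = false;
    } else if (inPlanMode && WRITE_TOOLS.has(call.name)) {
      violations.push(call.name);
    }
  }
  return violations;
}
```

Under these assumptions, an eval would pass when the returned list is empty. Because model behavior is non-deterministic, such a check would likely need to run over several sampled traces to show that the refusal is consistent.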

Metadata


Assignees

Labels

area/core: Issues related to User Interface, OS Support, Core Functionality
workstream-rollup: Label used to tag epics and features that are associated with one of the three primary workstreams
🔒 maintainer only: ⛔ Do not contribute. Internal roadmap item.

Type

Projects

Status

Closed

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
