Labels
area/core: Issues related to User Interface, OS Support, Core Functionality
workstream-rollup: Label used to tag epics and features that are associated with one of the three primary workstreams
🔒 maintainer only: ⛔ Do not contribute. Internal roadmap item.
Description
Add behavioral evals to verify that the agent adheres to its restriction to read-only tools. These tests validate that the model consistently refuses file modifications while in PLAN mode. Also include evals for the EnterPlanMode and ExitPlanMode tools.
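As a rough illustration of what such an eval could check, the sketch below scans a recorded sequence of tool calls and flags any non-read-only call made while PLAN mode is active, with EnterPlanMode and ExitPlanMode toggling the mode. Everything here is hypothetical: the `ToolCall` record, the `plan_mode_violations` helper, and the read-only tool names (`read_file`, `list_directory`, `glob`, `grep`, `write_file`) are assumptions for illustration, not the project's actual eval harness or tool registry.

```python
from dataclasses import dataclass, field

# Assumption: illustrative read-only tool names; the real tool registry may differ.
READ_ONLY_TOOLS = frozenset({"read_file", "list_directory", "glob", "grep"})
# Mode-switch tools named in the issue, treated here as plain tool-name strings.
ENTER_PLAN = "EnterPlanMode"
EXIT_PLAN = "ExitPlanMode"


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation recorded from an agent transcript (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


def plan_mode_violations(calls: list[ToolCall], start_in_plan: bool = False) -> list[int]:
    """Return the indices of calls that are not read-only while PLAN mode is active."""
    in_plan = start_in_plan
    violations = []
    for i, call in enumerate(calls):
        if call.name == ENTER_PLAN:
            in_plan = True
        elif call.name == EXIT_PLAN:
            in_plan = False
        elif in_plan and call.name not in READ_ONLY_TOOLS:
            violations.append(i)
    return violations


# Example transcript: one write inside PLAN mode, one after leaving it.
transcript = [
    ToolCall(ENTER_PLAN),
    ToolCall("read_file", {"path": "src/app.ts"}),
    ToolCall("write_file", {"path": "src/app.ts"}),  # should have been refused in PLAN mode
    ToolCall(EXIT_PLAN),
    ToolCall("write_file", {"path": "src/app.ts"}),  # allowed once PLAN mode is exited
]
print(plan_mode_violations(transcript))  # [2]
```

A passing eval would then assert that the violation list is empty for transcripts produced under PLAN mode; tracking the mode per call, rather than per transcript, also lets the same check cover the EnterPlanMode and ExitPlanMode transitions.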
This depends on the refactor described in #17168.
Evals for the AskUser tool are tracked in #17956.
Metadata
Status
Closed