Description
The "integration tests" we discussed in the context of this issue should primarily be tests that use a real LLM and run the system end to end - maybe we should name them regression tests?
They should consist of two types:
- Tests for "Agent behavior": using a real LLM, give the same instruction, and use a test case to check whether the agent is able to perform changes to the workspace that result in a desirable final state. We have integration tests for this at https://github.com/All-Hands-AI/OpenHands/tree/main/evaluation/integration_tests that run daily and comment the result to a GitHub issue (workflow). We need to migrate them to this repository and have them run daily with the same behavior.
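An agent-behavior regression test of this kind could be sketched as below. Note that `run_agent`, the instruction text, and the workspace layout are all hypothetical placeholders (the stub just fabricates the expected file so the sketch is self-contained); a real test would drive the actual agent with a live LLM and only keep the final-state assertions:

```python
import tempfile
from pathlib import Path


def run_agent(instruction: str, workspace: Path) -> None:
    """Hypothetical stand-in for running the agent with a real LLM.

    Stubbed here so the sketch is runnable: it simply creates the file
    the instruction asks for. The real version would invoke the agent.
    """
    (workspace / "hello.txt").write_text("hello world\n")


def test_agent_reaches_desired_final_state() -> None:
    # The test fixes one instruction, lets the agent act, and then
    # inspects only the final workspace state - not intermediate steps.
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        run_agent("Create hello.txt containing 'hello world'", workspace)
        assert (workspace / "hello.txt").read_text().strip() == "hello world"


test_agent_reaches_desired_final_state()
print("ok")
```

Checking only the final state keeps the test robust to changes in how the agent gets there, which is what makes it usable as a daily regression signal.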
- Tests for "System behavior": using a real LLM to check whether the system components are working. An example would be checking if `.reasoning_content` is successfully returned by the API for the changes introduced in this PR: feat: Support reasoning content in Agent SDK #139
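A system-behavior check like the `.reasoning_content` one could look roughly like the following. The `Message` class and `llm_completion` function here are simplified stand-ins, not the real SDK API (the stub returns a canned response so the sketch runs without an LLM); the real test would make a live call to a reasoning-capable model:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    """Simplified stand-in for an SDK completion message."""
    content: str
    reasoning_content: Optional[str] = None


def llm_completion() -> Message:
    # Stand-in for a real-LLM call; a reasoning model is expected to
    # populate reasoning_content alongside the final answer.
    return Message(content="42", reasoning_content="Step 1: ...")


def test_reasoning_content_is_returned() -> None:
    msg = llm_completion()
    # The system test asserts the field survives the round trip through
    # the SDK, not that its text has any particular value.
    assert msg.reasoning_content is not None
    assert msg.content


test_reasoning_content_is_returned()
print("ok")
```

Because these tests target plumbing rather than agent intelligence, the assertions stay deliberately loose: presence and shape of fields, not exact content.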