Description
I have been testing the SWE-bench Verified dataset using the Seed-Coder-8B-Instruct model with the OpenHands framework. Out of 179 test cases, the accuracy achieved was only 3.9%, which is significantly lower than the 11.2% reported in the paper.
Key Findings:
Empty Patches: During testing, I observed that the Seed-Coder model produced many empty patches, which appears to be the primary reason for the low solve rate. Out of the 179 test cases, only 7 were solved, 37 produced patches, and 138 produced no patch at all.
Looping Issues: I also encountered significant looping during inference. For instance, the model kept repeating near-identical commands without making progress, as shown in the following example:
```json
{
  "role": "assistant",
  "content": "<function=execute_bash><parameter=command> grep -rn 'ORDER BY' /workspace/django__django__3.0/django/db/models/sql/compiler.py:86-100 "
},
{
  "role": "user",
  "content": "EXECUTION RESULT of [execute_bash]: grep: /workspace/django__django__3.0/django/db/models/sql/compiler.py:86-100: No such file or directory [The command completed with exit code 2.] [Current working directory: /workspace/django__django__3.0] [Python interpreter: /opt/miniconda3/envs/testbed/bin/python] [Command finished with exit code 2]"
},
{
  "role": "assistant",
  "content": "<function=execute_bash><parameter=command> grep -rn 'ORDER BY' /workspace/django__django__3.0/django/db/models/sql/compiler.py:257-300 "
},
{
  "role": "user",
  "content": "EXECUTION RESULT of [execute_bash]: grep: /workspace/django__django__3.0/django/db/models/sql/compiler.py:257-300: No such file or directory [The command completed with exit code 2.] [The command finished with exit code 2]"
},
{
  "role": "assistant",
  "content": "<function=execute_bash><parameter=command> grep -rn 'ORDER BY' /workspace/django__django__3.0/django/db/models/sql/compiler.py:359-400 "
}
```
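The failure is the same on every iteration: the model appends a line range to the file path, grep treats `compiler.py:86-100` as a literal (nonexistent) filename, the command exits with code 2, and the agent retries with a slightly different range. As a minimal illustration, the sed-based command below is my own guess at what the model presumably intended; it is not something the agent actually ran:

```bash
# What the agent ran: grep treats "compiler.py:86-100" as a literal path, so it fails.
grep -rn 'ORDER BY' /workspace/django__django__3.0/django/db/models/sql/compiler.py:86-100

# One way to express the likely intent: search for 'ORDER BY' within lines 86-100 only.
sed -n '86,100p' /workspace/django__django__3.0/django/db/models/sql/compiler.py | grep -n 'ORDER BY'
```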
Additional Observations:
Looping Issue: In total, 66 of the 179 test cases displayed looping behavior during inference.
Excessive Inference Steps: 114 test cases hit the 120-round inference limit, with 5 test cases exhibiting both looping and excessive rounds.
Experimental Setup:
OpenHands framework: maximum inference steps set to 120; using CodeActAgent with parallelism set to 5.
Seed-Coder-8B-Instruct: temperature set to 0.0; using the OpenAI parameter format; deployed locally with vLLM; maximum model length set to 32768 tokens.
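For context, here is a minimal sketch of the serving setup described above; the model identifier, port, and base URL are assumptions on my part and may differ from my actual configuration:

```bash
# Sketch of the local deployment (model ID, port, and base URL are assumptions):
vllm serve ByteDance-Seed/Seed-Coder-8B-Instruct \
  --max-model-len 32768 \
  --port 8000

# OpenHands is then pointed at the resulting OpenAI-compatible endpoint,
# e.g. base URL http://localhost:8000/v1, with temperature 0.0.
```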
Could you kindly assist in resolving these issues or provide any insights on potential causes?