Description
I have been testing the SWE-bench Verified dataset using the Seed-Coder-8B-Instruct model with the OpenHands framework. Out of 179 test cases, the accuracy achieved was only 3.9%, which is significantly lower than the 11.2% reported in the paper.
Key Findings:
Empty Patches: During testing, I observed that the Seed-Coder model produced many empty patches, which appears to be the primary reason for the low solve rate. Out of the 179 test cases, only 7 were solved, 37 produced patches, and 138 produced no patch at all.
Looping Issues: I also encountered significant looping during inference. For instance, the model kept repeating near-identical commands without making progress, as shown in the following example:
```json
{
  "role": "assistant",
  "content": "<function=execute_bash><parameter=command> grep -rn 'ORDER BY' /workspace/django__django__3.0/django/db/models/sql/compiler.py:86-100 "
},
{
  "role": "user",
  "content": "EXECUTION RESULT of [execute_bash]: grep: /workspace/django__django__3.0/django/db/models/sql/compiler.py:86-100: No such file or directory [The command completed with exit code 2.] [Current working directory: /workspace/django__django__3.0] [Python interpreter: /opt/miniconda3/envs/testbed/bin/python] [Command finished with exit code 2]"
},
{
  "role": "assistant",
  "content": "<function=execute_bash><parameter=command> grep -rn 'ORDER BY' /workspace/django__django__3.0/django/db/models/sql/compiler.py:257-300 "
},
{
  "role": "user",
  "content": "EXECUTION RESULT of [execute_bash]: grep: /workspace/django__django__3.0/django/db/models/sql/compiler.py:257-300: No such file or directory [The command completed with exit code 2.] [The command finished with exit code 2]"
},
{
  "role": "assistant",
  "content": "<function=execute_bash><parameter=command> grep -rn 'ORDER BY' /workspace/django__django__3.0/django/db/models/sql/compiler.py:359-400 "
}
```
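The failure is the same on every iteration: the model appends a line range to the file path, grep treats `compiler.py:86-100` as a literal (nonexistent) filename, the command exits with code 2, and the agent retries with a slightly different range. As a minimal illustration, the sed-based command below is my own guess at what the model presumably intended; it is not something the agent actually ran:

```bash
# What the agent ran: grep treats "compiler.py:86-100" as a literal path, so it fails.
grep -rn 'ORDER BY' /workspace/django__django__3.0/django/db/models/sql/compiler.py:86-100

# One way to express the likely intent: search for 'ORDER BY' within lines 86-100 only.
sed -n '86,100p' /workspace/django__django__3.0/django/db/models/sql/compiler.py | grep -n 'ORDER BY'
```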
Additional Observations:
Looping Issue: In total, 66 of the 179 test cases displayed looping behavior during inference.
Excessive Inference Steps: 114 test cases hit the 120-round inference limit, with 5 test cases exhibiting both looping and excessive rounds.
Experimental Setup:
OpenHands framework: maximum inference steps set to 120; using CodeActAgent with parallelism set to 5.
Seed-Coder-8B-Instruct: temperature set to 0.0; using the OpenAI parameter format; deployed locally with vLLM; maximum model length set to 32768 tokens.
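For context, here is a minimal sketch of the serving setup described above; the model identifier, port, and base URL are assumptions on my part and may differ from my actual configuration:

```bash
# Sketch of the local deployment (model ID, port, and base URL are assumptions):
vllm serve ByteDance-Seed/Seed-Coder-8B-Instruct \
  --max-model-len 32768 \
  --port 8000

# OpenHands is then pointed at the resulting OpenAI-compatible endpoint,
# e.g. base URL http://localhost:8000/v1, with temperature 0.0.
```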
Could you kindly assist in resolving these issues or provide any insights on potential causes?