
[Bug]: Get empty git diff after inference although the agent did some work #228

Closed
kevin-support-bot bot opened this issue Jan 22, 2025 · 35 comments

kevin-support-bot bot commented Jan 22, 2025

All-Hands-AI#6407 Issue


@tangken333, in matplotlib__matplotlib-25442, the agent got stuck in a loop because the workspace was empty due to the symlink.

You can use this script to visualize the eval like this.

@tangken333

So why is the workspace empty? What can I do about that? Thanks!

@tangken333

Besides, I found that some diffs are in an invalid format. Do you know what's wrong with that? Thanks!

@SmartManoj (Owner)

So why is the workspace empty? What can I do about that? Thanks!

Check the /testbed folder.

I found that some diffs are in an invalid format.

It shouldn't be. It is generated using git diff. Would you isolate that diff?
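For what it's worth, a rough way to isolate one instance's diff from the inference output and sanity-check its format (the test_result.git_patch field name is an assumption about the output.jsonl layout):

  # pull a single instance's patch out of the output file
  jq -r 'select(.instance_id == "matplotlib__matplotlib-25442") | .test_result.git_patch' output.jsonl > one_instance.patch

  # a well-formed patch should pass a dry-run apply against the matching checkout
  cd /path/to/matplotlib && git apply --check /path/to/one_instance.patch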

@tangken333

Thanks for the reply. Could you please tell me more about how to solve the empty-workspace problem? I don't know how to get inside the environment. I would also like to ask whether this problem comes from building the Docker image or from each run. Thanks!

@SmartManoj (Owner)

SmartManoj commented Jan 23, 2025

You could use Docker Desktop.

Image

Maybe there is a bug in the Docker image. Here, the files are copied.

@tangken333

I am testing on a Linux server, so I might not be able to use Docker Desktop.

So if I have already finished building the Docker images, is there anything I can do now to save the test? I have tested on two different Linux servers, and both have the empty git diff problem. Is there any temperate patch I can do?

I'm in a bit of a hurry with the testing. Please let me know. Thanks!

@SmartManoj (Owner)

SmartManoj commented Jan 23, 2025

You could run docker exec -it <container-id> /bin/bash to open the container terminal, and docker ps -a to get the container ID.
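For example (the container id is a placeholder; /testbed is the path mentioned above):

  # list all containers, including ones that have already exited
  docker ps -a
  docker start <container-id>   # only needed if the container has already exited
  docker exec -it <container-id> /bin/bash

  # inside the container, inspect what the agent actually changed
  ls /testbed
  cd /testbed && git status && git diff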

I am debugging that instance now.


is there anything I can do now to save the test

Which test?

temperate patch

u meant temporary?

@tangken333

tangken333 commented Jan 23, 2025

You could run docker exec -it <container-id> /bin/bash to open the container terminal, and docker ps -a to get the container ID.

Yeah, I know how to run Docker, but the container is deleted after the run, so I cannot go into the environment now. When I docker run bash, I can see the testbed there.

I am debugging that instance now.

Really thanks for your help!

Which test?

I mean my run on SWE-bench.

u meant temporary?

Yes, temporary (damn writing completion tools)

@SmartManoj (Owner)

but the container is deleted after the run

Is the keep_runtime_alive sandbox config set?
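If it is not set, a minimal sketch of enabling it (assuming the option lives under a [sandbox] section of config.toml in this OpenHands version; add the key to an existing [sandbox] section if you already have one):

  # keep the sandbox container around after the run instead of removing it
  printf '\n[sandbox]\nkeep_runtime_alive = true\n' >> config.toml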

@tangken333

Seems not.
All I do is (the full sequence is sketched below):

  • clone the project and check out the CodeAct V2.1 version.
  • install the project by activating a conda environment and running pip install .
  • run the command: ./evaluation/swe_bench/scripts/run_infer.sh llm.eval_o3 HEAD CodeActAgent 300 100 1 princeton-nlp/SWE-bench_Lite test
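Roughly, that sequence looks like this (a sketch; the repo URL and conda environment name are assumptions, and <commit-hash> is whatever commit you pin):

  # clone and pin the version under test
  git clone https://github.com/All-Hands-AI/OpenHands.git && cd OpenHands
  git checkout <commit-hash>

  # install into a conda environment (environment name is a placeholder)
  conda activate openhands && pip install .

  # run inference on SWE-bench Lite (arguments as in the command above)
  ./evaluation/swe_bench/scripts/run_infer.sh llm.eval_o3 HEAD CodeActAgent 300 100 1 princeton-nlp/SWE-bench_Lite test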

@SmartManoj (Owner)

Workaround: You can change the directory to /testbed here

@SmartManoj (Owner)

  • clone the project and check out the CodeAct V2.1 version.

Would you give the commit hash?

Works in the latest version.
Image

@tangken333

tangken333 commented Jan 23, 2025

I am using the commit: 6498204
Not sure if this is a random problem, since it should not happen if you are always running and developing.

Workaround: You can change the directory to /testbed

This seems to work! I am running with that now.

@SmartManoj (Owner)

All-Hands-AI#5549 This change fixed that.

@tangken333

All-Hands-AI#5549 This change fixed that.

I went through this bug fix! Before I merged it, all the git diffs were empty. After merging it, some of the git diffs are still empty. Not sure whether they come from the same cause.

@SmartManoj (Owner)

All-Hands-AI#5659 Did you apply this one too?

@tangken333

No, I haven't tried 5659 yet. I will try it later. Thanks for your help!
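If it helps, one way to pull both referenced fixes onto a pinned checkout is to fetch the PR heads and merge them (the remote name and merge approach here are assumptions, not the project's documented workflow):

  # fetch the two PRs referenced above from the upstream repo
  git remote add upstream https://github.com/All-Hands-AI/OpenHands.git
  git fetch upstream pull/5549/head:pr-5549 pull/5659/head:pr-5659

  # merge them on top of the commit you are testing
  git merge pr-5549 pr-5659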

SmartManoj added a commit that referenced this issue Jan 23, 2025
For #228
@SmartManoj (Owner)

SmartManoj commented Jan 23, 2025

I am using the commit: 6498204

Added 9262560 on top of that so the container isn't removed and only that instance is run.


@tangken333 Edit:
find /workspace/matplotlib__matplotlib__3.7 works after the mv as well.

Image
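To check whether the workspace entry is a dangling symlink into /testbed or a real copy, something like this inside the runtime container helps (paths are the ones from this instance):

  # is the workspace entry a symlink, and does its target resolve?
  ls -la /workspace
  readlink -f /workspace/matplotlib__matplotlib__3.7

  # a real copy should list actual files here
  find /workspace/matplotlib__matplotlib__3.7 -maxdepth 1 | head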

@SmartManoj (Owner)

I am using the commit: 6498204

Why did you choose this commit?

The original trajectory uses ls -R.

@tangken333

tangken333 commented Jan 23, 2025

Why did you choose this commit?

I am following the commitId from the metadata. And I checked the version of the CodeAct agent: it is V2.1 (correct).

Image

By the way, after I got the result, I faced some new problems.

  1. If I directly run your evaluation script
    ./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh /root/OpenHands/evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite-test/CodeActAgent/gpt-4o-2024-08-06_maxiter_100_N_v2.1-no-hint-run_1/output.jsonl
    I got this error:
Detecting whether PROCESS_FILEPATH is in OH format or in SWE-bench format
==============================================================
The file IS NOT in SWE-bench format.
Merged output file with fine-grained report will be saved to /root/OpenHands/evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite-test/CodeActAgent/gpt-4o-2024-08-06_maxiter_100_N_v2.1-no-hint-run_1
Traceback (most recent call last):
  File "/root/Openhands_new/evaluation/benchmarks/swe_bench/scripts/eval/convert_oh_output_to_swe_json.py", line 6, in <module>
    from evaluation.benchmarks.swe_bench.eval_infer import process_git_patch
  File "/root/Openhands_new/evaluation/benchmarks/swe_bench/eval_infer.py", line 8, in <module>
    from swebench.harness.grading import get_eval_report
  File "/root/anaconda3/envs/openhands/lib/python3.12/site-packages/swebench/__init__.py", line 46, in <module>
    from swebench.harness.run_evaluation import (
  File "/root/anaconda3/envs/openhands/lib/python3.12/site-packages/swebench/harness/run_evaluation.py", line 48, in <module>
    from swebench.harness.modal_eval import (
  File "/root/anaconda3/envs/openhands/lib/python3.12/site-packages/swebench/harness/modal_eval/__init__.py", line 1, in <module>
    from swebench.harness.modal_eval.run_evaluation_modal import run_instances_modal
  File "/root/anaconda3/envs/openhands/lib/python3.12/site-packages/swebench/harness/modal_eval/run_evaluation_modal.py", line 206, in <module>
    image=swebench_image.add_local_file(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Image' object has no attribute 'add_local_file'. Did you mean: 'copy_local_file'?
SWEBENCH_FORMAT_JSONL: /root/OpenHands/evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite-test/CodeActAgent/gpt-4o-2024-08-06_maxiter_100_N_v2.1-no-hint-run_1/output.swebench.jsonl
Error: /root/OpenHands/evaluation/evaluation_outputs/outputs/princeton-nlp__SWE-bench_Lite-test/CodeActAgent/gpt-4o-2024-08-06_maxiter_100_N_v2.1-no-hint-run_1/output.swebench.jsonl does not exist. There is probably an error in the conversion process.
  2. Then I just copied the git diff from the output, converted it to the SWE-bench evaluation format manually (a sketch of that conversion is below), and ran the evaluation myself. I got some invalid patches (I am using GPT-4o):

Image
Image

After I changed the model from GPT-4o to Claude-3.5-Sonnet and tested on three instances, I still get 1 invalid patch, but it is better. Is that related to the model?
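The manual conversion from item 2 can be sketched like this (the field names test_result.git_patch, model_patch, and model_name_or_path are assumptions about the two formats, not verified against this exact version):

  # build a SWE-bench-style predictions file from the OpenHands output
  jq -c '{instance_id: .instance_id, model_name_or_path: "CodeActAgent", model_patch: .test_result.git_patch}' output.jsonl > predictions.jsonl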

@SmartManoj (Owner)

SmartManoj commented Jan 24, 2025

I am following the commitId from the metadata.

Would you provide the URL?


Your swebench version?

Is there any difference between the two diffs? eval_infer.sh uses this script to convert to swe_bench format.

@tangken333

Would you provide the URL?

All-Hands-AI#4537

Your swebench version?

swebench 3.0.4

Is there any difference between the two diffs? eval_infer.sh uses this script to convert to swe_bench format.

I don't think there are differences, because I directly take the value from the dict. I'll give you the output.jsonl.

openhands_44_100_2.jsonl.zip

@SmartManoj (Owner)

SmartManoj commented Jan 24, 2025

Would you provide the URL?

All-Hands-AI#4537

The outputs folder is in .gitignore. Would you give the direct link to the JSON file?


swebench 3.0.4

Would you use this old SWE-bench version v2.0.13?
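Pinning that release is just (a sketch, assuming swebench is installed from PyPI in this environment):

  pip install "swebench==2.0.13"
  pip show swebench | grep -i version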

@tangken333

The outputs folder is in .gitignore. Would you give the direct link to the JSON file?

Sorry, I didn't get this. What do you mean?

Would you use this old SWE-bench version v2.0.13?

Yes, I will try that now.

@SmartManoj (Owner)

I am following the commitId from the metadata.

Would you give the direct link to the metadata.json file?

@tangken333

I checked out this commitId and ran this command using "HEAD":
./evaluation/swe_bench/scripts/run_infer.sh llm.eval_o3 HEAD CodeActAgent 300 100 1 princeton-nlp/SWE-bench_Lite test

@SmartManoj (Owner)

How did you get this image?

Image

@tangken333

I got valid patches now, by changing the swebench version and running the script. Thanks!

@SmartManoj (Owner)

But the commit ID is ea2cca3 as mentioned here.

@tangken333

tangken333 commented Jan 25, 2025

Interesting, I don't know. But they both belong to one pull request, All-Hands-AI#4537, which is strongly related to this task.

I guess there is not too much difference except for some small bugfixes?

@SmartManoj (Owner)

79 commits

@tangken333

My bad, I will look at this also. But I am using a newer version. Should be better?

@SmartManoj (Owner)

Should be better?

Is the objective to just evaluate using GPT-4o? Why not verified-mini?
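For reference, the dataset is just an argument to run_infer.sh; switching to the Verified split would look like this (the exact Hugging Face id for a verified-mini subset is not given in this thread, so SWE-bench_Verified is used as a stand-in):

  # same invocation as before, only the dataset argument changes
  ./evaluation/swe_bench/scripts/run_infer.sh llm.eval_o3 HEAD CodeActAgent 300 100 1 princeton-nlp/SWE-bench_Verified test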

@tangken333

Is the objective to just evaluate using GPT-4o?

I am only using 4o now for my test. poor qwq

Why not verified-mini?

Oh, I only just learned about that. Thanks, I will take a look.
