[Feat] Added FileEditAction to enable edits using diff format. #3777

Open
RajWorking wants to merge 13 commits into main

Conversation

@RajWorking (Contributor) commented Sep 8, 2024

Short description of the problem this fixes or functionality that this introduces. This may be used for the CHANGELOG
We want to switch from our current agentskills style (similar to https://aider.chat/docs/benchmarks.html#diff-func) to Aider's diff block (https://aider.chat/docs/benchmarks.html#diff) and see whether that improves editing performance.


Give a summary of what the PR does, explaining any non-trivial design decisions

This PR adds a new EditAction, with logic in the CodeActParser to parse the model's output in diff format (translating the diff into an EditAction).
The EditAction is then executed inside the runtime (for now, we simply parse the diff format into search and replace blocks and manually call the existing agentskills to perform the edit).

We use the <execute_edit> tags to enclose the diff format. For example,

<execute_edit>
demo.py
<<<<<<< SEARCH
    print("hello")
=======
    print("goodbye")
>>>>>>> REPLACE
</execute_edit>
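
For illustration, here is a minimal sketch of how the runtime might translate such a block into a (path, search, replace) triple before handing it to the existing agentskills. The function and regex are hypothetical, not the PR's actual parser:

```python
import re

# Hypothetical sketch: extract (path, search, replace) from an
# <execute_edit> body in the SEARCH/REPLACE format shown above.
BLOCK_RE = re.compile(
    r'(?P<path>[^\n]+)\n'
    r'<{7} SEARCH\n'
    r'(?P<search>.*?)\n?'
    r'={7}\n'
    r'(?P<replace>.*?)\n?'
    r'>{7} REPLACE',
    re.DOTALL,
)

def parse_edit_block(body: str) -> tuple[str, str, str]:
    """Return (file path, text to search for, replacement text)."""
    m = BLOCK_RE.search(body)
    if m is None:
        raise ValueError('malformed <execute_edit> block')
    return m.group('path').strip(), m.group('search'), m.group('replace')
```

Applied to the example above, this would yield ('demo.py', '    print("hello")', '    print("goodbye")'), which maps directly onto a search/replace call into agentskills.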

Link of any specific issues this addresses
#3650

@tobitege (Collaborator) commented Sep 8, 2024

Hey Raj,
thanks so much for taking this on!
Got a few questions, though:

  • What is the reasoning behind moving this into an action/observation instead of keeping it in the agentskills? There seems to be a lot of overhead for a comparatively small number of added lines that implement the action.
  • The action then still uses the same agentskill methods to do the actual operation. What's the difference here compared to how Aider does its implementation?
  • One problem with having it in the client runtime is that it then also needs to be ported into the remote runtime.

These are just first questions/thoughts. Do you have other things still on a to-do list (besides integration test mocks 😬)?

@RajWorking (Contributor, Author) commented

Valid question @tobitege, I also thought so initially.

The primary motive here is to get the agent to produce the edit in a diff format similar to Aider's (simply because LLMs are trained to do that better). This means we need to modify the prompt and get the agent to output the diff wrapped in an <execute_edit> tag.
From there on, actions/observations are the only way to actuate the edit through the runtime.

@xingyaoww Can probably explain this better?

@tobitege (Collaborator) commented Sep 8, 2024

> Valid question @tobitege, I also thought so initially.
>
> The primary motive here is to get the agent to produce the edit in a diff format similar to Aider's (simply because LLMs are trained to do that better). This means we need to modify the prompt and get the agent to output the diff wrapped in an <execute_edit> tag. From there on, actions/observations are the only way to actuate the edit through the runtime.
>
> @xingyaoww Can probably explain this better?

Oh, no, that's what the "user_prompt.j2" template (in the codeact_agent folder) and the docstrings in the file_ops.py file are for. :)

@tobitege (Collaborator) commented Sep 8, 2024

You might take a peek at the PromptManager (it's a rather recent addition):
https://github.com/All-Hands-AI/OpenHands/blob/5100d12cea2cd35c30a22e25fbac376b72ed0981/openhands/utils/prompt.py

Also, don't worry about the integration tests for now. We can regenerate the mock files for you when the time is right.

@xingyaoww (Contributor) commented

> The action then still uses the same agentskill methods to do the actual operation. What's the difference here compared to how Aider does its implementation?

@tobitege Yeah, the main difference between this PR and our current agentskill implementation is that we expose an editing interface to the agent that does not require explicit string escaping (e.g., you would need to escape strings when calling Python functions from agentskills, which causes weird bugs). Requiring the LLM to return output as JSON has also been shown to be pretty problematic due to the need to escape things (https://aider.chat/2024/08/14/code-in-json.html).

If this diff edit format works (improves performance), we can implement this action directly on the runtime client's side like FileRead and FileWrite. If it doesn't work - I'd say we don't bother.
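
To make the escaping point concrete, here is a hedged comparison. The function name mirrors the agentskills listed in this PR's diff, but the exact parameters are an assumption:

```python
# Function-call style: the edit payload must be a valid, escaped Python
# string literal, which LLMs frequently get wrong (signature assumed):
edit_file_by_replace(
    'demo.py',
    to_replace='print("hello")',
    new_content='print("it\'s done")',  # nested quote already forces an escape
)
```

With the diff block, the same replacement text is written verbatim between the markers, with no quoting or escaping layer at all.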

@tobitege (Collaborator) commented Sep 9, 2024

> @tobitege Yeah, the main difference between this PR and our current agentskill implementation is that we expose an editing interface to the agent that does not require explicit string escaping (e.g., you would need to escape strings when calling Python functions from agentskills, which causes weird bugs). Requiring the LLM to return output as JSON has also been shown to be pretty problematic due to the need to escape things (https://aider.chat/2024/08/14/code-in-json.html).

Alright, sounds like it should help prevent a lot of issues this way. 👍

@xingyaoww (Contributor) left a comment


This implementation overall LGTM! - I'd be interested to see if we can improve the AiderBench score on this. @RajWorking LMK if you need more credits!

Once we can get some observable performance improvement on AiderBench, I can run this on SWE-Bench too :D

'edit_file_by_replace',
'insert_content_at_line',
'append_file',
] ## DISABLED TEMPORARILY.
@xingyaoww (Contributor):

You might also consider disabling things like create_file and including some instructions in the prompt, telling the LLM that it can also create new files using EditAction (if this is true, ofc).
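
If the format follows Aider's convention, creating a new file could be expressed as a block with an empty SEARCH section, e.g. (whether this PR actually supports this is exactly the "if this is true" question above; the file name is made up):

<execute_edit>
new_module.py
<<<<<<< SEARCH
=======
def greet():
    print("hello")
>>>>>>> REPLACE
</execute_edit>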

@RajWorking (Contributor, Author):

That's a good idea, maybe we can do that in a subsequent PR?

@xingyaoww (Contributor):

Yeah - I think you can experiment with this when you are running benchmarks (e.g., try this and see if it increases the score).

@tobitege (Collaborator) commented Sep 9, 2024

@RajWorking if you'd update the branch (either by clicking the button or merging main), I could run the regeneration of the integration files for you, just let me know. :)

@RajWorking (Contributor, Author) commented

Performance on the Aider Bench:
Slight improvement: 84/133 tests passed.


@xingyaoww (Contributor) commented

@RajWorking Can you plot before vs. after? Are you using test cases? Is there a change in the average number of "turns" / average cost?

@xingyaoww (Contributor) commented

Once #3829 and #3836 are merged, I'll start an SWE-Bench Lite eval on this PR 😄

@xingyaoww (Contributor) commented

Probably also a good time to bump the version of CodeAct to 1.10, since there are significant prompt changes - I can push a commit.
https://github.com/All-Hands-AI/OpenHands/pull/3777/files#diff-9e2cd489ef2b4ad5f779049fd4a91abe7adea0e5b038fa2a898d1688451ae0eeL38

@xingyaoww (Contributor) commented

@RajWorking We only got 74/300 on this branch, using the exact same protocol I used in this PR, which got us 89/300. I'm going to re-run the eval on the main branch now and investigate whether this degradation comes from recent changes on main or from this PR's editing approach.

In the meantime, feel free to check this eval output file for weird cases / room for improvement: claude-3-5-sonnet@20240620_maxiter_30_N_v1.10-no-hint.zip


@tobitege (Collaborator) commented

  • I did try to get one instance (57) to a success, but failed after 8 attempts, in spite of trying a couple of tweaks.
  • ENABLE_AUTO_LINT=true seems not to be implemented, which could help in providing a more meaningful error message to the LLM in case of syntax errors.
  • One line I added to the system prompt, for example:
    SEARCH REQUIRES PROPER INDENTATION! If the assistant plans to add the line ' print(x)', it must fully write the line out, with ALL leading spaces before the code!
  • My first attempt was the worst, as basically all observations had this "truncated" in the middle of them, even though I doubt any had more than 10K characters and I had no other limit specified.
  • Wrong indentation seems to be a major problem, almost by design and not just for this PR's approach:
    if the LLM does manage to replace something, it is often not the full method, and it places something in the file that doesn't "fit" with the code following it.
    -> maybe we need to change prompts to only replace full methods or something to prevent this.
  • The regex for search is, for the above reasons, imo way too strict. LLMs often skip inline comments or have other "coding styles", and requiring them to regurgitate the whole exact same code seems counterproductive compared to just using line numbers - this is by design and not because of this PR.
  • maybe we could try to use the same "fuzzy" search and replace from the existing agent skills for the actual operation (a rough sketch of the idea follows below)
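
A rough sketch of what such fuzzy matching could look like (function name, threshold, and fixed-size window are assumptions, not the agentskills implementation):

```python
import difflib

def find_fuzzy_match(file_text: str, search: str, threshold: float = 0.9):
    """Locate the window of lines in file_text most similar to the SEARCH
    block, tolerating small drift (dropped comments, whitespace changes).
    Returns the (start, end) line span of the best match, or None."""
    file_lines = file_text.splitlines()
    search_block = search.strip('\n')
    n = len(search_block.splitlines())
    best_span, best_ratio = None, threshold
    for start in range(len(file_lines) - n + 1):
        window = '\n'.join(file_lines[start:start + n])
        ratio = difflib.SequenceMatcher(None, window, search_block).ratio()
        if ratio > best_ratio:
            best_span, best_ratio = (start, start + n), ratio
    return best_span
```

This only compares windows of exactly the SEARCH block's line count, so it would still miss matches where the LLM added or dropped whole lines; a production version would also need to slide over variable window sizes.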
