Add SWE-agent tools as sandbox plugins #1305

Merged
merged 9 commits into from
Apr 24, 2024

Conversation

xingyaoww
Collaborator

@xingyaoww xingyaoww commented Apr 23, 2024

Based on sandbox plugins introduced in #1255, this PR adds command line tools from SWE-Agent as a plugin that can be initialized and available for all agents.

This PR also adds a few lightweight dependencies that SWE-agent requires, because adding them post hoc is inconvenient and slow (we would need to re-run apt-get update and then install).

Here's a demo from running python3 opendevin/sandbox/docker/ssh_box.py:

Be sure to docker pull ghcr.io/opendevin/sandbox:xw-swe-agent-tool-plugins and use this image for testing.

[Two demo screenshots]

@rbren
Collaborator

rbren commented Apr 23, 2024

So we actually had started taking a different approach here: #846

The idea was to convert all of SWE-agent's commands into Actions. We already have support for most of them, with a little translation.

IMO this will be important for an actual SWE-agent integration into OpenDevin--if we just expose these commands as bash commands, we don't get any structured data about e.g. what files are being edited and how.

I suppose one approach would be to merge this PR as-is, but have the agent implementation eventually intercept open, scroll_up, scroll_down etc and convert them into Actions. WDYT?

Comment on lines +414 to +419
ssh_box.init_plugins([JupyterRequirement(), SWEAgentCommandsRequirement()])
logger.info(
    '--- SWE-AGENT COMMAND DOCUMENTATION ---\n'
    f'{SWEAgentCommandsRequirement().documentation}\n'
    '---'
)
Collaborator

Eventually we should consider pulling this main logic into sandbox.py, and making both sandbox type and the plugin list a command-line arg!
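
One way the command-line idea above could look is sketched here with argparse. The flag names (`--sandbox-type`, `--plugins`) and the lists of known sandboxes and plugins are hypothetical placeholders, not names taken from the OpenDevin codebase.

```python
import argparse

# Hypothetical registries: the real sandbox and plugin names may differ.
KNOWN_SANDBOXES = ("ssh", "exec", "local")
KNOWN_PLUGINS = ("jupyter", "swe_agent_commands")


def parse_args(argv=None):
    """Parse the sandbox type and the ordered list of plugins to initialize."""
    parser = argparse.ArgumentParser(description="Start a sandbox with plugins")
    parser.add_argument("--sandbox-type", choices=KNOWN_SANDBOXES, default="ssh")
    parser.add_argument(
        "--plugins",
        nargs="*",
        choices=KNOWN_PLUGINS,
        default=[],
        help="plugins to initialize, in order",
    )
    return parser.parse_args(argv)


args = parse_args(["--sandbox-type", "ssh", "--plugins", "jupyter", "swe_agent_commands"])
print(args.sandbox_type, args.plugins)
```

A caller would then map each parsed plugin name to its requirement object before passing the list to init_plugins.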

@xingyaoww
Collaborator Author

xingyaoww commented Apr 23, 2024

@rbren, I totally understand your concern about translating these commands into structured data. However, I kinda don't want to constrain ourselves to that list of SWE-Agent actions, which may grow and become unmanageable at some point.

What we quantitatively verified (across 10+ LLMs) in the CodeAct paper is that:

  • Code (e.g., a raw bash command) has the benefit of compositionality (i.e., it gets more things done in one command). Say the agent now wants to save all the output of search_dir into one file for future use: it can just run search_dir XXX > note.txt. Quantitatively, this translates to better final task performance, especially on complex tasks. Translating everything into structured Actions may constrain the agent's flexibility to perform tasks.

  • The idea of converting every SWE-Agent action into a structured Action and parsing the agent's output is essentially a variant of "Text/JSON as action" discussed in the paper. It can get simple things done, but you must keep expanding the action space as tasks get harder or ask for more. For the same example, you may need to introduce a different Action to save the results of search_dir XXX to a file, which is much more complex than having the agent execute search_dir XXX > note.txt. In my experience, parsing the model's actions (e.g., "intercept open, scroll_up, scroll_down etc and convert them into Actions") can become a complete nightmare: what if the model just echoes something like echo "scroll_up is fun?", where scroll_up shouldn't be executed at all? Your parser will eventually get extremely complex in order to handle most (not all!) of these cases, and this is only for ONE action.
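
The parsing pitfall described above can be illustrated with a small sketch. Both functions are hypothetical interceptors written for this example, not code from OpenDevin or SWE-agent, and the set of command names is only the handful mentioned in this thread.

```python
import shlex

# Command names mentioned in this thread; the real tool set is larger.
SWE_COMMANDS = {"open", "scroll_up", "scroll_down", "search_dir"}


def naive_intercept(command: str) -> bool:
    """Substring matching: fires whenever a command name appears anywhere."""
    return any(name in command for name in SWE_COMMANDS)


def token_intercept(command: str) -> bool:
    """Fires only when the first shell token is a known command name."""
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] in SWE_COMMANDS


echoed = 'echo "scroll_up is fun?"'
composed = "search_dir XXX > note.txt"

print(naive_intercept(echoed))    # True: a false positive
print(token_intercept(echoed))    # False
print(token_intercept(composed))  # True, but the redirection is not accounted for
```

Tokenizing removes the echo false positive, yet the composed command still matches on its first token, so an interceptor that replaced it with a structured action would drop the redirection to note.txt unless it also handled shell operators.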

IMO this will be important for an actual SWE-agent integration into OpenDevin--if we just expose these commands as bash commands, we don't get any structured data about e.g. what files are being edited and how.

The assumed benefit of structured Action for tracking file edits might not be valid.
There are thousands of ways the agent could modify / edit files in the workspace. It may accidentally delete some files via rm without using these APIs, run a git pull that completely changes the content of the directory, or apply a git patch to change the content of files. IMO, casting these SWE-agent actions into the Action class is not gonna be a fundamental solution for tracking file edits as long as the agent has access to bash. Rather, I think it'd probably be more efficient and fundamental if we pursue a git-style approach to monitor file changes (I think there's some Linux primitive (inotify?) that allows you to do that very efficiently).
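
As a rough illustration of tracking changes independently of how they were made, the sketch below compares content-hash snapshots of a directory taken before and after a step. This is a simplified stand-in chosen for portability: it uses neither git nor inotify, and the function names are made up for the example.

```python
import hashlib
import os
import tempfile


def snapshot(root: str) -> dict:
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digests[os.path.relpath(path, root)] = hashlib.sha256(fh.read()).hexdigest()
    return digests


def diff(before: dict, after: dict) -> dict:
    """Classify paths as added, deleted, or modified between two snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "deleted": sorted(set(before) - set(after)),
        "modified": sorted(p for p in before.keys() & after.keys() if before[p] != after[p]),
    }


with tempfile.TemporaryDirectory() as workspace:
    for name, text in [("a.txt", "one"), ("b.txt", "two")]:
        with open(os.path.join(workspace, name), "w") as fh:
            fh.write(text)
    before = snapshot(workspace)

    # Simulate edits made by arbitrary shell commands rather than by any Action.
    os.remove(os.path.join(workspace, "a.txt"))
    with open(os.path.join(workspace, "b.txt"), "w") as fh:
        fh.write("changed")
    with open(os.path.join(workspace, "c.txt"), "w") as fh:
        fh.write("new")

    changes = diff(before, snapshot(workspace))

print(changes)
```

Hashing every file on every step is slow on large workspaces, which is the cost a git-based or event-based monitor would aim to avoid.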


However, at the end of the day, each agent's implementation can (and should) be completely isolated. I actually prefer that we maintain a very small set of "core" Actions well (e.g., BashRun, PythonRun). And if there's really a need for structured Actions, we can start an "auxiliary action library folder" (or actions/contrib?) that provides those actions by wrapping around core actions like BashRun and/or PythonRun. WDYT?

@rbren
Collaborator

rbren commented Apr 23, 2024

Some good points here! I'm in the process of reading your paper.

One of my main goals for OpenDevin is to provide users with a transparent window into what the agent is doing. That way users have a sense of control and certainty. If the agent is just running in a black box, the user will worry that it's doing the wrong thing, wasting time, or even doing something harmful. A black box also makes it harder to measure and debug failures.

The best way to provide visibility is to put as much of the agent's behavior as possible into structured data. That doesn't mean everything needs to be structured data though! We definitely need to give the agent the chance to run arbitrary bash/python/etc as it goes about its task.

To put it succinctly: actions should be structured as much as possible without constraining the agent.

I suppose one approach would be to merge this PR as-is, but have the agent implementation eventually intercept open, scroll_up, scroll_down etc and convert them into Actions. WDYT?

This is what I was getting at here--we can give the agent access to all the raw SWE-agent bash commands like scroll_up, but when we notice a really simple, non-composed command, we translate it into a structured Action. We should be able to do this with ~100% precision, so long as we accept imperfect recall.
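
A minimal sketch of what "high precision, imperfect recall" could mean in practice: translate a command only when the whole string matches a known simple form, and otherwise leave it as plain bash. The patterns and the dictionary shape are assumptions for illustration; the real SWE-agent command signatures may differ.

```python
import re
from typing import Optional

# Hypothetical patterns for a few simple, non-composed commands.
SIMPLE_PATTERNS = {
    "scroll_up": re.compile(r"scroll_up"),
    "scroll_down": re.compile(r"scroll_down"),
    "open": re.compile(r"open (?P<path>[\w./-]+)"),
}


def to_structured(command: str) -> Optional[dict]:
    """Return a structured record only if the entire command matches a pattern."""
    stripped = command.strip()
    for name, pattern in SIMPLE_PATTERNS.items():
        match = pattern.fullmatch(stripped)
        if match:
            return {"action": name, **match.groupdict()}
    return None  # anything else stays an ordinary bash command


for cmd in ["scroll_up", "open src/main.py", 'echo "scroll_up is fun?"', "open a.py && scroll_down"]:
    print(repr(cmd), "->", to_structured(cmd))
```

Because fullmatch must consume the entire string, composed or quoted commands fall through to None. The price is recall: a valid variant that the patterns do not anticipate is also left untranslated.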

The Action will have the exact same effect as the bash, so far as the agent can tell. But it lets us provide a lot more feedback to the user as to what the agent was doing.

There are thousands of ways the agent could modify / edit files in the workspace

Totally agree we can't use read/write Actions as a source-of-truth about file changes. Instead, we should only treat read/write Actions as a partial record.

I think it'd probably be more efficient and fundamental if we pursue a git-style approach to monitor the file change

Someone brought up the idea of basically archiving the whole workspace at every step, so you can fast-forward and rewind! Listening to inotify could be helpful as well. But I don't think either is a replacement for having structured Actions

I actually prefer that we maintain a very small set of "core" Actions

Completely agree here. I don't think we'll add more than a few beyond what we have today.

And if there's really a need for structured Actions, we can start an "auxiliary action library folder" (or actions/contrib?) that provides those actions by wrapping around core actions

I could be convinced otherwise, but I don't even want to maintain auxiliary/contrib actions. Agents can create pseudo-actions inside their implementation, and translate them into core Actions. Maybe if we see the same pseudo-actions getting used over and over we should explore some kind of auxiliary angle.
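
The pseudo-action idea might look like the sketch below, where an agent-local class translates itself into a core action. `BashRun` here is a stand-in dataclass, not the OpenDevin class, and `SearchDir` with its optional directory argument is a hypothetical pseudo-action invented for the example.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class BashRun:
    """Stand-in for a core action that runs one shell command."""
    command: str


@dataclass(frozen=True)
class SearchDir:
    """Hypothetical agent-local pseudo-action; not part of any core action set."""
    term: str
    directory: str = "."

    def to_core(self) -> BashRun:
        # Quote arguments so spaces or shell characters cannot change the command.
        return BashRun(f"search_dir {shlex.quote(self.term)} {shlex.quote(self.directory)}")


print(SearchDir("TODO", "src").to_core())
```

The core only ever sees BashRun, so nothing needs to be maintained outside the agent that defines the pseudo-action.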

@xingyaoww
Collaborator Author

"Someone brought up the idea of basically archiving the whole workspace at every step, so you can fast-forward and rewind!"

I like this one! It is just like the one in Devin's demo video, where you can scroll it forward and backward to check the agent's action.

I could be convinced otherwise, but I don't even want to maintain auxiliary/contrib actions. Agents can create pseudo-actions inside their implementation, and translate them into core Actions.

Exactly! I like this approach. I only mentioned auxiliary actions because I'm worried we might eventually add too many actions for SWE-Bench.

One of my main goals for OpenDevin is to provide users with a transparent window into what the agent is doing. That way, users have a sense of control and certainty.

If we want the user to feel a sense of control and certainty, we can:

  • As you suggest, we can try to get structured output and highlight what's being changed in the UI.
  • Add a confirmation mode of execution, where the user needs to click a button to confirm certain actions before they get executed by the agent.
  • Or we just keep track of the entire workspace using git on the backend and make a commit for every action taken (so that we can fast-forward & rewind). I believe Docker layers (which I think only track changes?) might also be a good alternative here.
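
The confirmation mode in the second bullet could be sketched as a gate in front of the executor. The list of risky prefixes, the callback signatures, and the fake executor are all assumptions made for this example.

```python
from typing import Callable, List, Optional

# Hypothetical policy: command prefixes that require user confirmation.
RISKY_PREFIXES = ("rm ", "git push", "git reset")


def needs_confirmation(command: str) -> bool:
    return command.strip().startswith(RISKY_PREFIXES)


def run_with_confirmation(
    command: str,
    execute: Callable[[str], str],
    confirm: Callable[[str], bool],
) -> Optional[str]:
    """Run the command, asking the user first when the policy says so."""
    if needs_confirmation(command) and not confirm(command):
        return None  # the user declined, so nothing is executed
    return execute(command)


executed: List[str] = []


def fake_execute(command: str) -> str:
    # Records the command instead of running it, to keep the sketch side-effect free.
    executed.append(command)
    return "ok"


print(run_with_confirmation("ls", fake_execute, confirm=lambda c: False))
print(run_with_confirmation("rm -rf build", fake_execute, confirm=lambda c: False))
print(run_with_confirmation("rm -rf build", fake_execute, confirm=lambda c: True))
print(executed)
```

In a UI, the confirm callback would block on the button click described above. A prefix check is easy to bypass with composed commands, so a policy like this is a convenience rather than a security boundary.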

For now, I think the easiest way would be (1) and (2) above. I'm not against getting structured output info for display, as long as the parsing process itself does not constrain or limit the agent in any way (i.e., it is just for the user's information).

The Action will have the exact same effect as the bash, so far as the agent can tell. But it lets us provide a lot more feedback to the user as to what the agent was doing.

So how about we keep this PR as is and let the agent have access to the raw bash commands, and we start a new PR that adds a field (e.g., OperationType) to the base ExecutableAction? The field would attempt to understand/classify what a generic bash Action is doing by parsing the content of a core action (e.g., BashRun) into a structured format that can be displayed by the frontend. This is just for user info and has NO way to impact the agent's performance. If an agent implementer prefers structured actions, they can wrap around these core actions in their implementation.

This way, we can get away with keeping only these core Actions without expanding them, while still having ways to get structured info out of them for a sense of control.
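
A sketch of the display-only label, under the assumption that it is derived from the command text and never changes it. `OperationType`, its members, the first-word mapping, and the `BashRun` stand-in are all hypothetical; the actual ExecutableAction class is not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum


class OperationType(Enum):
    """Hypothetical display-only labels for what a command appears to do."""
    READ = "read"
    NAVIGATE = "navigate"
    DELETE = "delete"
    UNKNOWN = "unknown"


# Assumed mapping from a command's first word to a label.
FIRST_WORD_LABELS = {
    "cat": OperationType.READ,
    "open": OperationType.READ,
    "scroll_up": OperationType.NAVIGATE,
    "scroll_down": OperationType.NAVIGATE,
    "rm": OperationType.DELETE,
}

# Any of these characters means the command may be composed, so do not guess.
SHELL_METACHARACTERS = set("|&;<>$`\"'\\")


@dataclass(frozen=True)
class BashRun:
    """Stand-in for a core action; the command string is never modified."""
    command: str

    @property
    def operation_type(self) -> OperationType:
        if SHELL_METACHARACTERS & set(self.command):
            return OperationType.UNKNOWN
        words = self.command.split()
        if not words:
            return OperationType.UNKNOWN
        return FIRST_WORD_LABELS.get(words[0], OperationType.UNKNOWN)


for cmd in ["scroll_up", "rm old.txt", "search_dir XXX > note.txt"]:
    print(cmd, "->", BashRun(cmd).operation_type.name)
```

Since the label is a read-only property computed from the command, the frontend can show it without any effect on what the agent runs.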

In the future, it is also possible to add a very fast, cheap, smaller LLM specialized in interpreting commands into OperationType to improve the parsing process (e.g., parsing scroll_up bash commands, similar to the argument extraction task I did in this paper). WDYT?

add missing _split_string
@rbren
Collaborator

rbren commented Apr 24, 2024

Or we just try to keep track of the entire workspace using git on the backend and make a commit on the backend for every action taken

I did this in an early agent I wrote! It works OK but kind of pollutes the git log. Probably something we could explore though...git is super powerful.

add a field (e.g., OperationType) to the base ExecutableAction that attempts to understand/classify what a generic bash Action is doing by parsing the content of a core action (e.g., BashRun) into a structured format that can be displayed by the frontend? This is just for user info and has NO way to impact the agent's performance. If an agent implementer prefers structured actions, they can wrap around these core actions in their implementation.

I'm not sure about this--let me think more before we go down this path

In the future, it is also possible to add a very fast, cheap, smaller LLM specialized in interpreting commands into OperationType to improve the parsing process

This is a really neat idea!

@rbren
Collaborator

rbren commented Apr 24, 2024

This definitely makes for a more powerful agent, so let's get it in, and then we can figure out the observability/structure piece later

@xingyaoww xingyaoww merged commit a0e8fcb into main Apr 24, 2024
@xingyaoww xingyaoww deleted the xw/swe-agent-tool-plugins branch April 24, 2024 14:50
2 participants