
[Evaluation] DiscoveryBench OpenHands Integration #4562

Closed
36 commits
961374a
init: add discoverybench files
Ethan0456 Oct 10, 2024
534adad
init: add discoverybench evaluation bash script
Ethan0456 Oct 10, 2024
d28cef0
refactor: move utils to eval_utils/
Ethan0456 Oct 10, 2024
ef4796f
refactor: reduce redundancy in log extraction function
Ethan0456 Oct 10, 2024
cb5b369
Merge branch 'All-Hands-AI:main' into discoverybench-openhands-integr…
Ethan0456 Oct 14, 2024
682f151
fix: modify response parser
Ethan0456 Oct 16, 2024
54949ea
chore: remove useless modules
Ethan0456 Oct 18, 2024
903a00d
fix: update discoverybench evaluation
Ethan0456 Oct 18, 2024
985eedf
feat: initialize runtime with libraries
Ethan0456 Oct 18, 2024
a97319b
init: add README
Ethan0456 Oct 18, 2024
ec2721b
docs: update README to add todo
Ethan0456 Oct 18, 2024
2f3689c
Create README.md
suranah Oct 24, 2024
622edf2
docs: Update run_infer.py to add TODO for docstrings
suranah Oct 24, 2024
e62082a
docs: add function doc strings
Ethan0456 Oct 24, 2024
26a831f
docs: add one line eval utils descriptions in README
Ethan0456 Oct 24, 2024
cf1f3c1
docs: Update README.md for more clarity
suranah Oct 25, 2024
a9673c5
docs: Update README.md for more clarity on DiscoveryBench process
suranah Oct 25, 2024
81c8271
docs: Update utils README.md
suranah Oct 25, 2024
edc134f
docs: Update discoverybench README.md to eval context
suranah Oct 25, 2024
6337c52
docs: Update formatting for discoverybench README.md
suranah Oct 25, 2024
fee00c3
docs: Update README.md for clarity
Ethan0456 Oct 25, 2024
811fb7d
chore: remove redundant comments
Ethan0456 Oct 25, 2024
32b1e4a
fix: clean up README formatting to pass linting
Ethan0456 Oct 25, 2024
4fb8ab6
Merge branch 'main' into eval/discoverybench-openhands-integration
suranah Oct 25, 2024
c834796
Fix for docker leak (#4560)
tofarr Oct 25, 2024
9571dc6
feat(eval): rewrite log_completions to save completions to directory …
xingyaoww Oct 25, 2024
ac07dce
fix(eval): add runtime.connect to all eval harness (#4565)
xingyaoww Oct 25, 2024
71a28eb
Small refactor : EventStream as a dataclass (#4557)
tofarr Oct 25, 2024
d328e0b
chore(deps): bump the version-all group across 1 directory with 8 upd…
dependabot[bot] Oct 25, 2024
21e3204
fix(controllor): make agent controller stops when encounter fatal obs…
xingyaoww Oct 26, 2024
409eeff
fix(controller): stop when run into loop (#4579)
xingyaoww Oct 27, 2024
96de4b5
Mention `build-essential` dependency for ubuntu in dev doc (#4511)
ryanhoangt Oct 27, 2024
07947cd
fix(builder): Build the runtime with docker version that contains (-)…
msehsah1 Oct 27, 2024
ba34d22
Remove verbose log from agent controller (#4585)
rbren Oct 27, 2024
fda3110
fix: add runtime.connect to discoverybench eval harness
Ethan0456 Oct 28, 2024
ad9f4c8
fix: unpack two return values from get_dv_query_for_real
Ethan0456 Oct 28, 2024
37 changes: 37 additions & 0 deletions evaluation/discoverybench/README.md
@@ -0,0 +1,37 @@
# DiscoveryBench with OpenHands

[DiscoveryBench](https://github.com/allenai/discoverybench/) [(Paper)](https://arxiv.org/abs/2407.01725v1) contains 264 tasks collected across 6 diverse domains, such as biology, economics, and sociology. It incorporates discovery workflows from published papers to approximate the real-world challenges faced by researchers.

<p align="center">
<a href="https://github.com/allenai/discoverybench">
<img src="https://raw.githubusercontent.com/allenai/discoverybench/refs/heads/main/assets/discoverybench-openhands-teaser.png" width="100%" alt="DiscoveryBench Background" />
</a>
</p>


## Setup Environment and LLM Configuration

1. Please follow the instructions [here](https://github.com/openlocus/OpenHands/blob/discoverybench-openhands-integration/evaluation/README.md#setup) to set up the OpenHands development environment and configure LLMs locally

2. Execute the bash script to start the DiscoveryBench evaluation:

```
./evaluation/discoverybench/scripts/run_infer.sh [YOUR MODEL CONFIG]
```
Replace `[YOUR MODEL CONFIG]` with the name of a model config that you have set up in `config.toml`.
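
For example, assuming a config section named `llm.eval_gpt4o` in your `config.toml` (the name here is illustrative, not prescribed by the script):

```
./evaluation/discoverybench/scripts/run_infer.sh llm.eval_gpt4o
```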


## Run Inference on DiscoveryBench Instances

When the `run_infer.sh` script starts, it automatically pulls the latest DiscoveryBench instances and sets up the agent environment. The OpenHands agent is then invoked to work on each task within this environment, producing a hypothesis, which is evaluated against the “gold” hypothesis provided by DiscoveryBench. The evaluation result, along with the agent chat history, is logged to `output.jsonl` under `evaluation_outputs`.
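
Since `output.jsonl` is a JSON Lines file (one record per evaluated instance), standard command-line tools are enough for a quick sanity check. A minimal sketch, assuming you have located the file under `evaluation_outputs`:

```
# Count how many instances were evaluated (one JSON record per line)
wc -l < output.jsonl

# Pretty-print the first record to inspect the logged fields
head -n 1 output.jsonl | python -m json.tool
```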


```
./evaluation/discoverybench/scripts/run_infer.sh [MODEL_CONFIG] [GIT_COMMIT] [AGENT] [EVAL_LIMIT] [NUM_WORKERS]
```

- `MODEL_CONFIG`: Name of the model config you want to evaluate with.
- `GIT_COMMIT`: Git commit hash or release tag of OpenHands to evaluate, e.g., `HEAD` or a specific tag like `0.6.2`.
- `AGENT`: Agent to use; currently only `CodeActAgent` is supported (see the example invocation below).
- `EVAL_LIMIT`: Number of instances to evaluate.
- `NUM_WORKERS`: Number of workers used to parallelize the evaluation.
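
Putting it together, a full invocation might look like the following sketch; the model config name is an assumption, while the remaining arguments follow the list above:

```
./evaluation/discoverybench/scripts/run_infer.sh llm.eval_gpt4o HEAD CodeActAgent 10 1
```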
7 changes: 7 additions & 0 deletions evaluation/discoverybench/eval_utils/README.md
@@ -0,0 +1,7 @@
## Evaluation Utils

- **`eval_w_subhypo_gen.py`**: Implements the DiscoveryBench logic for evaluating agent-generated hypotheses.
- **`lm_utils.py`**: Provides language-model utility functions used by the evaluation process.
- **`openai_helpers.py`**: Includes helper functions for OpenAI-related tasks.
- **`openai_semantic_gen_prompts.py`**: Contains prompts used for semantic generation.
- **`response_parser.py`**: Handles the parsing of agent-generated hypotheses.