
[Evaluation] DiscoveryBench OpenHands Integration #4562

Closed
36 commits
961374a
init: add discoverybench files
Ethan0456 Oct 10, 2024
534adad
init: add discoverybench evaluation bash script
Ethan0456 Oct 10, 2024
d28cef0
refactor: move utils to eval_utils/
Ethan0456 Oct 10, 2024
ef4796f
refactor: reduce redundancy in log extraction function
Ethan0456 Oct 10, 2024
cb5b369
Merge branch 'All-Hands-AI:main' into discoverybench-openhands-integr…
Ethan0456 Oct 14, 2024
682f151
fix: modify response parser
Ethan0456 Oct 16, 2024
54949ea
chore: remove useless modules
Ethan0456 Oct 18, 2024
903a00d
fix: update discoverybench evaluation
Ethan0456 Oct 18, 2024
985eedf
feat: initialize runtime with libraries
Ethan0456 Oct 18, 2024
a97319b
init: add README
Ethan0456 Oct 18, 2024
ec2721b
docs: update README to add todo
Ethan0456 Oct 18, 2024
2f3689c
Create README.md
suranah Oct 24, 2024
622edf2
docs: Update run_infer.py to add TODO for docstrings
suranah Oct 24, 2024
e62082a
docs: add function doc strings
Ethan0456 Oct 24, 2024
26a831f
docs: add one line eval utils descriptions in README
Ethan0456 Oct 24, 2024
cf1f3c1
docs: Update README.md for more clarity
suranah Oct 25, 2024
a9673c5
docs: Update README.md for more clarity on DiscoveryBench process
suranah Oct 25, 2024
81c8271
docs: Update utils README.md
suranah Oct 25, 2024
edc134f
docs: Update discoverybench README.md to eval context
suranah Oct 25, 2024
6337c52
docs: Update formatting for discoverybench README.md
suranah Oct 25, 2024
fee00c3
docs: Update README.md for clarity
Ethan0456 Oct 25, 2024
811fb7d
chore: remove redundant comments
Ethan0456 Oct 25, 2024
32b1e4a
fix: clean up README formatting to pass linting
Ethan0456 Oct 25, 2024
4fb8ab6
Merge branch 'main' into eval/discoverybench-openhands-integration
suranah Oct 25, 2024
c834796
Fix for docker leak (#4560)
tofarr Oct 25, 2024
9571dc6
feat(eval): rewrite log_completions to save completions to directory …
xingyaoww Oct 25, 2024
ac07dce
fix(eval): add runtime.connect to all eval harness (#4565)
xingyaoww Oct 25, 2024
71a28eb
Small refactor : EventStream as a dataclass (#4557)
tofarr Oct 25, 2024
d328e0b
chore(deps): bump the version-all group across 1 directory with 8 upd…
dependabot[bot] Oct 25, 2024
21e3204
fix(controllor): make agent controller stops when encounter fatal obs…
xingyaoww Oct 26, 2024
409eeff
fix(controller): stop when run into loop (#4579)
xingyaoww Oct 27, 2024
96de4b5
Mention `build-essential` dependency for ubuntu in dev doc (#4511)
ryanhoangt Oct 27, 2024
07947cd
fix(builder): Build the runtime with docker version that contains (-)…
msehsah1 Oct 27, 2024
ba34d22
Remove verbose log from agent controller (#4585)
rbren Oct 27, 2024
fda3110
fix: add runtime.connect to discoverybench eval harness
Ethan0456 Oct 28, 2024
ad9f4c8
fix: unpack two return values from get_dv_query_for_real
Ethan0456 Oct 28, 2024
37 changes: 37 additions & 0 deletions evaluation/discoverybench/README.md
@@ -0,0 +1,37 @@
# DiscoveryBench with OpenHands

[DiscoveryBench](https://github.com/allenai/discoverybench/) [(Paper)](https://arxiv.org/abs/2407.01725v1) contains 264 tasks collected across 6 diverse domains, such as biology, economics, and sociology. It incorporates discovery workflows from published papers to approximate the real-world challenges faced by researchers.

<p align="center">
<a href="https://github.com/allenai/discoverybench">
<img src="https://raw.githubusercontent.com/allenai/discoverybench/refs/heads/main/assets/discoverybench-openhands-teaser.png" width="100%" alt="DiscoveryBench Background" />
</a>
</p>


## Setup Environment and LLM Configuration

1. Please follow the instructions [here](https://github.com/openlocus/OpenHands/blob/discoverybench-openhands-integration/evaluation/README.md#setup) to set up the OpenHands development environment and configure LLMs locally

2. Execute the bash script to start the DiscoveryBench evaluation:

```
./evaluation/discoverybench/scripts/run_infer.sh [YOUR MODEL CONFIG]
```
Replace `[YOUR MODEL CONFIG]` with the name of a model config that you have set up in `config.toml`.
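
For example, assuming a config section named `llm.eval_gpt4o` in your `config.toml` (the name here is illustrative, not prescribed by the script):

```
./evaluation/discoverybench/scripts/run_infer.sh llm.eval_gpt4o
```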


## Run Inference on DiscoveryBench Instances

When the `run_infer.sh` script starts, it automatically pulls the latest DiscoveryBench instances and sets up the agent environment. The OpenHands agent is then invoked to work on each task within this environment, producing a hypothesis, which is evaluated against the “gold” hypothesis provided by DiscoveryBench. The evaluation result, along with the agent chat history, is logged to `output.jsonl` under `evaluation_outputs`.
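
Since `output.jsonl` is a JSON Lines file (one record per evaluated instance), standard command-line tools are enough for a quick sanity check. A minimal sketch, assuming you have located the file under `evaluation_outputs`:

```
# Count how many instances were evaluated (one JSON record per line)
wc -l < output.jsonl

# Pretty-print the first record to inspect the logged fields
head -n 1 output.jsonl | python -m json.tool
```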


```
./evaluation/discoverybench/scripts/run_infer.sh [MODEL_CONFIG] [GIT_COMMIT] [AGENT] [EVAL_LIMIT] [NUM_WORKERS]
```

- `MODEL_CONFIG`: Name of the model config you want to evaluate with.
- `GIT_COMMIT`: Git commit hash or release tag of OpenHands to evaluate, e.g., `HEAD` or a specific tag like `0.6.2`.
- `AGENT`: Agent to use; currently only `CodeActAgent` is supported (see the example invocation below).
- `EVAL_LIMIT`: Number of instances to evaluate.
- `NUM_WORKERS`: Number of workers used to parallelize the evaluation.
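
Putting it together, a full invocation might look like the following sketch; the model config name is an assumption, while the remaining arguments follow the list above:

```
./evaluation/discoverybench/scripts/run_infer.sh llm.eval_gpt4o HEAD CodeActAgent 10 1
```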
7 changes: 7 additions & 0 deletions evaluation/discoverybench/eval_utils/README.md
@@ -0,0 +1,7 @@
## Evaluation Utils

- **`eval_w_subhypo_gen.py`**: Implements the DiscoveryBench logic for evaluating agent-generated hypotheses.
- **`lm_utils.py`**: Provides language-model utility functions used by the evaluation process.
- **`openai_helpers.py`**: Includes helper functions for OpenAI-related tasks.
- **`openai_semantic_gen_prompts.py`**: Contains prompts used for semantic generation.
- **`response_parser.py`**: Handles the parsing of agent-generated hypotheses.