Add GPQA Diamond and fix evaluation deps #196
Merged

Commits (18)
- `20a3229` Add GPQA Diamond (lewtun)
- `0d43221` Add table (lewtun)
- `e4acb4b` Merge branch 'main' into lewtun/add-gpqa-cmd (lewtun)
- `b11bbe8` Fix README (lewtun)
- `107da00` Merge branch 'main' into lewtun/add-gpqa-cmd (lewtun)
- `8dc4c91` Up (lewtun)
- `cc10a80` Fixes (lewtun)
- `9fdcc7e` Ignore logs (lewtun)
- `665af3b` Fix (lewtun)
- `3c88f5e` Pin deps (lewtun)
- `c624fd4` Fix GRPO (lewtun)
- `9f3d1df` Add Llama 70B tabels (lewtun)
- `2f84345` Restore dp (lewtun)
- `abe7989` Merge branch 'main' into lewtun/add-gpqa-cmd (lewtun)
- `4566e00` Pin lighteval (lewtun)
- `78ac6a8` Use bfloat16 (lewtun)
- `7b5b322` Tune table (lewtun)
- `0dc6320` Add note (lewtun)
@@ -50,23 +50,23 @@ To install `uv`, follow the [UV Installation Guide](https://docs.astral.sh/uv/ge

```diff
-uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip
+uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip --link-mode=copy
```

Next, install vLLM:

```diff
-uv pip install vllm>=0.7.0
+uv pip install vllm==0.7.1

 # For CUDA 12.1
-pip install vllm>=0.7.0 --extra-index-url https://download.pytorch.org/whl/cu121
+uv pip install vllm==0.7.1 --extra-index-url https://download.pytorch.org/whl/cu121 --index-strategy unsafe-best-match --link-mode=copy
 export LD_LIBRARY_PATH=$(python -c "import site; print(site.getsitepackages()[0] + '/nvidia/nvjitlink/lib')"):$LD_LIBRARY_PATH
```

This will also install PyTorch `v2.5.1` and it is **very important** to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:

```diff
-pip install -e ".[dev]"
+GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]" --link-mode=copy
```
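Not part of the PR diff: a minimal sanity-check sketch (assuming the pinned install above) to confirm that the expected PyTorch and vLLM versions ended up in the environment:

```python
# Illustrative sanity check, not part of this PR: confirm the pinned versions,
# since the vLLM 0.7.1 binaries are compiled against PyTorch 2.5.1.
import torch
import vllm

print("torch:", torch.__version__)  # expected to start with 2.5.1
print("vllm:", vllm.__version__)    # expected 0.7.1
assert torch.__version__.startswith("2.5.1"), f"unexpected torch version: {torch.__version__}"
```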
Next, log into your Hugging Face and Weights and Biases accounts as follows:
@@ -141,30 +141,46 @@ We use `lighteval` to evaluate models, with custom tasks defined in `src/open_r1

```diff
 MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
-MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8"
-TASK=aime24
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8"
 OUTPUT_DIR=data/evals/$MODEL

+# AIME 2024
+TASK=aime24
 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
     --custom-tasks src/open_r1/evaluate.py \
     --use-chat-template \
     --output-dir $OUTPUT_DIR
+
+# MATH-500
+TASK=math_500
+lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
+    --custom-tasks src/open_r1/evaluate.py \
+    --use-chat-template \
+    --output-dir $OUTPUT_DIR
+
+# GPQA Diamond
+TASK=gpqa:diamond
+lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
+    --custom-tasks src/open_r1/evaluate.py \
+    --use-chat-template \
+    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
+    --output-dir $OUTPUT_DIR
```

> Review comment on the `--system-prompt` line: Not needed for the DeepSeek models (gives ~1 point gain if included)

> [!IMPORTANT]
> You must set `max_model_length=32768` in the `vllm` command to align with the `generation_size` we define per eval. Without this, `lighteval` will throw an error.

To increase throughput across multiple GPUs, use _data parallel_ as follows:

```diff
 NUM_GPUS=8
 MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
-MODEL_ARGS="pretrained=$MODEL,dtype=float16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
 TASK=aime24
 OUTPUT_DIR=data/evals/$MODEL

 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
     --custom-tasks src/open_r1/evaluate.py \
     --use-chat-template \
     --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
     --output-dir $OUTPUT_DIR
```
@@ -173,58 +189,105 @@ For large models which require sharding across GPUs, use _tensor parallel_ and r

```diff
 NUM_GPUS=8
 MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
-MODEL_ARGS="pretrained=$MODEL,dtype=float16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilisation=0.8"
 TASK=aime24
 OUTPUT_DIR=data/evals/$MODEL

 export VLLM_WORKER_MULTIPROC_METHOD=spawn
 lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
     --custom-tasks src/open_r1/evaluate.py \
     --use-chat-template \
     --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
     --output-dir $OUTPUT_DIR
```
You can also launch an evaluation with `make evaluate`, specifying the model, task, and optionally the parallelism technique and number of GPUs.

To evaluate on a single GPU:

```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24
```

To use Data Parallelism:

```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=data NUM_GPUS=8
```

To use Tensor Parallelism:

```shell
make evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=tensor NUM_GPUS=8
```
```diff
-## Reproducing Deepseek's evaluation results on MATH-500
-We are able to reproduce Deepseek's reported results on the MATH-500 Benchmark:
-| Model | MATH-500 (HF lighteval) | MATH-500 (DeepSeek Reported) |
-| :-------------------------- | :-------: | :----------------------------: |
-| DeepSeek-R1-Distill-Qwen-1.5B | 81.6 | 83.9 |
-| DeepSeek-R1-Distill-Qwen-7B | 91.8 | 92.8 |
-| DeepSeek-R1-Distill-Qwen-14B | 94.2 | 93.9 |
-| DeepSeek-R1-Distill-Qwen-32B | 95.0 | 94.3 |
-| DeepSeek-R1-Distill-Llama-8B | 85.8 | 89.1 |
-| DeepSeek-R1-Distill-Llama-70B | 93.4 | 94.5 |
```

## Reproducing Deepseek's evaluation results

> [!NOTE]
> The DeepSeek-R1 paper uses sampling with a temperature of 0.6, a top-p value of 0.95, and 64 responses per query to estimate `pass@1`. Below, we report the results from greedy decoding, which likely explains the small 1-3σ discrepancies between our results and theirs.
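As context for the note above (an illustrative sketch, not code from this PR): with n = 64 samples per query, `pass@1` is the average per-query fraction of correct samples; the standard unbiased `pass@k` estimator below reduces to exactly that for k = 1. The correctness counts in the example are made up.

```python
# Illustrative sketch, not code from this repo: the standard unbiased pass@k
# estimator for sampling-based evaluation. For k = 1 it reduces to the
# fraction of correct samples per query, averaged over queries.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples drawn for a query, c = correct samples, k = attempt budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical correct counts out of n = 64 samples (temperature 0.6, top-p 0.95).
per_query_correct = [40, 52, 61]
scores = [pass_at_k(64, c, 1) for c in per_query_correct]
print(f"pass@1 = {sum(scores) / len(scores):.3f}")  # mean over queries
```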
### MATH-500

We are able to reproduce Deepseek's reported results on the MATH-500 benchmark within ~1-3 standard deviations:

| Model                         | MATH-500 (🤗 LightEval) | MATH-500 (DeepSeek Reported) |
|:------------------------------|:-----------------------:|:----------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B | 81.2                    | 83.9                         |
| DeepSeek-R1-Distill-Qwen-7B   | 91.8                    | 92.8                         |
| DeepSeek-R1-Distill-Qwen-14B  | 94.2                    | 93.9                         |
| DeepSeek-R1-Distill-Qwen-32B  | 95.0                    | 94.3                         |
| DeepSeek-R1-Distill-Llama-8B  | 85.4                    | 89.1                         |
| DeepSeek-R1-Distill-Llama-70B | 93.4                    | 94.5                         |
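To make the "~1-3 standard deviations" framing concrete (an illustrative back-of-the-envelope, not part of the PR): treating a MATH-500 score as a binomial proportion over the 500 problems, one standard error is roughly sqrt(p(1-p)/500), so the 1.5B model's 2.7-point gap is about 1.5σ.

```python
# Illustrative only: approximate standard error of an accuracy measured on
# MATH-500 (500 problems), treating the score as a binomial proportion.
from math import sqrt

def binomial_se(p: float, n: int) -> float:
    return sqrt(p * (1.0 - p) / n)

p, n = 0.812, 500                       # e.g. DeepSeek-R1-Distill-Qwen-1.5B above
sigma = 100 * binomial_se(p, n)         # in percentage points
gap = 83.9 - 81.2                       # reported minus measured, in points
print(f"1 sigma ~ {sigma:.1f} points")  # ~1.7 points
print(f"gap ~ {gap / sigma:.1f} sigma") # ~1.5 sigma
```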
To reproduce these results use the following command:

```shell
NUM_GPUS=1 # Set to 8 for 32B and 70B models
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "custom|math_500|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

Alternatively, you can launch Slurm jobs as follows:

```shell
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-7B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-14B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Qwen-32B math_500 tp
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Llama-8B math_500
sbatch slurm/evaluate.slurm deepseek-ai/DeepSeek-R1-Distill-Llama-70B math_500 tp
python scripts/run_benchmarks.py --model-id={model_id} --benchmarks math_500
```
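`scripts/run_benchmarks.py` is referenced above but its contents are not shown in this diff. Purely as an illustration (a hypothetical sketch, not the actual script), a driver like it could wrap the `lighteval` command used throughout this section:

```python
# Hypothetical sketch only -- not the actual scripts/run_benchmarks.py in this PR.
# It simply wraps the lighteval invocation shown above for each requested benchmark.
import argparse
import subprocess

# Assumed mapping from benchmark names to lighteval task names (per the commands above).
TASKS = {"aime24": "aime24", "math_500": "math_500", "gpqa": "gpqa:diamond"}

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", required=True)
    parser.add_argument("--benchmarks", nargs="+", default=["math_500"])
    args = parser.parse_args()

    model_args = (
        f"pretrained={args.model_id},dtype=bfloat16,"
        "max_model_length=32768,gpu_memory_utilisation=0.8"
    )
    for benchmark in args.benchmarks:
        task = TASKS.get(benchmark, benchmark)
        subprocess.run(
            [
                "lighteval", "vllm", model_args, f"custom|{task}|0|0",
                "--custom-tasks", "src/open_r1/evaluate.py",
                "--use-chat-template",
                "--output-dir", f"data/evals/{args.model_id}",
            ],
            check=True,
        )

if __name__ == "__main__":
    main()
```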
### GPQA Diamond

We are able to reproduce Deepseek's reported results on the GPQA Diamond benchmark within ~1-3 standard deviations:

| Model                         | GPQA Diamond (🤗 LightEval) | GPQA Diamond (DeepSeek Reported) |
|:------------------------------|:---------------------------:|:--------------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B | 33.3                        | 33.8                             |
| DeepSeek-R1-Distill-Qwen-7B   | 48.4                        | 49.1                             |
| DeepSeek-R1-Distill-Qwen-14B  | 55.6                        | 59.1                             |
| DeepSeek-R1-Distill-Qwen-32B  | 58.6                        | 62.1                             |
| DeepSeek-R1-Distill-Llama-8B  | 51.0                        | 49.0                             |
| DeepSeek-R1-Distill-Llama-70B | 65.2                        | 65.2                             |

To reproduce these results use the following command:

```shell
NUM_GPUS=1 # Set to 8 for 32B and 70B models
MODEL=deepseek-ai/{model_name}
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilisation=0.8,tensor_parallel_size=$NUM_GPUS"
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "custom|gpqa:diamond|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

```shell
python scripts/run_benchmarks.py --model-id={model_id} --benchmarks gpqa
```
## Data generation
Review comments:

> Needed because `uv` cannot install `lighteval` otherwise due to some LFS file conflict.
> Ah, I had this issue, I had reverted back to pip, glad you fixed it.