Add GPQA Diamond and fix evaluation deps #196
Conversation
export LD_LIBRARY_PATH=$(python -c "import site; print(site.getsitepackages()[0] + '/nvidia/nvjitlink/lib')"):$LD_LIBRARY_PATH
```

This will also install PyTorch `v2.5.1` and it is **very important** to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:

```diff
-pip install -e ".[dev]"
+GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]" --link-mode=copy
```
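As a sanity check before exporting, a minimal sketch of what the inline Python snippet in the `export` line above resolves to (the `nvidia/nvjitlink/lib` directory only exists once the NVIDIA nvjitlink wheel has been pulled in, e.g. via vLLM; `python3` as the interpreter name is an assumption):

```shell
# Resolve the same path the export line computes:
# <site-packages>/nvidia/nvjitlink/lib of the active environment.
NVJITLINK_DIR=$(python3 -c "import site; print(site.getsitepackages()[0] + '/nvidia/nvjitlink/lib')")
echo "$NVJITLINK_DIR"
# Only worth prepending to LD_LIBRARY_PATH if the directory exists:
if [ -d "$NVJITLINK_DIR" ]; then
  echo "nvjitlink libs found"
else
  echo "nvjitlink libs missing (wheel not installed?)"
fi
```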
Needed because `uv` cannot install `lighteval` otherwise due to some LFS file conflict.
Ah, I had this issue too and had reverted back to pip. Glad you fixed it.
```shell
lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
```
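For context, here is a hypothetical way to fill in the variables the command above expects. The specific model, model args, and task name are assumptions for illustration, not values taken from this PR:

```shell
# Hypothetical values; the exact model args depend on your setup.
MODEL="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768"
TASK="gpqa:diamond"
# Task spec passed to lighteval: suite|task|num_fewshot|truncate_fewshot
echo "custom|$TASK|0|0"
```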
Not needed for the DeepSeek models (gives ~1 point gain if included)
setup.py (Outdated)
```diff
@@ -53,17 +53,17 @@
 "huggingface-hub[cli]>=0.19.2,<1.0",
 "isort>=5.12.0",
 "liger_kernel==0.5.2",
-"lighteval @ git+https://github.com/huggingface/lighteval.git@0e462692436e1f0575bdb4c6ef63453ad9bde7d4#egg=lighteval[math]",
+"lighteval @ git+https://github.com/huggingface/lighteval.git@3c9b0c9dde6718b23ef5b0f4960355f0d494bdfc#egg=lighteval[math]",
 "math-verify>=0.3.3", # Used for math verification in grpo
```
Bump to the latest commit once the vllm fix for DDP is merged: huggingface/lighteval#541
Done, it's 86f62259f105ae164f655e0b91c92a823a742724
* Add GPQA Diamond
* Add table
* Fix README
* Up
* Fixes
* Ignore logs
* Fix
* Pin deps
* Fix GRPO
* Add Llama 70B tables
* Restore dp
* Pin lighteval
* Use bfloat16
* Tune table
* Add note
Adds GPQA Diamond and various important fixes for evaluation (parsing & incompatibility between the latest `vllm` and `lighteval`). I've also unified the Slurm scripts for evaluation so we don't have multiple ways to eval models.

TODO