
Cannot replicate the performance of distilled 1.5B model #194

Closed
huangyuxiang03 opened this issue Feb 5, 2025 · 8 comments

Comments

@huangyuxiang03

Hi,
Thanks for your effort!
When I evaluate deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B on math_500 using the code provided in this repo, I cannot reproduce the reported performance. I'm only getting 0.756, while the reported score of open-r1 is 0.816 and DeepSeek reports 0.839 in their technical report. The script I'm using is provided below:

MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=math_500
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
    --save-details \
    --output-dir $OUTPUT_DIR 

Thanks for looking into this issue. Appreciate your work again!

@ChenDRAG

ChenDRAG commented Feb 5, 2025

Same here. I'm also looking into the code. It could be due to a difference in the temperature setting: the sampling temperature now defaults to 1.0, I think. I don't know whether that causes much of a problem.
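
If the default temperature turns out to matter, one way to pin the sampling parameters is through the model args string; a sketch, assuming your lighteval version supports the generation_parameters field (DeepSeek recommends temperature 0.6 and top-p 0.95 for the R1 distills):

# hypothetical MODEL_ARGS with explicit sampling parameters; syntax may differ by lighteval version
MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8,generation_parameters={temperature:0.6,top_p:0.95}"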

@lewtun
Member

lewtun commented Feb 6, 2025

Hello @huangyuxiang03 we've found a regression in our LaTeX parser and bumping to the new version should fix the discrepancy:

uv pip install latex2sympy2_extended==1.0.5

Please let me know if that works!
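
As a quick sanity check that the parser fix landed, you can round-trip a boxed answer through math-verify (which builds on latex2sympy2_extended); a minimal sketch, assuming the parse/verify API from the math-verify README:

python - <<'EOF'
from math_verify import parse, verify

# parse a gold answer and a \boxed{} prediction
gold = parse("$\\frac{1}{2}$")
pred = parse("\\boxed{0.5}")

# should print True if extraction and equivalence checking work
print(verify(gold, pred))
EOF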

@deepdata-foundation

Hello @lewtun, after successfully installing latex2sympy2_extended-1.0.6, my results for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B on math_500 and aime_24 are as follows:

[screenshot: math_500 results]

[screenshot: aime_24 results]

@lewtun
Member

lewtun commented Feb 6, 2025

Hmm that is odd: after merging #196 I am able to get

|      Task       |Version|     Metric     |Value|   |Stderr|
|-----------------|------:|----------------|----:|---|-----:|
|all              |       |extractive_match|0.822|±  |0.0171|
|custom:math_500:0|      1|extractive_match|0.822|±  |0.0171|

Can you please update to main and then re-install lighteval, along with the new math-verify and latex2sympy2_extended-1.0.6 versions?
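
Roughly, the update steps would look like this (a sketch, assuming a uv/pip-based setup; adjust to however you installed the repo):

git checkout main && git pull
uv pip install -U lighteval math-verify
uv pip install latex2sympy2_extended==1.0.6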

@micrazy

micrazy commented Feb 7, 2025

Hi, could you share your training parameters? I used the official script and my score was relatively low.

@deepdata-foundation

@lewtun
It works for DeepSeek-R1-Distill-Qwen-1.5B on math_500, as shown below:

[screenshot: math_500 results]

@deepdata-foundation

@lewtun
It also works for DeepSeek-R1-Distill-Qwen-1.5B on gpqa, as shown below:

[screenshot: gpqa results]

Could you please also update the evaluation to cover code generation?

@huangyuxiang03
Author

After installing latex2sympy2_extended-1.0.6 and merging #196, I'm getting the correct performance of 81.6.

|      Task       |Version|     Metric     |Value|   |Stderr|
|-----------------|------:|----------------|----:|---|-----:|
|all              |       |extractive_match|0.816|±  |0.0173|
|custom:math_500:0|      1|extractive_match|0.816|±  |0.0173|

Thank you all for the help. Since my problem is solved, I'm closing this issue.
