
Cannot replicate the performance of distilled 1.5B model #194

Closed
huangyuxiang03 opened this issue Feb 5, 2025 · 8 comments

Comments

@huangyuxiang03

Hi,
Thanks for your effort!
When I evaluate deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B on math_500 using the code provided in this repo, I cannot reproduce the reported performance. I'm only getting 0.756, while the reported score of open-r1 is 0.816 and DeepSeek reports 0.839 in their technical report. The script I'm using is provided below:

MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8"
TASK=math_500
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --system-prompt="Please reason step by step, and put your final answer within \boxed{}." \
    --save-details \
    --output-dir $OUTPUT_DIR 

Thanks for looking into this issue. Appreciate your work again!

@ChenDRAG

ChenDRAG commented Feb 5, 2025

Same here. I'm also looking into the code. It could be due to a difference in the temperature setting: the sampling temperature now defaults to 1.0, I think. I don't know whether that causes much of a problem.
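
If the default temperature turns out to matter, one way to pin the sampling parameters is through the model args string; a sketch, assuming your lighteval version supports the generation_parameters field (DeepSeek recommends temperature 0.6 and top-p 0.95 for the R1 distills):

# hypothetical MODEL_ARGS with explicit sampling parameters; syntax may differ by lighteval version
MODEL_ARGS="pretrained=$MODEL,dtype=float16,max_model_length=32768,gpu_memory_utilisation=0.8,generation_parameters={temperature:0.6,top_p:0.95}"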

@lewtun
Member

lewtun commented Feb 6, 2025

Hello @huangyuxiang03 we've found a regression in our LaTeX parser and bumping to the new version should fix the discrepancy:

uv pip install latex2sympy2_extended==1.0.5

Please let me know if that works!
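
As a quick sanity check that the parser fix landed, you can round-trip a boxed answer through math-verify (which builds on latex2sympy2_extended); a minimal sketch, assuming the parse/verify API from the math-verify README:

python - <<'EOF'
from math_verify import parse, verify

# parse a gold answer and a \boxed{} prediction
gold = parse("$\\frac{1}{2}$")
pred = parse("\\boxed{0.5}")

# should print True if extraction and equivalence checking work
print(verify(gold, pred))
EOF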

@deepdata-foundation

Hello @lewtun, after successfully installing latex2sympy2_extended-1.0.6, my results for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B on math_500 and aime_24 are as follows:

[screenshot: math_500 results]

[screenshot: aime_24 results]

@lewtun
Member

lewtun commented Feb 6, 2025

Hmm that is odd: after merging #196 I am able to get

|      Task       |Version|     Metric     |Value|   |Stderr|
|-----------------|------:|----------------|----:|---|-----:|
|all              |       |extractive_match|0.822|±  |0.0171|
|custom:math_500:0|      1|extractive_match|0.822|±  |0.0171|

Can you please update to main and then re-install lighteval, along with the new math-verify and latex2sympy2_extended-1.0.6 versions?
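
Roughly, the update steps would look like this (a sketch, assuming a uv/pip-based setup; adjust to however you installed the repo):

git checkout main && git pull
uv pip install -U lighteval math-verify
uv pip install latex2sympy2_extended==1.0.6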

@micrazy

micrazy commented Feb 7, 2025

Hi, could you share your training parameters? I used the official script and my score was relatively low.

@deepdata-foundation

@lewtun
It works for DeepSeek-R1-Distill-Qwen-1.5B on math_500, as shown below:

[screenshot: math_500 results]

@deepdata-foundation

@lewtun
It also works for DeepSeek-R1-Distill-Qwen-1.5B on gpqa, as shown below:

[screenshot: gpqa results]

Could you please also update the evaluation to cover code generation?

@huangyuxiang03
Author

After installing latex2sympy2_extended-1.0.6 and merging #196, I'm getting the correct performance of 81.6.

|      Task       |Version|     Metric     |Value|   |Stderr|
|-----------------|------:|----------------|----:|---|-----:|
|all              |       |extractive_match|0.816|±  |0.0173|
|custom:math_500:0|      1|extractive_match|0.816|±  |0.0173|

Thank you all for the help. Since my problem is solved, I'm closing this issue.
