Hi, thanks for sharing this impressive work. While evaluating Dream-org/Dream-Coder-v0-Instruct-7B on the HumanEval and MBPP benchmarks, I found that my reproduced results fall noticeably short of the numbers reported in the paper.
For HumanEval, I used a command aligned with the instructions in Dream's official repository (https://github.com/DreamLM/Dream/blob/main/eval_instruct/eval.sh):
HF_ALLOW_CODE_EVAL=1 PYTHONPATH=. accelerate launch --main_process_port 12334 -m lm_eval \
--model diffllm \
--model_args pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,max_new_tokens=768,diffusion_steps=768,dtype="bfloat16",temperature=0.1,top_p=0.9,alg="entropy" \
--tasks humaneval_instruct \
--device cuda \
--batch_size 1 \
--num_fewshot 0 \
--output_path output_reproduce/humaneval \
--log_samples --confirm_run_unsafe_code \
    --apply_chat_template
My reproduction yields a HumanEval pass@1 of 74.39:
diffllm (pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,max_new_tokens=768,diffusion_steps=768,dtype=bfloat16,temperature=0.1,top_p=0.9,alg=entropy), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1
| Tasks |Version| Filter |n-shot|Metric| |Value | |Stderr|
|------------------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval_instruct| 2|create_test| 0|pass@1| |0.7439|± |0.0342|
This is significantly below the 82.9 reported in the paper.
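To rule out simple run-to-run variance, a quick back-of-the-envelope check in Python (my own arithmetic, using the paper's number and the stderr printed by lm_eval above) shows the gap is roughly 2.5 standard errors:

reported = 0.829      # HumanEval pass@1 reported in the paper
reproduced = 0.7439   # pass@1 from my lm_eval run above
stderr = 0.0342       # stderr printed by lm_eval

gap = reported - reproduced
print(f"gap = {gap:.4f}, i.e. {gap / stderr:.1f} standard errors")
# -> gap = 0.0851, i.e. 2.5 standard errors

So the difference looks systematic rather than sampling noise alone.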
For MBPP, I used the command:
HF_ALLOW_CODE_EVAL=1 PYTHONPATH=. accelerate launch --main_process_port 12334 -m lm_eval \
--model diffllm \
--model_args pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,max_new_tokens=1024,diffusion_steps=1024,dtype="bfloat16",temperature=0.1,top_p=0.9,alg="entropy" \
--tasks mbpp_instruct \
--device cuda \
--batch_size 1 \
--num_fewshot 0 \
--output_path output_reproduce/mbpp \
--log_samples --confirm_run_unsafe_code \
    --apply_chat_template
My reproduced MBPP score is 66.84, which is also lower than the 79.6 reported in the paper.
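Since --log_samples is enabled in both runs, the per-sample generations are on disk. Below is a minimal sketch of how one might skim them for empty or truncated completions; the file layout and field names (e.g. "filtered_resps", "doc_id") are my assumptions about lm_eval's --log_samples JSONL format and may need adjusting for other versions.

import glob
import json

# Assumed layout: lm_eval writes samples_*.jsonl somewhere under --output_path.
for path in glob.glob("output_reproduce/**/samples_*.jsonl", recursive=True):
    print(f"== {path}")
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            # "filtered_resps" should hold the post-filter code; fall back to raw "resps".
            resps = rec.get("filtered_resps") or rec.get("resps") or [""]
            text = resps[0]
            if isinstance(text, list):  # raw resps can be nested per repeat
                text = text[0] if text else ""
            # Printing lengths makes empty or obviously truncated completions stand out.
            print(f"doc_id={rec.get('doc_id')}\tchars={len(text)}")

I'm happy to share these logs if that helps with diagnosis.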
Could you clarify why these discrepancies occur or whether additional evaluation settings are required to match the results in the paper? Thank you!