Performance Gap Between Reproduction and Paper Results on HumanEval and MBPP #11

Hi, thanks for sharing this impressive work. While evaluating Dream-org/Dream-Coder-v0-Instruct-7B on the HumanEval and MBPP benchmarks, I observed that the reproduced results do not match the performance reported in the paper.

For HumanEval, I used a command aligned with the instructions in Dream's official repository (https://github.com/DreamLM/Dream/blob/main/eval_instruct/eval.sh):

HF_ALLOW_CODE_EVAL=1 PYTHONPATH=. accelerate launch --main_process_port 12334 -m lm_eval \
    --model diffllm \
    --model_args pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,max_new_tokens=768,diffusion_steps=768,dtype="bfloat16",temperature=0.1,top_p=0.9,alg="entropy" \
    --tasks humaneval_instruct \
    --device cuda \
    --batch_size 1 \
    --num_fewshot 0 \
    --output_path output_reproduce/humaneval \
    --log_samples --confirm_run_unsafe_code \
    --apply_chat_template

My reproduction yields a HumanEval pass@1 of 74.39:

diffllm (pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,max_new_tokens=768,diffusion_steps=768,dtype=bfloat16,temperature=0.1,top_p=0.9,alg=entropy), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1
|      Tasks       |Version|  Filter   |n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval_instruct|      2|create_test|     0|pass@1|   |0.7439|±  |0.0342|

This is significantly below the 82.9 reported in the paper.
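
My understanding is that with a single sample per problem, pass@1 reduces to the fraction of solved problems, so the gap corresponds to roughly 14 of the 164 HumanEval tasks (about 122 solved here versus roughly 136 implied by the paper's 82.9). For reference, here is a minimal sketch of the unbiased pass@k estimator from the original HumanEval paper; it is an illustration of how I read the metric, not the harness's exact implementation:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # n = samples generated per problem, c = samples that pass the unit tests.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With one sample per problem (n = k = 1) the estimator is 1 if the sample
# passes and 0 otherwise, so the benchmark score is just the solve rate:
# ~122/164 = 0.7439 here, versus ~136/164 = 0.829 in the paper.
print(pass_at_k(1, 1, 1), pass_at_k(1, 0, 1))  # 1.0 0.0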


For MBPP, I used the command:

HF_ALLOW_CODE_EVAL=1 PYTHONPATH=. accelerate launch --main_process_port 12334 -m lm_eval \
    --model diffllm \
    --model_args pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,max_new_tokens=1024,diffusion_steps=1024,dtype="bfloat16",temperature=0.1,top_p=0.9,alg="entropy" \
    --tasks mbpp_instruct \
    --device cuda \
    --batch_size 1 \
    --num_fewshot 0 \
    --output_path output_reproduce/mbpp \
    --log_samples --confirm_run_unsafe_code \
    --apply_chat_template

My reproduced MBPP pass@1 is 66.84, which is also well below the 79.6 reported in the paper.


Could you clarify why these discrepancies occur or whether additional evaluation settings are required to match the results in the paper? Thank you!
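
In case it helps narrow things down, below is a minimal sketch of the sweep I could run on my side over the sampling settings, assuming temperature and diffusion_steps are the knobs most likely to explain the gap; every other flag is copied verbatim from my HumanEval command above, and the candidate values are guesses (the same loop would apply to mbpp_instruct):

import itertools
import os
import subprocess

for temperature, steps in itertools.product([0.1, 0.2, 0.5], [512, 768]):
    # Same model_args string as above, with only the swept values substituted.
    model_args = (
        "pretrained=Dream-org/Dream-Coder-v0-Instruct-7B,trust_remote_code=True,"
        f"max_new_tokens=768,diffusion_steps={steps},dtype=bfloat16,"
        f"temperature={temperature},top_p=0.9,alg=entropy"
    )
    subprocess.run(
        [
            "accelerate", "launch", "--main_process_port", "12334",
            "-m", "lm_eval",
            "--model", "diffllm",
            "--model_args", model_args,
            "--tasks", "humaneval_instruct",
            "--device", "cuda",
            "--batch_size", "1",
            "--num_fewshot", "0",
            "--output_path", f"output_reproduce/humaneval_t{temperature}_s{steps}",
            "--log_samples", "--confirm_run_unsafe_code",
            "--apply_chat_template",
        ],
        env={**os.environ, "HF_ALLOW_CODE_EVAL": "1", "PYTHONPATH": "."},
        check=True,
    )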
