
Cannot reproduce zephyr-7b-gemma-v0.1 #148

Closed · jasonyux opened this issue Apr 4, 2024 · 3 comments

jasonyux commented Apr 4, 2024

I tried to reproduce zephyr-7b-gemma-v0.1 using the exact code provided in this repository on 4xA100 GPUs. However, the resulting MT-bench score was much lower than reported: 6.63, versus the 7.81 reported on the Hugging Face model page.

I wonder if anyone else is encountering this issue?

Command run (the same as in the repo, but with gradient_accumulation_steps doubled since I am using only 4xA100 GPUs; see the effective-batch-size check after the command):

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml \
scripts/run_dpo.py recipes/zephyr-7b-gemma/dpo/config_full.yaml \
--output_dir=xxx/zephyr-7b-gemma-dpo-full_reprod \
--num_train_epochs=2 \
--gradient_accumulation_steps=16
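
As a sanity check on the doubled accumulation steps, here is a minimal sketch (the per-device batch size of 2 is an assumed illustration value, not taken from the recipe): halving the GPU count while doubling gradient_accumulation_steps keeps the effective batch size unchanged relative to the 8-GPU recipe.

# Hypothetical numbers; only the proportionality matters.
per_device_bs = 2                        # assumed per-device train batch size
effective_4gpu = 4 * per_device_bs * 16  # 4 GPUs, grad accum 16 -> 128
effective_8gpu = 8 * per_device_bs * 8   # 8 GPUs, grad accum 8  -> 128
assert effective_4gpu == effective_8gpu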

and when generating model answers for MT-bench I used the default command:

python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID]

Related library versions I used:

  • Python 3.8.10 (I had to convert some source-code type annotations from "ClassA | ClassB" to "Union[ClassA, ClassB]"; see the sketch after this list)
  • torch==2.1.2+cu118, transformers==4.39.1, trl==0.8.1, flash-attn==2.5.6, fschat==0.2.36
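
A minimal sketch of the kind of change the Python 3.8.10 bullet refers to (ClassA/ClassB are placeholder names): the PEP 604 "X | Y" union syntax in annotations only works on Python 3.10+, so on 3.8 it has to fall back to typing.Union.

from typing import Union

class ClassA: ...
class ClassB: ...

# Python 3.10+ only:
# def load(model: ClassA | ClassB) -> None: ...

# Python 3.8-compatible equivalent:
def load(model: Union[ClassA, ClassB]) -> None: ...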

Training curves from wandb: [image omitted]

Eval reward curves: [image omitted]

jasonyux (Author) commented:
It seems that the issue is with the chat templates used by fastchat during evaluation. Registering the following template to test H4's gemma models recovers the reported performance:

# Conversation and SeparatorStyle were missing from the original snippet's imports.
from fastchat.conversation import Conversation, SeparatorStyle, register_conv_template

# ChatML-style template matching the chat format H4's gemma models were trained with.
register_conv_template(
    Conversation(
        name="templ=h4_gemma_chatml",
        system_template="<bos><|im_start|>system\n{system_message}",
        system_message="You are an AI assistant.",
        roles=("<|im_start|>user", "<|im_start|>assistant"),
        sep_style=SeparatorStyle.CHATML,
        sep="<|im_end|>",
        stop_str=["<|im_end|>", "<|endoftext|>"],
    )
)

# other init code omitted
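
For reference, a small usage sketch (assuming a standard fschat install; the message text is made up) showing how to pull the registered template back out and inspect the exact prompt string it produces:

from fastchat.conversation import get_conv_template

conv = get_conv_template("templ=h4_gemma_chatml")
conv.append_message(conv.roles[0], "What is 2 + 2?")  # user turn (placeholder text)
conv.append_message(conv.roles[1], None)              # leave the assistant turn open
print(conv.get_prompt())                              # ChatML-formatted prompt string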

fanconic commented May 7, 2024

May I ask where this template originates from?

jasonyux (Author) commented:

This comes from how the model is trained by the run_dpo.py script. In that script, the chat data is first formatted using the tokenizer's chat template and then fed into the Trainer. Unless you use (perhaps) the latest version of fschat (which hardcodes its templates), fschat will not apply that same template during evaluation, which leads to the performance degradation. A sketch of the training-side formatting is below.
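
To illustrate (a minimal sketch; the model ID is the published checkpoint and the message content is made up), this is the tokenizer-side formatting that training relies on, and which the fastchat template above has to reproduce:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-gemma-v0.1")
messages = [
    {"role": "system", "content": "You are an AI assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]
# Produces a ChatML-style string with <|im_start|>/<|im_end|> markers,
# matching the template registered above.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)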
