Please correct the following DeepSpeed config values that mismatch TrainingArguments values: scheduler.params.total_num_steps=0 vs hf num_training_steps (calculated)= 260 #29348
Comments
cc @pacman100 and @SunMarc

Gentle ping, @pacman100

Another ping, @pacman100

Any update on this issue?
Me too, I'm getting this issue.
@Refinath or @rexxxx1234 can you please provide the code for your deepspeed config?
Hi @Refinath, I notice you're using trl's DPOTrainer, not transformer's Trainer. After looking into it, it seems like trl is not correctly replacing "auto" in the deepspeed config.

@srcao-bingo @rexxxx1234 can you please provide the code for your TrainingArguments, Trainer, and the deepspeed config you are using? Are you using transformer's Trainer?
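If trl is indeed leaving "auto" unresolved, a possible stopgap is to resolve it by hand before constructing the trainer. Below is a minimal sketch; the config path, dataset size, and all hyperparameters are hypothetical placeholders, not values from this issue:

```python
import json
import math

# Load the DeepSpeed config (hypothetical path).
with open("ds_config.json") as f:
    ds_config = json.load(f)

# Recompute what the HF Trainer would calculate as num_training_steps,
# using hypothetical sizes -- substitute your own.
num_examples = 1000          # dataset size
per_device_batch_size = 4
grad_accum_steps = 2
world_size = 1
num_epochs = 2

updates_per_epoch = math.ceil(
    num_examples / (per_device_batch_size * grad_accum_steps * world_size)
)
total_num_steps = updates_per_epoch * num_epochs  # 250 with the values above

# Replace the unresolved "auto" (or a stale 0) so the consistency check passes.
sched_params = ds_config.get("scheduler", {}).get("params", {})
if sched_params.get("total_num_steps") in ("auto", 0):
    sched_params["total_num_steps"] = total_num_steps
```

The patched dict can then be passed to the trainer through TrainingArguments(deepspeed=ds_config), instead of pointing deepspeed= at the JSON file.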
System Info
transformers version: 4.36.2

Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
raise ValueError(
ValueError: Please correct the following DeepSpeed config values that mismatch TrainingArguments values:
scheduler.params.total_num_steps=0 vs hf num_training_steps (calculated)= 260
The easiest method is to set these DeepSpeed config values to 'auto'.
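For context, the check behind this message compares each concrete DeepSpeed config value against the value the Trainer computed from TrainingArguments; only "auto" is exempt. An illustrative sketch of that logic (not the actual transformers source):

```python
def check_scheduler_steps(ds_total_num_steps, hf_num_training_steps):
    # A concrete value must match the Trainer's calculation; "auto" defers
    # to the Trainer and always passes.
    if ds_total_num_steps != "auto" and ds_total_num_steps != hf_num_training_steps:
        raise ValueError(
            "Please correct the following DeepSpeed config values that "
            "mismatch TrainingArguments values:\n"
            f"scheduler.params.total_num_steps={ds_total_num_steps} "
            f"vs hf num_training_steps (calculated)= {hf_num_training_steps}\n"
            "The easiest method is to set these DeepSpeed config values to 'auto'."
        )

check_scheduler_steps(0, 260)  # reproduces the message from this issue's title
```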
When I use transformers==4.28.1 + deepspeed==0.13.3 for Llama2 fine-tuning, the code runs normally and training completes. This error occurs when I upgrade transformers to 4.36.x, 4.37.x, or 4.38.1.
And I have not modified DeepSpeed's default_offload_opt_param.json file; in that file, the value of scheduler.params.total_num_steps is always "auto".
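For reference, the scheduler section of such a config typically looks like the sketch below, written as a Python dict since TrainingArguments accepts either a dict or a path to the JSON file. This is a representative shape, not the exact contents of default_offload_opt_param.json:

```python
from transformers import TrainingArguments

# Representative DeepSpeed scheduler section with "auto" values; the Trainer
# is supposed to fill these in from TrainingArguments at init time.
ds_config = {
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto",  # the key this issue is about
        },
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# Hypothetical usage (requires deepspeed and accelerate installed):
args = TrainingArguments(output_dir="out", deepspeed=ds_config)
```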
Expected behavior
Please fix this bug.