
Please correct the following DeepSpeed config values that mismatch TrainingArguments values: scheduler.params.total_num_steps=0 vs hf num_training_steps (calculated)= 260 #29348

Tracked by #33345
srcao-bingo opened this issue Feb 28, 2024 · 9 comments

Comments

@srcao-bingo

srcao-bingo commented Feb 28, 2024

System Info

  • transformers version: 4.36.2
  • Platform: Linux-4.15.0-213-generic-x86_64-with-glibc2.27
  • Python version: 3.9.18
  • Huggingface_hub version: 0.21.1
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

raise ValueError(
ValueError: Please correct the following DeepSpeed config values that mismatch TrainingArguments values:

  • ds scheduler.params.total_num_steps=0 vs hf num_training_steps (calculated)=260
    The easiest method is to set these DeepSpeed config values to 'auto'.

When I use transformers==4.28.1 + deepspeed==0.13.3 for Llama2 fine-tuning, the code runs normally and training completes. This error occurs when I upgrade transformers to 4.36.x, 4.37.x, or 4.38.1. I have not modified DeepSpeed's default_offload_opt_param.json file; its contents are as follows:

{
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 5,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

The value of scheduler.params.total_num_steps is always "auto".
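For context, the "hf num_training_steps (calculated)=260" side of the error is the step count the Trainer derives from the dataset and batch settings, which it then compares against (or substitutes for) the DeepSpeed scheduler's total_num_steps. A hedged sketch of that calculation, using hypothetical dataset and batch numbers (not taken from this issue), just to show how a value like 260 arises:

```python
import math

# Hedged sketch of how transformers' Trainer derives num_training_steps.
# The real calculation lives inside Trainer; the numbers below are made up.
def num_training_steps(dataset_len, per_device_batch, world_size,
                       grad_accum, epochs):
    # Effective batch per optimizer step across all processes.
    effective_batch = per_device_batch * world_size * grad_accum
    steps_per_epoch = math.ceil(dataset_len / effective_batch)
    return steps_per_epoch * epochs

# e.g. 520 samples, batch 2 per device, 2 GPUs, no accumulation, 2 epochs:
print(num_training_steps(520, 2, 2, 1, 2))  # 260
```

When total_num_steps is "auto", the HF DeepSpeed integration is supposed to fill it with this computed value during trainer_config_finalize; the error above indicates it was instead resolved to 0.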

Expected behavior

Please fix this bug.

@ArthurZucker
Collaborator

cc @pacman100 and @SunMarc

@ArthurZucker
Collaborator

Gentle ping, @pacman100

@amyeroberts
Collaborator

Another ping @pacman100

@amyeroberts
Collaborator

cc @SunMarc @muellerzr

@rexxxx1234

Any update on this issue?

@Refinath

Refinath commented Jul 3, 2024

I am hitting this issue too.

@muellerzr
Contributor

@Refinath or @rexxxx1234, can you please provide the code for your TrainingArguments and Trainer, and the DeepSpeed config you are using? Thanks!

@Refinath

Refinath commented Jul 5, 2024

@muellerzr

training_args = DPOConfig(
    learning_rate=args.lr,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.batch_size,
    output_dir='./results',
    logging_steps=10,
    remove_unused_columns=False,
    max_length=1024,
    max_prompt_length=512,
    fp16=True,
    deepspeed="ds_config.json"
)

dpo_trainer = DPOTrainer(
    model,
    ref_model,
    beta=beta,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    args=training_args,
)

deepspeed config

{
  "resource": {
    "num_gpus": 0  
  },
  "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
  },
  "optimizer": {
      "type": "AdamW",
      "params": {
          "lr": "auto",
          "weight_decay": "auto",
          "torch_adam": true,
          "adam_w_mode": true
      }
  },
  "scheduler": {
      "type": "WarmupDecayLR",
      "params": {
          "warmup_min_lr": "auto",
          "warmup_max_lr": "auto",
          "warmup_num_steps": 0,
          "total_num_steps": 58
      }
  },
  "zero_optimization": {
      "stage": 3,
      "allgather_partitions": true,
      "allgather_bucket_size": 2e8,
      "overlap_comm": true,
      "reduce_scatter": true,
      "reduce_bucket_size": "auto",
      "contiguous_gradients": true,
      "stage3_gather_16bit_weights_on_model_save": "auto"
  },
  "gradient_accumulation_steps": 1,
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}


@Ben-Schneider-code
Contributor

Ben-Schneider-code commented Oct 2, 2024

Hi @Refinath, I notice you're using trl's DPOTrainer, not transformers' Trainer. After looking into it, it seems trl does not correctly replace the "auto" values in the DeepSpeed config (i.e. it does not call hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)) when loading its reference model. The current workaround is to set total_num_steps explicitly to the number of training steps for your dataset. I'll follow up with trl, and hopefully that can be resolved.
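Until that is fixed in trl, one possible workaround is to resolve the scheduler's "auto" values yourself before handing the config to DPOConfig. This is a hypothetical helper, not an official API; it assumes the standard WarmupDecayLR scheduler block and that you compute the step count the same way the Trainer would:

```python
import json
import math

# Hypothetical workaround: pre-fill the scheduler's "auto" step counts in a
# DeepSpeed config dict, since trl may skip trainer_config_finalize for the
# reference model. Names and parameters here are illustrative assumptions.
def fill_scheduler_steps(ds_config, dataset_len, per_device_batch,
                         world_size, grad_accum, epochs, warmup_ratio=0.0):
    effective_batch = per_device_batch * world_size * grad_accum
    total = math.ceil(dataset_len / effective_batch) * epochs
    params = ds_config["scheduler"]["params"]
    if params.get("total_num_steps") == "auto":
        params["total_num_steps"] = total
    if params.get("warmup_num_steps") == "auto":
        params["warmup_num_steps"] = int(total * warmup_ratio)
    return ds_config

cfg = {"scheduler": {"type": "WarmupDecayLR",
                     "params": {"total_num_steps": "auto",
                                "warmup_num_steps": "auto"}}}
# 116 samples, batch 2, 1 GPU, no accumulation, 1 epoch -> 58 total steps
cfg = fill_scheduler_steps(cfg, 116, 2, 1, 1, 1)
print(cfg["scheduler"]["params"]["total_num_steps"])  # 58
```

You would then pass the patched dict (rather than the JSON file path) as the deepspeed argument, e.g. deepspeed=cfg, so trl never sees the unresolved "auto" values.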

@srcao-bingo @rexxxx1234, can you please provide the code for your TrainingArguments and Trainer, and the DeepSpeed config you are using? Are you using transformers' Trainer?
