
Please correct the following DeepSpeed config values that mismatch TrainingArguments values: scheduler.params.total_num_steps=0 vs hf num_training_steps (calculated)= 260 #29348

Tracked by #33345
srcao-bingo opened this issue Feb 28, 2024 · 9 comments

Comments

@srcao-bingo

srcao-bingo commented Feb 28, 2024

System Info

  • transformers version: 4.36.2
  • Platform: Linux-4.15.0-213-generic-x86_64-with-glibc2.27
  • Python version: 3.9.18
  • Huggingface_hub version: 0.21.1
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

raise ValueError(
ValueError: Please correct the following DeepSpeed config values that mismatch TrainingArguments values:

  • ds scheduler.params.total_num_steps=0 vs hf num_training_steps (calculated)=260
    The easiest method is to set these DeepSpeed config values to 'auto'.

When I use transformers==4.28.1 + deepspeed==0.13.3 for Llama2 fine-tuning, the code runs normally and training completes. This error occurs when I upgrade transformers to 4.36.x, 4.37.x, or 4.38.1. I have not modified DeepSpeed's default_offload_opt_param.json file; its contents are as follows:

{
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 5,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

The value of scheduler.params.total_num_steps is always "auto".
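For context, the "hf num_training_steps (calculated)=260" side of the error is the step count the Trainer derives from the dataset and batch settings, which it then compares against (or substitutes for) the DeepSpeed scheduler's total_num_steps. A hedged sketch of that calculation, using hypothetical dataset and batch numbers (not taken from this issue), just to show how a value like 260 arises:

```python
import math

# Hedged sketch of how transformers' Trainer derives num_training_steps.
# The real calculation lives inside Trainer; the numbers below are made up.
def num_training_steps(dataset_len, per_device_batch, world_size,
                       grad_accum, epochs):
    # Effective batch per optimizer step across all processes.
    effective_batch = per_device_batch * world_size * grad_accum
    steps_per_epoch = math.ceil(dataset_len / effective_batch)
    return steps_per_epoch * epochs

# e.g. 520 samples, batch 2 per device, 2 GPUs, no accumulation, 2 epochs:
print(num_training_steps(520, 2, 2, 1, 2))  # 260
```

When total_num_steps is "auto", the HF DeepSpeed integration is supposed to fill it with this computed value during trainer_config_finalize; the error above indicates it was instead resolved to 0.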

Expected behavior

Please fix this bug.

@ArthurZucker
Collaborator

cc @pacman100 and @SunMarc

@ArthurZucker
Collaborator

Gentle ping, @pacman100

@amyeroberts
Collaborator

Another ping @pacman100

@amyeroberts
Collaborator

cc @SunMarc @muellerzr

@rexxxx1234

Any update on this issue?

@Refinath

Refinath commented Jul 3, 2024

I am hitting this issue too.

@muellerzr
Contributor

@Refinath or @rexxxx1234, can you please provide the code for your TrainingArguments and Trainer, and the DeepSpeed config you are using? Thanks!

@Refinath

Refinath commented Jul 5, 2024

@muellerzr

training_args = DPOConfig(
    learning_rate=args.lr,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.batch_size,
    output_dir='./results',
    logging_steps=10,
    remove_unused_columns=False,
    max_length=1024,
    max_prompt_length=512,
    fp16=True,
    deepspeed="ds_config.json"
)

dpo_trainer = DPOTrainer(
    model,
    ref_model,
    beta=beta,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    args=training_args,
)

deepspeed config

{
  "resource": {
    "num_gpus": 0  
  },
  "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
  },
  "optimizer": {
      "type": "AdamW",
      "params": {
          "lr": "auto",
          "weight_decay": "auto",
          "torch_adam": true,
          "adam_w_mode": true
      }
  },
  "scheduler": {
      "type": "WarmupDecayLR",
      "params": {
          "warmup_min_lr": "auto",
          "warmup_max_lr": "auto",
          "warmup_num_steps": 0,
          "total_num_steps": 58
      }
  },
  "zero_optimization": {
      "stage": 3,
      "allgather_partitions": true,
      "allgather_bucket_size": 2e8,
      "overlap_comm": true,
      "reduce_scatter": true,
      "reduce_bucket_size": "auto",
      "contiguous_gradients": true,
      "stage3_gather_16bit_weights_on_model_save": "auto"
  },
  "gradient_accumulation_steps": 1,
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}


@Ben-Schneider-code
Contributor

Ben-Schneider-code commented Oct 2, 2024

Hi @Refinath, I notice you're using trl's DPOTrainer, not transformers' Trainer. After looking into it, it seems trl does not correctly replace the "auto" values in the DeepSpeed config (i.e. it does not call hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)) when loading its reference model. The current workaround is to set total_num_steps explicitly to the number of training steps for your dataset. I'll follow up with trl, and hopefully that can be resolved.
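Until that is fixed in trl, one possible workaround is to resolve the scheduler's "auto" values yourself before handing the config to DPOConfig. This is a hypothetical helper, not an official API; it assumes the standard WarmupDecayLR scheduler block and that you compute the step count the same way the Trainer would:

```python
import json
import math

# Hypothetical workaround: pre-fill the scheduler's "auto" step counts in a
# DeepSpeed config dict, since trl may skip trainer_config_finalize for the
# reference model. Names and parameters here are illustrative assumptions.
def fill_scheduler_steps(ds_config, dataset_len, per_device_batch,
                         world_size, grad_accum, epochs, warmup_ratio=0.0):
    effective_batch = per_device_batch * world_size * grad_accum
    total = math.ceil(dataset_len / effective_batch) * epochs
    params = ds_config["scheduler"]["params"]
    if params.get("total_num_steps") == "auto":
        params["total_num_steps"] = total
    if params.get("warmup_num_steps") == "auto":
        params["warmup_num_steps"] = int(total * warmup_ratio)
    return ds_config

cfg = {"scheduler": {"type": "WarmupDecayLR",
                     "params": {"total_num_steps": "auto",
                                "warmup_num_steps": "auto"}}}
# 116 samples, batch 2, 1 GPU, no accumulation, 1 epoch -> 58 total steps
cfg = fill_scheduler_steps(cfg, 116, 2, 1, 1, 1)
print(cfg["scheduler"]["params"]["total_num_steps"])  # 58
```

You would then pass the patched dict (rather than the JSON file path) as the deepspeed argument, e.g. deepspeed=cfg, so trl never sees the unresolved "auto" values.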

@srcao-bingo @rexxxx1234, can you please provide the code for your TrainingArguments and Trainer, and the DeepSpeed config you are using? Are you using transformers' Trainer?
