
Failed to load universal_checkpoint with deepspeed integration #33157

Closed
4 tasks
Tracked by #33345
huyiwen opened this issue Aug 28, 2024 · 6 comments · Fixed by #35015

huyiwen commented Aug 28, 2024

System Info

  • transformers version: 4.44.2
  • Platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.17
  • Python version: 3.8.18
  • Huggingface_hub version: 0.24.6
  • Safetensors version: 0.4.4
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A800 80GB PCIe

Who can help?

@muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The Universal Checkpointing feature allows loading with different world sizes. However, when using the Hugging Face Trainer, the loading of the converted universal checkpoint fails.

The failure seems to be due to HfTrainerDeepSpeedConfig not correctly handling the "load_universal_checkpoint": true or "universal_checkpoint": true arguments in the DeepSpeed configuration. Consequently, the load_universal_checkpoint function returns False.
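To check which of these flags a parsed config actually contains, and where, something like the following may help. This is only a minimal sketch: `find_universal_flags` is a hypothetical helper of my own, not part of transformers or DeepSpeed, and it only inspects the config dict. It does not say which key the DeepSpeed engine reads.

```python
from typing import Any, Dict, List, Tuple


def find_universal_flags(config: Dict[str, Any], prefix: str = "") -> List[Tuple[str, Any]]:
    """Return (dotted_path, value) for every key whose name mentions 'universal'.

    Hypothetical debugging helper: it walks nested dicts only and makes no
    claim about which of the reported keys DeepSpeed honours.
    """
    found: List[Tuple[str, Any]] = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if "universal" in key.lower():
            found.append((path, value))
        if isinstance(value, dict):
            # Recurse into nested sections such as "zero_optimization".
            found.extend(find_universal_flags(value, path))
    return found


# Trimmed-down config using the top-level key named in this issue.
ds_config = {
    "zero_optimization": {"stage": 2},
    "load_universal_checkpoint": True,
}
flags = find_universal_flags(ds_config)
print(flags)
```

Comparing this list with what the engine reports after initialization should show whether the flag is dropped somewhere between the JSON file and the engine.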

Related Issues:

Expected behavior

Universal checkpoint should be loaded correctly.

@huyiwen huyiwen added the bug label Aug 28, 2024

huyiwen commented Aug 30, 2024

Here's my deepspeed config json:

{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 1e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 16,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "no_pipeline_parallel": true,
  "load_universal_checkpoint": true
}


huyiwen commented Aug 30, 2024

Another related issue: microsoft/DeepSpeed#5405


huyiwen commented Sep 7, 2024

Hello @ArthurZucker and @muellerzr. I can open a pull request to address this issue. I worked around it by deleting all the “rng_state” files, since they were saved with a different world size.

Before I start with the PR, I would like to ensure that NOT loading these “rng_state” files does not have any side-effects.


huyiwen commented Sep 7, 2024

We can skip loading these rng_state files and emit a warning instead.
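A rough sketch of that idea, not the actual PR code: it assumes the checkpoint folder holds either `rng_state.pth` (single process) or one `rng_state_<rank>.pth` per rank, which is the naming I see in my Trainer checkpoints. `saved_rng_world_size` and `should_load_rng_state` are hypothetical names used only for illustration.

```python
import os
import re
import tempfile
import warnings

# Assumed naming: one rng_state_<rank>.pth file per process.
_RANK_FILE = re.compile(r"rng_state_(\d+)\.pth")


def saved_rng_world_size(checkpoint_dir: str) -> int:
    """Guess the world size a checkpoint was saved with from its RNG files."""
    per_rank = [f for f in os.listdir(checkpoint_dir) if _RANK_FILE.fullmatch(f)]
    if per_rank:
        return len(per_rank)
    if os.path.isfile(os.path.join(checkpoint_dir, "rng_state.pth")):
        return 1
    return 0  # no RNG state saved at all


def should_load_rng_state(checkpoint_dir: str, world_size: int) -> bool:
    """Return False (with a warning) when the saved RNG files do not match."""
    saved = saved_rng_world_size(checkpoint_dir)
    if saved != world_size:
        warnings.warn(
            f"Found RNG state for {saved} process(es) but running with "
            f"{world_size}; skipping RNG state, so resuming is not exactly reproducible."
        )
        return False
    return True


# Usage: a fake checkpoint saved with 4 ranks, resumed with 4 and then 2.
with tempfile.TemporaryDirectory() as ckpt:
    for rank in range(4):
        open(os.path.join(ckpt, f"rng_state_{rank}.pth"), "w").close()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        same_size = should_load_rng_state(ckpt, world_size=4)
        other_size = should_load_rng_state(ckpt, world_size=2)
    n_warnings = len(caught)
```

The side effect of skipping is that random state (data shuffling, dropout) after resuming will differ from an uninterrupted run, which is why the sketch warns rather than staying silent.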

@ArthurZucker
Collaborator

Sure, feel free to open a PR!


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jan 5, 2025