
Support modules_to_save config option when using DeepSpeed ZeRO-3 with ZeRO init enabled. #1450

Merged: pacman100 merged 4 commits into main from smangrul/fix-modules-to-save-for-ds-z3-init on Feb 9, 2024

Conversation

@pacman100 (Contributor) commented on Feb 9, 2024

What does this PR do?

  1. When using DeepSpeed ZeRO Stage-3 with ZeRO init enabled, the deepcopy performed in the ModulesToSaveWrapper class doesn't work as expected and creates a new module with 0 parameters. As a result, training fails with the following error (traceback abridged, the two ranks' outputs de-interleaved; a minimal illustration of the failure mode appears after the launch command below):

      return self.modules_to_save[self.active_adapter](*args, **kwargs)
    File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    ...
      assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
  AssertionError: {'id': 249, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {451}, 'ds_tensor.shape': torch.Size([0])}

  2. This PR resolves the issue: the parameters of the modules specified in the modules_to_save config option are gathered across processes using deepspeed.zero.GatheredParameters before the copy is made (a sketch of this pattern appears at the end of this description).
  3. Example tested: https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py with the following changes (task_type="SEQ_CLS" makes get_peft_model wrap the classification head via modules_to_save, which is exactly the code path this PR fixes):
...
+ from peft import get_peft_model, LoraConfig
...
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
+ config = LoraConfig(r=8, lora_alpha=16, task_type="SEQ_CLS")
+ model = get_peft_model(model, config)
+ print(model)
...

- config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": 16}
+ config = {"lr": 2e-4, "num_epochs": 10, "seed": 42, "batch_size": 16}

Without this PR, the error above is raised; with this PR, fine-tuning completes successfully:

epoch 7: {'accuracy': 0.8480392156862745, 'f1': 0.8945578231292517}
[2024-02-09 12:26:16,228] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
epoch 8: {'accuracy': 0.8504901960784313, 'f1': 0.8957264957264958}
/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/autograd/__init__.py:266: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
epoch 9: {'accuracy': 0.8602941176470589, 'f1': 0.902229845626072}

launch command:

accelerate launch --use_deepspeed --num_processes=2 --zero_stage=3 --zero3_init_flag=True --zero3_save_16bit_model=True --gradient_accumulation_steps=1 --gradient_clipping=1 --mixed_precision=fp16 nlp_example.py --mixed_precision fp16
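
For context on why the plain deepcopy fails (point 1 above): under deepspeed.zero.Init, ZeRO-3 partitions each parameter at construction time and replaces the local tensor with an empty placeholder, keeping the real size in DeepSpeed-specific attributes such as ds_numel. The snippet below is a hypothetical minimal illustration, not code from this PR; it assumes a working DeepSpeed install and a distributed launcher such as the accelerate command above, and the printed values are what one would expect rather than captured output.

import copy

import deepspeed
import torch

# Under zero.Init, parameters are partitioned as soon as the module is
# constructed; the local tensor becomes a 0-element placeholder.
with deepspeed.zero.Init():
    linear = torch.nn.Linear(768, 2)

print(linear.weight.shape)                    # torch.Size([0]) -- placeholder only
print(getattr(linear.weight, "ds_numel", 0))  # 1536 -- true element count (768 * 2)

# deepcopy clones the empty placeholders without registering the copy with
# the DeepSpeed runtime, so its parameters can never be gathered -- hence
# the AssertionError with 'numel': 0, 'ds_numel': 0 shown above.
clone = copy.deepcopy(linear)
print(clone.weight.shape)                     # torch.Size([0])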

Fixes: huggingface/transformers#24445 (comment)
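
For reference, the gathering pattern described in point 2 can be sketched as follows. This is a simplified illustration of the idea, not the exact diff merged into ModulesToSaveWrapper (the real change lives in peft's other.py, per the commit list below); the numel/ds_numel check used here to detect ZeRO-3 partitioned parameters is an assumption on my part.

import copy
from contextlib import nullcontext

import torch

def copy_module_zero3_safe(module: torch.nn.Module) -> torch.nn.Module:
    # Sketch: deepcopy a module, first gathering ZeRO-3 partitioned
    # parameters so the copy receives real weights on every rank.
    context = nullcontext()
    for param in module.parameters():
        # Partitioned parameters appear locally as 0-element tensors that
        # carry a nonzero ds_numel attribute (assumed detection heuristic).
        if param.numel() == 0 and getattr(param, "ds_numel", 0) > 0:
            import deepspeed

            # modifier_rank=None: gather for read-only access; no rank's
            # modifications are broadcast back to the partitions.
            context = deepspeed.zero.GatheredParameters(
                list(module.parameters()), modifier_rank=None
            )
            break
    with context:
        # Inside GatheredParameters the full parameters are materialized,
        # so deepcopy produces a module with real (nonzero) weights.
        return copy.deepcopy(module)

With this in place, modules listed in modules_to_save get fully materialized copies on every rank instead of 0-parameter shells.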


@BenjaminBossan (Member) left a comment:

Thanks for making modules_to_save work with DeepSpeed ZeRO-3. LGTM.

pacman100 marked this pull request as ready for review on February 9, 2024 12:12
pacman100 merged commit a1c472f into main on Feb 9, 2024
14 checks passed
pacman100 deleted the smangrul/fix-modules-to-save-for-ds-z3-init branch on February 20, 2024 05:46
BenjaminBossan pushed a commit to BenjaminBossan/peft that referenced this pull request Mar 14, 2024
Support modules_to_save config option when using DeepSpeed ZeRO-3 with ZeRO init enabled. (huggingface#1450)

* Update other.py

* Update other.py

* fix quality

* Update other.py
Successfully merging this pull request may close these issues:

LoRA is incompatible with DeepSpeed ZeRO3