
Skip saving frozen parameters if using peft model with deepspeed #26503

Conversation

@VeryLazyBoy (Contributor) commented Sep 30, 2023

What does this PR do?

Currently, when using a PeftModel, transformers only saves the adapter weights and supports resuming training from these saved weights. However, if DeepSpeed is used on top of a PeftModel, the entire model's weights are saved. This behavior differs from that of a plain PeftModel.

This PR integrates the newly added DeepSpeed parameter exclude_frozen_weights to skip saving frozen weights when using PEFT.
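
For context, here is a minimal sketch of the intended save path. This is not the actual PR diff; the flag is written here as exclude_frozen_parameters, the name recent DeepSpeed releases use on DeepSpeedEngine.save_checkpoint, so treat the exact keyword as an assumption:

    # Sketch only: skip the frozen base-model weights when checkpointing a PEFT
    # model under DeepSpeed. Assumes DeepSpeedEngine.save_checkpoint accepts an
    # `exclude_frozen_parameters` flag (recent DeepSpeed releases).
    from peft import PeftModel

    def save_deepspeed_checkpoint(deepspeed_engine, output_dir):
        # In a PeftModel only the adapter weights are trainable, so the frozen
        # base-model weights can be left out of the DeepSpeed checkpoint.
        is_peft = isinstance(deepspeed_engine.module, PeftModel)
        deepspeed_engine.save_checkpoint(output_dir, exclude_frozen_parameters=is_peft)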

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@pacman100

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@amyeroberts (Collaborator)

Gentle ping @pacman100

@huggingface deleted a comment from github-actions bot on Nov 24, 2023

@amyeroberts (Collaborator)

I believe this is resolved in #27825

@VeryLazyBoy (Contributor, Author)

@amyeroberts Hi Amy, you are right. We implemented a similar solution to save only a portion of the weights. However, I'm afraid an issue could arise when attempting to load the weights again, unless the load_module_strict parameter is set to False, as demonstrated in this pull request.

# @@ def deepspeed_load_checkpoint(deepspeed_engine, checkpoint_path):
if len(deepspeed_checkpoint_dirs) > 0:
        logger.info(f"Attempting to resume from {checkpoint_path}")
+
+       load_module_strict = True
+       if version.parse(deepspeed_version) > version.parse("0.10.0"):
+           if is_peft_available() and isinstance(deepspeed_engine.module, PeftModel):
+               load_module_strict = False
        # this magically updates self.optimizer and self.lr_scheduler
        load_path, _ = deepspeed_engine.load_checkpoint(
-           checkpoint_path, load_optimizer_states=True, load_lr_scheduler_states=True
+           checkpoint_path,
+           load_optimizer_states=True,
+           load_lr_scheduler_states=True,
+           load_module_strict=load_module_strict,
        )
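
As a standalone illustration (plain PyTorch, not the Trainer code path), strict loading of a partial state dict fails with missing-key errors, while non-strict loading merely reports the keys it could not find and leaves those weights at their in-memory values:

    # Minimal sketch of why strict loading fails when only a subset of the
    # weights (e.g. the adapter) was checkpointed.
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
    # Pretend only the second layer (standing in for the adapter) was saved.
    partial_sd = {k: v for k, v in model.state_dict().items() if k.startswith("1.")}

    # model.load_state_dict(partial_sd)  # strict=True -> RuntimeError: Missing key(s) "0.weight", "0.bias"
    result = model.load_state_dict(partial_sd, strict=False)
    print(result.missing_keys)  # ['0.weight', '0.bias'] -- tolerated, not an error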

@amyeroberts (Collaborator)

@VeryLazyBoy Thanks for pointing this out! @muellerzr @pacman100 could you give this a first review?

@muellerzr requested a review from pacman100 on December 20, 2023
@chiragjn

I think I am running into this issue when resuming:

 File "/data/chirag/llm-finetune/train.py", line 883, in _train
    trainer.train(resume_from_checkpoint=last_checkpoint_dir)
  File "/data/v/ft/lib/python3.10/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/data/v/ft/lib/python3.10/site-packages/transformers/trainer.py", line 1699, in _inner_training_loop
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
  File "/data/v/ft/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 402, in deepspeed_load_checkpoint
    load_path, _ = deepspeed_engine.load_checkpoint(
  File "/data/v/ft/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2725, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/data/v/ft/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2795, in _load_checkpoint
    self.load_module_state_dict(checkpoint=checkpoint,
  File "/data/v/ft/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2588, in load_module_state_dict
    self.module.load_state_dict(
  File "/data/v/ft/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
        Missing key(s) in state_dict: "base_model.model.model.embed_tokens.weight", "base_model.model.model.layers.0.self_attn.q_proj.weight", ...

The load_module_strict=False change is necessary here to load the PEFT checkpoint and resume. I tested it ad hoc on top of main.
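
For reference, a sketch of that ad-hoc change (assuming a DeepSpeed version whose DeepSpeedEngine.load_checkpoint accepts load_module_strict, and with deepspeed_engine / resume_from_checkpoint as in the traceback above):

    # Sketch only: resume with relaxed module loading so the adapter-only
    # checkpoint does not trip the strict missing-key check.
    load_path, _ = deepspeed_engine.load_checkpoint(
        resume_from_checkpoint,
        load_optimizer_states=True,
        load_lr_scheduler_states=True,
        load_module_strict=False,  # checkpoint holds only the trainable adapter weights
    )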

@kazemf78

> @amyeroberts Hi Amy, you are right. We implemented a similar solution to save only a portion of the weights. However, I'm afraid an issue could arise when attempting to load the weights again, unless the load_module_strict parameter is set to False, as demonstrated in this pull request.

I encountered the same issue previously. Moreover, when resuming from a checkpoint in DeepSpeed (with the DeepSpeedEngine.load_checkpoint you mentioned), some objects such as the lr_scheduler are instantiated from scratch, which means the LR schedule does not in fact resume and instead starts over, because DeepSpeed does not save the lr_scheduler at all. This is because DeepSpeed does not support the default lr_scheduler type used in Transformers, i.e. LambdaLR.

Currently, Transformers delegates checkpointing to DeepSpeed when it is enabled, and in that scenario the lr_scheduler is neither saved to nor loaded from checkpoints. I tweaked Transformers a bit to handle this scenario with minimal changes in this commit.
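
The general idea of such a workaround (this is not the linked commit; the file and helper names below are hypothetical) is to persist the LambdaLR state next to the DeepSpeed checkpoint and restore it on resume:

    # Sketch only: save/restore the LambdaLR state outside DeepSpeed so the LR
    # schedule resumes instead of restarting. Names are illustrative.
    import os
    import torch

    SCHEDULER_FILE = "scheduler.pt"  # hypothetical file name for this sketch

    def save_scheduler(lr_scheduler, checkpoint_dir):
        torch.save(lr_scheduler.state_dict(), os.path.join(checkpoint_dir, SCHEDULER_FILE))

    def load_scheduler(lr_scheduler, checkpoint_dir):
        path = os.path.join(checkpoint_dir, SCHEDULER_FILE)
        if os.path.isfile(path):
            lr_scheduler.load_state_dict(torch.load(path))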

@VeryLazyBoy (Contributor, Author)

> DeepSpeed does not save the lr_scheduler at all. This is because DeepSpeed does not support the default lr_scheduler type used in Transformers, i.e. LambdaLR.

Hi @kazemf78, have you checked #25863? By the way, the issue you mention is not related to this PR; if there is indeed a bug, you can create a separate issue or PR :)

@chiragjn commented Jan 15, 2024

The load_module_strict change is required for resuming to work correctly. @amyeroberts, can we please get that merged?
Is the blocker that part of this PR's work has already been merged elsewhere? Should we open another PR quickly?

@amyeroberts (Collaborator)

cc @pacman100 for first review

@huggingface deleted a comment from github-actions bot on Feb 9, 2024
@pacman100 (Contributor) commented Feb 26, 2024

Hello, thank you, @VeryLazyBoy, for your PR. I'm extremely sorry that this PR got missed. However, this issue is fixed in #28746.

@VeryLazyBoy (Contributor, Author)

Thank you for providing this information. I am pleased to hear that the issue has been resolved. @pacman100
