
Skip saving frozen parameters if using peft model with deepspeed #26503

Conversation

@VeryLazyBoy (Contributor) commented Sep 30, 2023

What does this PR do?

Currently, when using a PeftModel, transformers only saves the adapter weights and supports resuming training from these saved weights. However, if DeepSpeed is used on top of a PeftModel, the entire model's weights are saved. This behavior differs from that of a plain PeftModel.

This PR integrates the newly added DeepSpeed parameter exclude_frozen_weights to skip saving frozen weights when using PEFT.
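
For context, here is a minimal sketch of the intended save path. This is not the actual PR diff; the flag is written here as exclude_frozen_parameters, the name recent DeepSpeed releases use on DeepSpeedEngine.save_checkpoint, so treat the exact keyword as an assumption:

    # Sketch only: skip the frozen base-model weights when checkpointing a PEFT
    # model under DeepSpeed. Assumes DeepSpeedEngine.save_checkpoint accepts an
    # `exclude_frozen_parameters` flag (recent DeepSpeed releases).
    from peft import PeftModel

    def save_deepspeed_checkpoint(deepspeed_engine, output_dir):
        # In a PeftModel only the adapter weights are trainable, so the frozen
        # base-model weights can be left out of the DeepSpeed checkpoint.
        is_peft = isinstance(deepspeed_engine.module, PeftModel)
        deepspeed_engine.save_checkpoint(output_dir, exclude_frozen_parameters=is_peft)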

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@pacman100

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@amyeroberts (Collaborator)

Gentle ping @pacman100

@huggingface deleted a comment from github-actions bot on Nov 24, 2023

@amyeroberts (Collaborator)

I believe this is resolved in #27825

@VeryLazyBoy (Contributor, Author)

@amyeroberts Hi Amy, you are right. We implemented a similar solution to save only a portion of the weights. However, I'm afraid an issue could arise when attempting to load the weights again, unless the load_module_strict parameter is set to False, as demonstrated in this pull request.

# @@ def deepspeed_load_checkpoint(deepspeed_engine, checkpoint_path):
if len(deepspeed_checkpoint_dirs) > 0:
        logger.info(f"Attempting to resume from {checkpoint_path}")
+
+       load_module_strict = True
+       if version.parse(deepspeed_version) > version.parse("0.10.0"):
+           if is_peft_available() and isinstance(deepspeed_engine.module, PeftModel):
+               load_module_strict = False
        # this magically updates self.optimizer and self.lr_scheduler
        load_path, _ = deepspeed_engine.load_checkpoint(
-           checkpoint_path, load_optimizer_states=True, load_lr_scheduler_states=True
+           checkpoint_path,
+           load_optimizer_states=True,
+           load_lr_scheduler_states=True,
+           load_module_strict=load_module_strict,
        )
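
As a standalone illustration (plain PyTorch, not the Trainer code path), strict loading of a partial state dict fails with missing-key errors, while non-strict loading merely reports the keys it could not find and leaves those weights at their in-memory values:

    # Minimal sketch of why strict loading fails when only a subset of the
    # weights (e.g. the adapter) was checkpointed.
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
    # Pretend only the second layer (standing in for the adapter) was saved.
    partial_sd = {k: v for k, v in model.state_dict().items() if k.startswith("1.")}

    # model.load_state_dict(partial_sd)  # strict=True -> RuntimeError: Missing key(s) "0.weight", "0.bias"
    result = model.load_state_dict(partial_sd, strict=False)
    print(result.missing_keys)  # ['0.weight', '0.bias'] -- tolerated, not an error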

@amyeroberts (Collaborator)

@VeryLazyBoy Thanks for pointing this out! @muellerzr @pacman100 could you give this a first review?

@muellerzr requested a review from pacman100 on December 20, 2023
@chiragjn

I think I am running into this issue when resuming:

 File "/data/chirag/llm-finetune/train.py", line 883, in _train
    trainer.train(resume_from_checkpoint=last_checkpoint_dir)
  File "/data/v/ft/lib/python3.10/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/data/v/ft/lib/python3.10/site-packages/transformers/trainer.py", line 1699, in _inner_training_loop
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
  File "/data/v/ft/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 402, in deepspeed_load_checkpoint
    load_path, _ = deepspeed_engine.load_checkpoint(
  File "/data/v/ft/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2725, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/data/v/ft/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2795, in _load_checkpoint
    self.load_module_state_dict(checkpoint=checkpoint,
  File "/data/v/ft/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2588, in load_module_state_dict
    self.module.load_state_dict(
  File "/data/v/ft/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
        Missing key(s) in state_dict: "base_model.model.model.embed_tokens.weight", "base_model.model.model.layers.0.self_attn.q_proj.weight", ...

The load_module_strict=False change is necessary here to load the PEFT checkpoint and resume. I tested it ad hoc on top of main.
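
For reference, a sketch of that ad-hoc change (assuming a DeepSpeed version whose DeepSpeedEngine.load_checkpoint accepts load_module_strict, and with deepspeed_engine / resume_from_checkpoint as in the traceback above):

    # Sketch only: resume with relaxed module loading so the adapter-only
    # checkpoint does not trip the strict missing-key check.
    load_path, _ = deepspeed_engine.load_checkpoint(
        resume_from_checkpoint,
        load_optimizer_states=True,
        load_lr_scheduler_states=True,
        load_module_strict=False,  # checkpoint holds only the trainable adapter weights
    )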

@kazemf78

> @amyeroberts Hi Amy, you are right. We implemented a similar solution to save only a portion of the weights. However, I'm afraid an issue could arise when attempting to load the weights again, unless the load_module_strict parameter is set to False, as demonstrated in this pull request.

I encountered the same issue previously. Moreover, when resuming from a checkpoint in DeepSpeed (with the DeepSpeedEngine.load_checkpoint you mentioned), some objects such as the lr_scheduler are instantiated from scratch, which means the LR schedule does not in fact resume and instead starts over, because DeepSpeed does not save the lr_scheduler at all. This is because DeepSpeed does not support the default lr_scheduler type used in Transformers, i.e. LambdaLR.

Currently, Transformers delegates checkpointing to DeepSpeed when it is enabled, and in that scenario the lr_scheduler is neither saved to nor loaded from checkpoints. I tweaked Transformers a bit to handle this scenario with minimal changes in this commit.
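
The general idea of such a workaround (this is not the linked commit; the file and helper names below are hypothetical) is to persist the LambdaLR state next to the DeepSpeed checkpoint and restore it on resume:

    # Sketch only: save/restore the LambdaLR state outside DeepSpeed so the LR
    # schedule resumes instead of restarting. Names are illustrative.
    import os
    import torch

    SCHEDULER_FILE = "scheduler.pt"  # hypothetical file name for this sketch

    def save_scheduler(lr_scheduler, checkpoint_dir):
        torch.save(lr_scheduler.state_dict(), os.path.join(checkpoint_dir, SCHEDULER_FILE))

    def load_scheduler(lr_scheduler, checkpoint_dir):
        path = os.path.join(checkpoint_dir, SCHEDULER_FILE)
        if os.path.isfile(path):
            lr_scheduler.load_state_dict(torch.load(path))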

@VeryLazyBoy (Contributor, Author)

> DeepSpeed does not save the lr_scheduler at all. This is because DeepSpeed does not support the default lr_scheduler type used in Transformers, i.e. LambdaLR.

Hi @kazemf78, have you checked #25863? By the way, the issue you mention is not related to this PR; if there is indeed a bug, you can create a separate issue or PR :)

@chiragjn commented Jan 15, 2024

The load_module_strict change is required for resuming to work correctly. @amyeroberts, can we please get that merged?
Is the blocker that part of this PR's work has already been merged elsewhere? Should we open another PR quickly?

@amyeroberts (Collaborator)

cc @pacman100 for first review

@huggingface deleted a comment from github-actions bot on Feb 9, 2024
@pacman100 (Contributor) commented Feb 26, 2024

Hello, thank you, @VeryLazyBoy, for your PR. I'm extremely sorry that this PR got missed. However, this issue is fixed in #28746.

@VeryLazyBoy (Contributor, Author)

Thank you for providing this information. I am pleased to hear that the issue has been resolved. @pacman100
