Support resuming of deepspeed + Lora + offloading #29015
Conversation
Thanks, in principle this looks good. It might not be ideal to change the default value of `load_module_strict` for the public method, though, as other users might be relying on it.
On the DS side, I'll let @pacman100 comment on the PR 🙏
```diff
@@ -414,7 +414,7 @@ def deepspeed_init(trainer, num_training_steps, inference=False):
     return optimizer, lr_scheduler


-def deepspeed_load_checkpoint(deepspeed_engine, checkpoint_path, load_module_strict=True):
+def deepspeed_load_checkpoint(deepspeed_engine, checkpoint_path, load_module_strict=False):
```
Can we somehow revert this and just force-set it to True in our trainer?
That's a fair point! I'll push a change now
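For reference, one way of keeping the public default untouched and only relaxing strictness from the Trainer could look like the sketch below (the `resume_in_trainer` wrapper and the `model_is_peft` flag are illustrative assumptions, not the actual pushed change):

```python
# Sketch only: keep the public helper strict by default and let the Trainer pass the
# relaxed value explicitly when resuming an adapter-only (PEFT/LoRA) checkpoint.

def deepspeed_load_checkpoint(deepspeed_engine, checkpoint_path, load_module_strict=True):
    """Public helper: the default stays True so existing external callers keep strict loading."""
    deepspeed_engine.load_checkpoint(
        checkpoint_path,
        load_module_strict=load_module_strict,
        load_optimizer_states=True,
        load_lr_scheduler_states=True,
    )

def resume_in_trainer(engine, checkpoint_path, model_is_peft: bool):
    # Call-site override instead of a changed default: adapter-only checkpoints
    # do not contain every base-model weight, so strict loading must be relaxed.
    deepspeed_load_checkpoint(engine, checkpoint_path, load_module_strict=not model_is_peft)
```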
Could you please provide any updates on this PR?
Sure @thepowerfuldeez!
Hello, this has already been fixed in #28746. I ran experiments today and can confirm that resuming training works when using PEFT+DeepSpeed.
```python
# deepspeed ckpt loading
if resume_from_checkpoint is not None and self.is_deepspeed_enabled:
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint, load_module_strict=False)
```
This is already happening a couple of lines above, at line 1732.
```python
if resume_from_checkpoint is not None and self.is_deepspeed_enabled:
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint, load_module_strict=False)
    if self.args.deepspeed_force_lr_scheduler_checkpointing and self.model_wrapped.lr_scheduler is None:
        if os.path.isfile(os.path.join(resume_from_checkpoint, SCHEDULER_NAME)):
```
Loading the scheduler is handled in `_load_optimizer_and_scheduler`, which is a couple of lines below.
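For context, the scheduler-restore path being referred to looks roughly like the sketch below (a simplified illustration, not the actual `_load_optimizer_and_scheduler` implementation; `SCHEDULER_NAME` follows the constant referenced in the diff above):

```python
import os

import torch

SCHEDULER_NAME = "scheduler.pt"  # file name used for the saved LR scheduler state

def restore_lr_scheduler(lr_scheduler, checkpoint_dir):
    """Reload the LR scheduler state saved alongside a checkpoint, if present."""
    scheduler_path = os.path.join(checkpoint_dir, SCHEDULER_NAME)
    if os.path.isfile(scheduler_path):
        lr_scheduler.load_state_dict(torch.load(scheduler_path, map_location="cpu"))
```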
```diff
@@ -1316,6 +1316,18 @@ class TrainingArguments:
             "help": "Activates neftune noise embeddings into the model. NEFTune has been proven to drastically improve model performances for instrcution fine-tuning. Check out the original paper here: https://arxiv.org/abs/2310.05914 and the original code here: https://github.com/neelsjain/NEFTune. Only supported for `PreTrainedModel` and `PeftModel` classes."
         },
     )
+    deepspeed_force_lr_scheduler_checkpointing: bool = field(
```
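The field definition is truncated in the diff above; for illustration only, a `TrainingArguments`-style flag of this shape is usually declared along the following lines (the default and help text here are guesses, not the PR's actual values):

```python
from dataclasses import dataclass, field

@dataclass
class SketchArguments:
    # Illustrative only — the real definition is the (truncated) field in the diff above.
    deepspeed_force_lr_scheduler_checkpointing: bool = field(
        default=False,
        metadata={"help": "Force saving/loading the LR scheduler state when resuming with DeepSpeed."},
    )
```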
```python
if self.is_deepspeed_enabled:
    # under zero3 model file itself doesn't get saved since it's bogus! Unless deepspeed
    # config `stage3_gather_16bit_weights_on_model_save` is True
    self.model_wrapped.save_checkpoint(staging_output_dir)
```
This already happens in `self.save_model(staging_output_dir, _internal_call=True)`.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I'd like to bump this; I'm running into this issue in my project and it's causing significant delays. Does this branch solve the issue but isn't tested enough to be merged into main, or is a solution still to be found?
Hi @ambroser53! I haven't tested this branch against upstream, but it should work.
Please see my comments above; the transformers PR #28746 should already fix this.
This PR is an upstream version of @kazemf78's PR to support resuming LoRA training when using DeepSpeed.
Without setting `load_module_strict=False` as the default, the checkpoint is not loaded, because a LoRA checkpoint does not contain all of the model's weights, and resuming throws an error: `deepspeed resume Error(s) in loading state_dict for PeftModelForCausalLM`.
Related discussion: huggingface/peft#746
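A minimal, self-contained illustration of the failure mode (plain PyTorch toy modules, not the actual transformers/DeepSpeed code): when a checkpoint contains only LoRA-style adapter weights, strict loading raises the missing-keys error quoted above, while non-strict loading (the analogue of `load_module_strict=False`) applies the adapter weights and leaves the already-initialized base weights untouched.

```python
import torch.nn as nn

class ToyBase(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 8)

class ToyWithLora(ToyBase):
    def __init__(self):
        super().__init__()
        self.lora_A = nn.Linear(8, 2, bias=False)
        self.lora_B = nn.Linear(2, 8, bias=False)

model = ToyWithLora()
# Simulate an adapter-only checkpoint: only the LoRA weights are present.
adapter_only_state = {k: v for k, v in model.state_dict().items() if "lora_" in k}

try:
    model.load_state_dict(adapter_only_state, strict=True)
except RuntimeError as err:
    print("strict=True fails:", err)  # Error(s) in loading state_dict ... Missing key(s)

# strict=False loads the weights that are present and skips the missing base weights.
missing, unexpected = model.load_state_dict(adapter_only_state, strict=False)
print("loaded with strict=False; missing base keys:", missing)
```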