All layer weights not getting saved when using accelerate + FSDP for OPT 2.7B #1054
Comments
cc @pacman100 |
Hello @lvnair3, could you make the following change to the unwrapped_model.save_pretrained call:
unwrapped_model.save_pretrained(
    args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save,
+   state_dict=accelerator.get_state_dict(model)
) |
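For anyone applying the same change in another script, the suggestion above can be wrapped in a small helper. This is only a sketch: `save_full_checkpoint` is a hypothetical name, not part of Accelerate or Transformers, and the `wait_for_everyone()` call is an added assumption about where saving happens (after training has finished on all processes). The Accelerate calls are the ones from the snippet above.

```python
def save_full_checkpoint(accelerator, model, output_dir):
    """Save a complete checkpoint for a model prepared by Accelerate (e.g. with FSDP).

    `accelerator` is expected to be an `accelerate.Accelerator` and `model`
    the object returned by `accelerator.prepare(...)`.
    """
    # Assumption: wait until every process has reached the save point.
    accelerator.wait_for_everyone()

    # Called on every process, exactly as in the suggested change above,
    # so that the full (unsharded) state dict is available for saving.
    state_dict = accelerator.get_state_dict(model)

    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        state_dict=state_dict,  # without this, only part of the weights may be written
    )
```

In the example scripts this would replace the existing `unwrapped_model.save_pretrained(...)` call, e.g. `save_full_checkpoint(accelerator, model, args.output_dir)`.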
@pacman100 That worked! Thank you! |
Great! I will document this in the Accelerate docs and in the interactive code explorer tool here https://huggingface.co/docs/accelerate/usage_guides/explore to prevent others from facing this issue. Feel free to close this issue 😄 |
Saving is broken in other scripts too, and the proposed change fixes them, for example run_summarization_no_trainer.py. I could make a PR, but I don't have permission. |
@valentas-kurauskas you need to fork the repo and open a PR from your fork to do so. (And it would be welcome!) |
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
SETUP
Task: Finetuning facebook/opt-2.7b for Wikitext2
Code is: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm_no_trainer.py
Dataset is: Wikitext2
The command used to run this is as follows:
ERROR
Error occurs when loading the resultant checkpoint into the evaluation code here: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py
The command used to perform evaluation is:
Error message is that certain weights were not initialized:
The sharded model checkpoint folder has the following pytorch_model.bin.index.json file:
Expected behavior
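To check whether a sharded checkpoint actually contains every layer, one option is to read the `weight_map` in `pytorch_model.bin.index.json` and list the layer indices that have no entry. The sketch below is illustrative only: `missing_decoder_layers` is a hypothetical helper, the default pattern assumes parameter names of the form `...layers.<index>....`, and `num_layers` has to be supplied by the caller (for example from `num_hidden_layers` in the model's `config.json`).

```python
import json
import re


def missing_decoder_layers(index_path, num_layers, pattern=r"\.layers\.(\d+)\."):
    """Return the sorted layer indices that have no weights in the index file."""
    with open(index_path) as f:
        # The index file maps each parameter name to the shard file that stores it.
        weight_map = json.load(f)["weight_map"]

    present = set()
    for name in weight_map:
        match = re.search(pattern, name)
        if match:
            present.add(int(match.group(1)))

    return sorted(set(range(num_layers)) - present)
```

An empty list means every layer index from 0 to `num_layers - 1` has at least one parameter in the checkpoint; it does not prove that each layer is complete.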