Let Huggingface Properly Initialize Arguments, and Fix FSDP-LORA Checkpoint-Saves and Resumption #53
Conversation
Force-pushed from 809fb2f to 7196479
I noticed that any changes made here might also need to be reflected in https://github.ibm.com/ai-foundation/sft-trainer-image/blob/main/launch_training.py.
- Left some questions and comments.
- The DCO check is failing.
- Now that we have linted and formatted our files in the main branch, the PR will need to be rebased on the latest main. You need to update your fork to the latest main and then update this branch with
git merge main
Sorry for the trouble, but the major upstream linting and formatting changes are done now, so it's a one-time pain :)
Thank you for the contribution.
Also, due to the lack of unit tests at the moment, please confirm that prompt tuning still works in a single-GPU environment with this branch. I know this PR is verified using multiple GPUs. Is it also verified in a single-GPU environment?
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
@Ssukriti I have rebased the changes and linted.
Yes, and I have verified that it works for:
Thanks a lot @fabianlim for thoroughly testing your PR!
I can't merge because of the failing pylint check, which seems to be a valid warning. Since this is your first commit, Fabian, I have to manually start the workflow checks (pylint etc.), which is why you didn't catch it earlier. Once this PR is merged and your fork is recognized as a contributor, the pylint checks will run automatically on your PRs going forward and you won't have to wait for an admin to start the checks :) To run pylint locally on your machine you can do:
Our contributing guides should be merged soon. I have approved the PR and will merge as soon as the pylint checks pass.
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
thank you!!!
…kpoint-Saves and Resumption (foundation-model-stack#53)

* training args should call post init to intialize all HF flags
* remove run_distribtued flag and peft_saving callback
* revert deletion of validation checks on some train args
* revert the addition of __post_init__ as it is actually not needed

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Co-authored-by: Sukriti Sharma <Ssukriti@users.noreply.github.com>
@raghukiran1224 @Ssukriti @anhuong I'm suggesting two fixes here:

1. The current strategy is to manually patch missing flags (such as the `xla_fsdp_v2` flag here) because `transformers.TrainingArguments.__post_init__` is not called. But this means the manual patches have to be constantly updated whenever the HF code changes, which is not ideal. The Huggingface trainer will properly initialize everything based on `TrainingArguments` (including gradient checkpointing); see the sketch after this list.
2. `PeftSavingCallback` is no longer the ideal patch now that "Support saving only PEFT adapter in checkpoints when using PEFT + FSDP" (huggingface/transformers#28297) has been merged: FSDP will now properly save and resume adapter checkpoints. This supersedes the `PeftSavingCallback` strategy, as `PeftSavingCallback` does not properly handle resumptions and different state-dict saving strategies, but the already-merged fix does. The `PEFTCallback` is removed in `trl` here.
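For illustration, here is a minimal hypothetical sketch of both points (this is not the code from this PR; the `ExtendedTrainingArguments` class, the `my_extra_option` field, and `peft_model` are made-up names). Because `transformers.TrainingArguments` is a dataclass, instantiating a dataclass subclass runs the inherited `__post_init__`, which derives the internal HF flags without manual patching; and once huggingface/transformers#28297 is in the installed transformers version, FSDP adapter checkpoints are saved and resumed by the stock `Trainer` rather than a custom callback.

```python
# Hypothetical sketch -- not the actual repository code.
import dataclasses

import transformers


@dataclasses.dataclass
class ExtendedTrainingArguments(transformers.TrainingArguments):
    # Project-specific extra field; the name is made up for illustration.
    my_extra_option: str = "default"


# Instantiating the dataclass subclass runs the inherited
# transformers.TrainingArguments.__post_init__, which derives the internal
# HF flags (distributed/device setup, precision, gradient-checkpointing
# handling, ...) from the user-facing fields. No per-flag manual patching
# is needed, so newly added upstream flags are picked up automatically
# when transformers is upgraded.
train_args = ExtendedTrainingArguments(
    output_dir="./out",
    gradient_checkpointing=True,
)
print(train_args.gradient_checkpointing)

# With huggingface/transformers#28297 merged, the stock Trainer saves the
# PEFT adapter in FSDP checkpoints and restores it on resume, so no custom
# PeftSavingCallback is required:
#
#   trainer = transformers.Trainer(model=peft_model, args=train_args, ...)
#   trainer.train(resume_from_checkpoint=True)
```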