fix: handle multiprocess properly in trainer checkpointing #27929
Conversation
Test failures are for "documentation" and "transformers metadata", same as last time (#27820 (comment)).
Thanks for the fix! Have you confirmed this works on a multi-GPU system?
Yes, that's detailed in the PR description, starting with the sentence: "I didn't set up a multi-GPU VM to run the test, ..." Also, if you agree with the TODO, I'm happy to make a follow-up PR addressing it 🙂
Sorry for missing it! I'll run it locally here and get back to you on whether the solution indeed works. If so, yes, a follow-up PR for that would be great :)
Thanks for fixing!
@muellerzr Once you've confirmed things work on your side, happy for it to be merged :)
Any update here? Thanks!
Any update here? Waiting for the PR merge.
I tried the changes from this PR, but I ran into another issue, as follows:
I wonder why we don't just check the existence of the folder, like this:
if os.path.exists(staging_output_dir):
    if self.args.should_save:
        self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
    if self.args.push_to_hub:
        self._push_from_checkpoint(staging_output_dir)
    # Place checkpoint in final location after all saving is finished.
    if staging_output_dir != output_dir:
        os.rename(staging_output_dir, output_dir)
This works smoothly.
# TODO: move out of function. This is not checkpointing, and in multi-device training
# involves coordination b/w processes.
This is checkpointing, as we modify self.state.total_flos, which gets saved as part of the checkpoint.
if self.hp_search_backend is None and trial is None:
    self.store_flos()
# Beyond this point, only a single writer should proceed.
Not necessarily: when doing things like FSDP or DeepSpeed we save on every worker, so this is not the right solution.
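(For context, a minimal sketch of why a single-writer gate breaks sharded saving; the direct use of torch.distributed and the shard file naming here are illustrative assumptions, not the Trainer's actual FSDP/DeepSpeed save path.)

```python
# Illustrative sketch only: with FSDP or DeepSpeed, each rank holds just a shard of
# the model/optimizer state, so every rank must write its own file. Gating the whole
# save on a single "should_save" process would silently drop the other shards.
import os

import torch
import torch.distributed as dist


def save_sharded_checkpoint(shard_state_dict: dict, output_dir: str) -> None:
    rank = dist.get_rank()
    os.makedirs(output_dir, exist_ok=True)
    # Every rank writes its own shard; none of them can be skipped.
    torch.save(shard_state_dict, os.path.join(output_dir, f"shard_rank{rank}.pt"))
    dist.barrier()  # wait until all shards exist before anyone uses the directory
```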
Damn, OK, I can attempt to make this safe for multiple writers.
Is there existing testing that captures FSDP and DeepSpeed multi-writer functionality?
@thundergolfer I have a different fix coming in that works; the issue is that you were not checking that the rename of the staging folder happens only on the main process: #28009
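(A minimal sketch of that gating, reusing the staging_output_dir/output_dir names from the snippet quoted earlier and assuming the Trainer's is_world_process_zero() helper; this is not the exact code from #28009.)

```python
# Sketch only: all processes may still write their own artifacts into the staging
# directory (e.g. FSDP/DeepSpeed shards), but only the main process performs the
# final rename, so the checkpoint directory is moved exactly once.
if staging_output_dir != output_dir and self.is_world_process_zero():
    os.rename(staging_output_dir, output_dir)
```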
Closing in favor of #28009, as this change still doesn't handle all multi-GPU scenarios.
What does this PR do?
Follow-up to #27820, which is bugged for multi-device/multiprocess training. I made the error of thinking that in multiprocess training the `._save_checkpoint()` method was already restricted to a single writer. I've fixed that now and augmented an existing multiprocess test to validate checkpointing functionality.
I've also noted with a `TODO` something I found pretty confusing in the current code. `store_flos()` isn't checkpointing related in my opinion, but it does an `all_gather`, and thus if all processes don't enter the `store_flos()` fn the training program hangs. In my opinion this code should be moved out of the checkpointing method, so that the method conceptually supports entrance and execution by a single writer (the process with `self.args.should_save == True`).
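To illustrate the hang described above, here is a simplified, hypothetical sketch using torch.distributed directly; the function names are made up and this is not the Trainer's actual `store_flos()` implementation:

```python
# Simplified illustration: collectives such as all_gather/all_reduce must be entered
# by every rank. If the collective only runs on the process with should_save == True,
# the other ranks never reach it and the saving rank blocks forever.
import torch
import torch.distributed as dist


def gather_total_flos(current_flos: float) -> float:
    flos = torch.tensor([current_flos])
    dist.all_reduce(flos)  # every rank must call this together
    return float(flos.item())


def save_checkpoint(state, should_save: bool) -> None:
    total = gather_total_flos(state.current_flos)  # outside any single-writer guard
    if should_save:
        ...  # only the designated writer persists the trainer state (incl. total)
```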
I didn't set up a multi-GPU VM to run the test, but this multi-GPU Modal script runs and passes the test:
Fixes #27925
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@muellerzr, @pacman100