-
Notifications
You must be signed in to change notification settings - Fork 27.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: non-atomic checkpoint save #27820
Conversation
Test failures appear unrelated to diff:
Also, the checkpoint loading code could arguably use more sophisticated checkpoint validation and automatic fallback in cases of validation failure, but that's a follow-up issue. |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @thundergolfer! Overall this looks good to me, but just to be sure on the tests can you try rebasing from main? (And push with --force
).
Thanks @muellerzr. Just rebased. Edit: Ok that revealed an issue! Are the CI tests fail fast? I didn't see test failures related to my diff before 🤔 |
@thundergolfer you should see them now. Looks like you may not be cleaning up the temp dirs right during testing? (I'd also just maybe remove the tempdir if possible after you're done needing it?) |
@muellerzr there's actually some test cases that reveal a behavior change in this diff. If a trainer run tries to reuse a checkpoint directory, with this diff that will now fail. I'm not sure how common this is in practice, but one of the test cases writes checkpoints For now I can adjust the change to make it compatible with non-empty destination checkpoint directories, and log a warning that the checkpoint will be non-atomic. The alternative of deleting everything in the destination directory is destructive and could break things for people. |
@muellerzr alright tests are passing now. There's still code quality CI steps failing because it doesn't find a |
Don't quite see the token issue, can you do |
You can also try rebasing again, as I assume this got fixed somewhere else |
Will rebase |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank a lot looks nice! 🤗
Hmm I'm getting an error running with deepspeed. Is this what #27929 is addressing?
|
What does this PR do?
transformers.trainer.Trainer
currently does not do atomic checkpointing. On checkpoint save, the code creates a checkpoint directory and then progressively adds more files to it. An exception can occur for various reasons in the middle of the checkpoint save process, leaving the checkpoint directory in an invalid state.Some examples that can result in partial checkpoint save:
On restore, the trainer just looks up the latest available checkpoint by directory name. It does not check for and ignore directories with partial checkpoints.
This change adjusts the
_save_checkpoint
method to do all writing into a 'temporary directory' which is renamed to its final name only after all save operations have succeeded.I don't put the temporary directory in
/tmp/
because that's sometimes a different filesystem mount to where checkpoints are saved and renaming in that situation producesOSError: [Errno 18] Invalid cross-device link
. I don't useshutil.move
because that's not atomic for directories.Fixes: # (didn't bother making one, just submitted PR :))
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@muellerzr, @pacman100