Fix bug with rotating checkpoints #28009
Conversation
Somewhat of an aside, but there's no guarantee that a previous writer has created the directory before this point (https://github.com/huggingface/transformers/pull/28009/files#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deR2379). I've seen recently that a process entering this function can skip past the save operations which would create the directory and arrive at this point before another process (the 'main') has had a chance to create the directory. Also, is
But in a multi-GPU scenario, multiple processes participate in disk writing against the checkpoint directory. P.S. Sorry for the bug! I didn't test my original change on multi-GPU.
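To illustrate the point above, one defensive pattern is for every rank that writes into the checkpoint directory to create it idempotently, so a slow main process cannot leave the other ranks with a missing path. This is only a sketch, not the Trainer's actual code; the function name and the `rng_state_*.pth` filename are illustrative:

```python
import os

import torch


def save_rank_state(output_dir: str, rank: int) -> None:
    # Every rank writes its own RNG state, so the directory has to exist for
    # all ranks, not just the main one. exist_ok=True keeps the call safe
    # when several processes race to create the same directory.
    os.makedirs(output_dir, exist_ok=True)
    torch.save(
        {"torch": torch.get_rng_state()},
        os.path.join(output_dir, f"rng_state_{rank}.pth"),
    )
```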
Thanks for fixing and adding a test!
cc @ArthurZucker who's preparing a patch release
@thundergolfer re:
Yes. You'll find that it's used sparingly during saving of the weights, but the internal check is that we're on process 0.
@thundergolfer re:
Can you give an example so I can contextualize the logic in the code to see where we need to fix/make the directory instead? My best guess is that you don't have a model? I can put an
It was happening in a multi-GPU scenario when using the axolotl framework. I figured it was because a non-main process was skipping all the model-saving steps and doing the unconditional save of state before the main process created the directory. I can look to contribute a failing test which would motivate the change 👍
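A minimal, self-contained way to reproduce that ordering might look like the following. This is purely illustrative and not an actual transformers test; the sleep stands in for the main process being busy saving weights, and the file name is made up:

```python
import multiprocessing as mp
import os
import tempfile
import time


def _worker(rank: int, output_dir: str) -> None:
    if rank == 0:
        # The "main" process is slow to create the checkpoint directory,
        # e.g. because it is busy serialising model weights first.
        time.sleep(1.0)
        os.makedirs(output_dir, exist_ok=True)
    else:
        # A non-main rank skips the weight-saving steps and immediately does
        # the unconditional state save -- the directory may not exist yet,
        # so this raises FileNotFoundError.
        with open(os.path.join(output_dir, f"rng_state_{rank}.pth"), "wb") as f:
            f.write(b"state")


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint-1")
    procs = [mp.Process(target=_worker, args=(r, ckpt)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Rank 1 exits non-zero, demonstrating the race.
    print([p.exitcode for p in procs])
```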
That would be great @thundergolfer. I'm hesitant to do too much in this one PR, since this addresses the main issue for base users of the Trainer. A follow-up (possibly not part of this patch?) where we can look in-depth at it for axolotl and ensure that it doesn't break other parts would be good.
It works. Thanks for the quick fix. @muellerzr
* Fix bug
* Write test
* Keep back old modification for grad accum steps
* Whitespace...
* Whitespace again
* Race condition
* Wait for everyone
It didn't fix the issue for multi-node training.
@dumpmemory do you have the full stack trace and perhaps a small reproduction script? I think there's still a race condition where checking for directory existence is not atomic with the rename attempt. If it's possible that the dir has already been moved, we should just attempt the rename and catch the potential exception. Also it seems simpler to have only the
I am training in a multi-GPU setting with a shared file system, so each node's rank 0 process tries to do the rename, which is a race condition.
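A sketch of the "attempt the rename and catch the exception" approach suggested above, assuming a plain os.rename on a shared filesystem; the function and directory names are illustrative, not the actual Trainer code:

```python
import os


def promote_checkpoint(staging_dir: str, final_dir: str) -> None:
    # Checking that the staging directory exists and then renaming it is not
    # atomic: on a shared filesystem, another node's rank 0 can complete the
    # rename in the window between the check and the call. Attempting the
    # rename and tolerating the "someone else already moved it" outcome
    # closes that gap.
    try:
        os.rename(staging_dir, final_dir)
    except FileNotFoundError:
        if not os.path.isdir(final_dir):
            # The staging dir is gone and nobody produced the final dir:
            # this is a real failure, so re-raise.
            raise
```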
Please check the main branch; it might be an issue with NFS.
What does this PR do?
There was a bug introduced in #27820: on multi-GPU systems we would hit a race condition after saving, because multiple processes tried to rename the staging directory and it cannot be renamed multiple times. This PR ensures that the rename only happens on the main process.
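In rough terms, the idea is that only the main process performs the rename while the other ranks wait for it. The sketch below is a simplified illustration, not the actual diff; the function and argument names are invented, and the barrier stands in for the "Wait for everyone" commit:

```python
import os

import torch.distributed as dist


def finalize_checkpoint(staging_output_dir: str, output_dir: str, is_main_process: bool) -> None:
    # Renaming the same staging directory from several ranks raises on every
    # rank but the first, so only the main process does it ("Race condition"
    # commit). The barrier keeps any rank from proceeding until the
    # checkpoint has reached its final name ("Wait for everyone" commit).
    if is_main_process and os.path.isdir(staging_output_dir):
        os.rename(staging_output_dir, output_dir)
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```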
Fixes #27925
Alternative to #27929
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@amyeroberts
I would recommend a patch release as this is fully blocking users on multi-GPU after the last release.