
Fix bug with rotating checkpoints #28009

Merged: 7 commits merged into main from del-only-on-main on Dec 13, 2023

Conversation

@muellerzr (Contributor)

What does this PR do?

#27820 introduced a bug on multi-GPU systems: after saving, multiple processes raced to rename the staging checkpoint directory, which can only succeed once. This PR ensures the rename only happens on the main process (see the sketch below).


Fixes #27925
Alternative to #27929
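
For context, a minimal sketch of the pattern described above (not the Trainer's actual code; the paths and the bare Accelerator handle are illustrative):

```python
import os

from accelerate import Accelerator

accelerator = Accelerator()

# Illustrative paths; the real Trainer builds these from the output dir and step.
staging_dir = "output/tmp-checkpoint-500"
final_dir = "output/checkpoint-500"

# Make sure every process has finished writing into the staging directory.
accelerator.wait_for_everyone()

# Only the main process rotates the staging directory; renaming it from
# several processes at once is the race this PR fixes.
if accelerator.is_main_process and os.path.exists(staging_dir):
    os.rename(staging_dir, final_dir)

# Keep the other processes from reading (or re-creating) the checkpoint
# directory before the rename has landed.
accelerator.wait_for_everyone()
```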

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts

I would recommend a patch release as this is fully blocking users on multi-GPU after the last release.

@thundergolfer (Contributor) commented Dec 13, 2023

Somewhat of an aside, but there's no guarantee that a previous writer has created the directory before this point: https://github.com/huggingface/transformers/pull/28009/files#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deR2379

I've seen recently that a process entering this function can skip past the save operations that would create the directory, and arrive at this point before another process (the 'main' one) has had a chance to create it.


Also, is should_save only ever True for a single process, the main process? If so, then it's a misnomer. It's documented as:

Whether or not the current process should write to disk, e.g., to save models and checkpoints.

But in a multi-GPU scenario, multiple processes participate in disk writing against the checkpoint directory.

PS. Sorry for the bug! I didn't test my original change on multi-GPU.

@amyeroberts (Collaborator) left a comment


Thanks for fixing and adding a test!

cc @ArthurZucker who's preparing a patch release

@muellerzr (Contributor, Author)

@thundergolfer re:

Whether or not the current process should write to disk, e.g., to save models and checkpoints.

Yes. You'll find it's used sparingly during saving of the weights, but the internal check is that we're on process 0.
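
For illustration only, a hedged sketch of the kind of check being described; the real TrainingArguments property may differ in detail (for example around a save-on-each-node option):

```python
def should_save(process_index: int, local_process_index: int, save_on_each_node: bool = False) -> bool:
    """Illustrative stand-in for the check described above: checkpoints are
    written by process 0 (or by each node's local process 0 when saving on
    every node)."""
    if save_on_each_node:
        return local_process_index == 0
    return process_index == 0
```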

@muellerzr (Contributor, Author) commented Dec 13, 2023

@thundergolfer re:

I've seen recently that a process entering this function can skip past save operations which would create the directory and arrive at this point before another process (the 'main') has a chance to create the directory.

Can you give an example so I can contextualize the logic in the code to see where we need to fix/make the directory instead?

My best guess is you don't have a model?

I can put an os.mkdir there instead as an option, but it would be good for us to be able to write a test for it.
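
For reference, a minimal sketch of that os.makedirs option (the path is illustrative); exist_ok=True makes it safe to call from any process:

```python
import os

staging_dir = "output/tmp-checkpoint-500"  # illustrative path

# Safe even if the main process (or another rank) has already created it;
# exist_ok=True swallows the "already exists" error.
os.makedirs(staging_dir, exist_ok=True)
```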

@thundergolfer (Contributor)

Can you give an example...

It was happening in a multi-GPU scenario when using the axolotl framework. I figured it was because a non-main process was skipping all the model-saving steps and doing the unconditional save of state before the main process created the directory. I can look to contribute a failing test which would motivate the change 👍

@muellerzr (Contributor, Author) commented Dec 13, 2023

That would be great @thundergolfer. I'm hesitant to do too much in this one PR, since it addresses the main issue for base users of the Trainer. A follow-up (possibly not part of this patch?) where we look in depth at the axolotl case and ensure it doesn't break other parts would be good.

muellerzr merged commit 9376625 into main on Dec 13, 2023
21 checks passed
muellerzr deleted the del-only-on-main branch on December 13, 2023 at 17:17
@tangwiki

It works. Thanks for the quick fix. @muellerzr

ArthurZucker pushed a commit that referenced this pull request Dec 14, 2023
* Fix bug

* Write test

* Keep back old modification for grad accum steps

* Whitespace...

* Whitespace again

* Race condition

* Wait for everyone
@dumpmemory (Contributor)

It didn't fix the issue for multi-node training (transformers 4.36.1):

FileNotFoundError: [Errno 2] No such file or directory: '/tmp-checkpoint-14696' -> '/checkpoint-14696'

@thundergolfer (Contributor)

@dumpmemory do you have the full stack-trace and perhaps a small reproduction script?

I think there's still a race condition where checking for directory existence is not atomic with the rename attempt. If it's possible that the dir has already been moved, we should just attempt the rename and catch the potential exception.

https://github.com/huggingface/transformers/pull/28009/files#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deR2390

Also, it seems simpler to have only the process with .should_save == True (the main process) do the rename. The rename can only succeed once, and only the main process should perform it.
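
A minimal sketch of that suggestion (illustrative paths, not the Trainer's code): skip the non-atomic existence check, attempt the rename, and treat a missing source as "already rotated by someone else".

```python
import os

staging_dir = "output/tmp-checkpoint-500"  # illustrative paths
final_dir = "output/checkpoint-500"

try:
    os.rename(staging_dir, final_dir)
except FileNotFoundError:
    # Another process already moved the staging directory (or it was never
    # created); either way there is nothing left to rotate here.
    pass
```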

@dumpmemory (Contributor)

@dumpmemory do you have the full stack-trace and perhaps a small reproduction script?

I think there's still a race condition where checking for directory existence is not atomic with the rename attempt. If it's possible that the dir has already been moved, we should just attempt the rename and catch the potential exception.

https://github.com/huggingface/transformers/pull/28009/files#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deR2390

Also, it seems simpler to have only the process with .should_save == True (the main process) do the rename. The rename can only succeed once, and only the main process should perform it.

I am training in a multi-node setting with a shared file system, so each node's rank 0 process tries to do the rename, which is a race condition.
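
For the shared-filesystem case described above, a hedged sketch of one way to avoid the per-node race (assuming a torch.distributed process group has been initialized; paths are illustrative): only the single global rank 0 rotates the directory, and everyone else waits.

```python
import os

import torch.distributed as dist

staging_dir = "output/tmp-checkpoint-14696"  # illustrative paths
final_dir = "output/checkpoint-14696"

# On a shared filesystem there is only one physical directory, so the rename
# must be done by exactly one process cluster-wide (the global rank 0), not
# by every node's local rank 0, which is the race reported above.
if dist.get_rank() == 0 and os.path.exists(staging_dir):
    os.rename(staging_dir, final_dir)

dist.barrier()
```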

iantbutler01 pushed a commit to BismuthCloud/transformers that referenced this pull request Dec 16, 2023
* Fix bug

* Write test

* Keep back old modification for grad accum steps

* Whitespace...

* Whitespace again

* Race condition

* Wait for everyone
@jiezhangGt

It didn't fix the issue for multi-node training (transformers 4.36.2):

FileNotFoundError: [Errno 2] No such file or directory: '/tmp-checkpoint-5' -> '/checkpoint-5'

@dumpmemory (Contributor)

It didn't fix the issue for multi-node training (transformers 4.36.2): FileNotFoundError: [Errno 2] No such file or directory: '/tmp-checkpoint-5' -> '/checkpoint-5'

Please check the main branch; it might be an issue with NFS.

staghado pushed a commit to staghado/transformers that referenced this pull request Jan 15, 2024
* Fix bug

* Write test

* Keep back old modification for grad accum steps

* Whitespace...

* Whitespace again

* Race condition

* Wait for everyone

Successfully merging this pull request may close these issues.

Save model checkpoint error when multi-gpu training