
Fix bug with rotating checkpoints #28009

Merged: 7 commits merged into main from del-only-on-main on Dec 13, 2023

Conversation

@muellerzr (Contributor)

What does this PR do?

#27820 introduced a bug on multi-GPU systems: after saving, multiple processes raced to rename the staging checkpoint directory, which can only succeed once. This PR ensures the rename only happens on the main process (see the sketch below).


Fixes #27925
Alternative to #27929
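
For context, a minimal sketch of the pattern described above (not the Trainer's actual code; the paths and the bare Accelerator handle are illustrative):

```python
import os

from accelerate import Accelerator

accelerator = Accelerator()

# Illustrative paths; the real Trainer builds these from the output dir and step.
staging_dir = "output/tmp-checkpoint-500"
final_dir = "output/checkpoint-500"

# Make sure every process has finished writing into the staging directory.
accelerator.wait_for_everyone()

# Only the main process rotates the staging directory; renaming it from
# several processes at once is the race this PR fixes.
if accelerator.is_main_process and os.path.exists(staging_dir):
    os.rename(staging_dir, final_dir)

# Keep the other processes from reading (or re-creating) the checkpoint
# directory before the rename has landed.
accelerator.wait_for_everyone()
```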

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts

I would recommend a patch release as this is fully blocking users on multi-GPU after the last release.

@thundergolfer (Contributor) commented Dec 13, 2023

Somewhat of an aside, but there's no guarantee that a previous writer has created the directory before this point: https://github.com/huggingface/transformers/pull/28009/files#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deR2379

I've seen recently that a process entering this function can skip past the save operations that would create the directory, and arrive at this point before another process (the 'main' one) has had a chance to create it.


Also, is should_save only ever True for a single process, the main process? If so, then it's a misnomer. It's documented as:

Whether or not the current process should write to disk, e.g., to save models and checkpoints.

But in a multi-GPU scenario, multiple processes participate in disk writing against the checkpoint directory.

PS. Sorry for the bug! I didn't test my original change on multi-GPU.

@amyeroberts (Collaborator) left a comment


Thanks for fixing and adding a test!

cc @ArthurZucker who's preparing a patch release

@muellerzr (Contributor, Author)

@thundergolfer re:

Whether or not the current process should write to disk, e.g., to save models and checkpoints.

Yes. You'll find it's used sparingly during saving of the weights, but the internal check is that we're on process 0.
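
For illustration only, a hedged sketch of the kind of check being described; the real TrainingArguments property may differ in detail (for example around a save-on-each-node option):

```python
def should_save(process_index: int, local_process_index: int, save_on_each_node: bool = False) -> bool:
    """Illustrative stand-in for the check described above: checkpoints are
    written by process 0 (or by each node's local process 0 when saving on
    every node)."""
    if save_on_each_node:
        return local_process_index == 0
    return process_index == 0
```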

@muellerzr (Contributor, Author) commented Dec 13, 2023

@thundergolfer re:

I've seen recently that a process entering this function can skip past save operations which would create the directory and arrive at this point before another process (the 'main') has a chance to create the directory.

Can you give an example so I can contextualize the logic in the code to see where we need to fix/make the directory instead?

My best guess is you don't have a model?

I can put an os.mkdir there instead as an option, but it would be good for us to be able to write a test for it.
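
For reference, a minimal sketch of that os.makedirs option (the path is illustrative); exist_ok=True makes it safe to call from any process:

```python
import os

staging_dir = "output/tmp-checkpoint-500"  # illustrative path

# Safe even if the main process (or another rank) has already created it;
# exist_ok=True swallows the "already exists" error.
os.makedirs(staging_dir, exist_ok=True)
```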

@thundergolfer (Contributor)

Can you give an example...

It was happening in a multi-GPU scenario when using the axolotl framework. I figured it was because a non-main process was skipping all the model-saving steps and doing the unconditional save of state before the main process created the directory. I can look to contribute a failing test which would motivate the change 👍

@muellerzr (Contributor, Author) commented Dec 13, 2023

That would be great @thundergolfer. I'm hesitant to do too much in this one PR, since it addresses the main issue for base users of the Trainer. A follow-up (possibly not part of this patch?) where we look in depth at the axolotl case and ensure it doesn't break other parts would be good.

muellerzr merged commit 9376625 into main on Dec 13, 2023
21 checks passed
muellerzr deleted the del-only-on-main branch on December 13, 2023 at 17:17
@tangwiki

It works. Thanks for the quick fix. @muellerzr

ArthurZucker pushed a commit that referenced this pull request Dec 14, 2023
* Fix bug

* Write test

* Keep back old modification for grad accum steps

* Whitespace...

* Whitespace again

* Race condition

* Wait for everyone
@dumpmemory (Contributor)

It didn't fix the issue for multi-node training (transformers 4.36.1):

FileNotFoundError: [Errno 2] No such file or directory: '/tmp-checkpoint-14696' -> '/checkpoint-14696'

@thundergolfer (Contributor)

@dumpmemory do you have the full stack-trace and perhaps a small reproduction script?

I think there's still a race condition where checking for directory existence is not atomic with the rename attempt. If it's possible that the dir has already been moved, we should just attempt the rename and catch the potential exception.

https://github.com/huggingface/transformers/pull/28009/files#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deR2390

Also, it seems simpler to have only the process with .should_save == True (the main process) do the rename. The rename can only succeed once, and only the main process should perform it.
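
A minimal sketch of that suggestion (illustrative paths, not the Trainer's code): skip the non-atomic existence check, attempt the rename, and treat a missing source as "already rotated by someone else".

```python
import os

staging_dir = "output/tmp-checkpoint-500"  # illustrative paths
final_dir = "output/checkpoint-500"

try:
    os.rename(staging_dir, final_dir)
except FileNotFoundError:
    # Another process already moved the staging directory (or it was never
    # created); either way there is nothing left to rotate here.
    pass
```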

@dumpmemory (Contributor)

@dumpmemory do you have the full stack-trace and perhaps a small reproduction script?

I think there's still a race condition where checking for directory existence is not atomic with the rename attempt. If it's possible that the dir has already been moved, we should just attempt the rename and catch the potential exception.

https://github.com/huggingface/transformers/pull/28009/files#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deR2390

Also, it seems simpler to have only the process with .should_save == True (the main process) do the rename. The rename can only succeed once, and only the main process should perform it.

I am training in a multi-node setting with a shared file system, so each node's rank 0 process tries to do the rename, which is a race condition.
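
For the shared-filesystem case described above, a hedged sketch of one way to avoid the per-node race (assuming a torch.distributed process group has been initialized; paths are illustrative): only the single global rank 0 rotates the directory, and everyone else waits.

```python
import os

import torch.distributed as dist

staging_dir = "output/tmp-checkpoint-14696"  # illustrative paths
final_dir = "output/checkpoint-14696"

# On a shared filesystem there is only one physical directory, so the rename
# must be done by exactly one process cluster-wide (the global rank 0), not
# by every node's local rank 0, which is the race reported above.
if dist.get_rank() == 0 and os.path.exists(staging_dir):
    os.rename(staging_dir, final_dir)

dist.barrier()
```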

iantbutler01 pushed a commit to BismuthCloud/transformers that referenced this pull request Dec 16, 2023
* Fix bug

* Write test

* Keep back old modification for grad accum steps

* Whitespace...

* Whitespace again

* Race condition

* Wait for everyone
@jiezhangGt

It didn't fix the issue for multi-node training (transformers 4.36.2):

FileNotFoundError: [Errno 2] No such file or directory: '/tmp-checkpoint-5' -> '/checkpoint-5'

@dumpmemory (Contributor)

It didn't fix the issue for multi-node training (transformers 4.36.2): FileNotFoundError: [Errno 2] No such file or directory: '/tmp-checkpoint-5' -> '/checkpoint-5'

Please check the main branch; it might be an issue with NFS.

staghado pushed a commit to staghado/transformers that referenced this pull request Jan 15, 2024
* Fix bug

* Write test

* Keep back old modification for grad accum steps

* Whitespace...

* Whitespace again

* Race condition

* Wait for everyone

Successfully merging this pull request may close these issues.

Save model checkpoint error when multi-gpu training