Fix for checkpoint rename race condition #28364

Merged

Conversation

@tblattner (Contributor)

What does this PR do?

When running distributed training with DeepSpeed, I encountered a race condition due to os.rename not being atomic on network filesystems. This rework changes the logic so that the rename only runs on the main processes, or on a single main process, depending on the save_on_each_node flag. It also adds fsync to try to flush buffers, hopefully ensuring the rename is completed. fsync may have no effect on some filesystems, so a better mechanism may be required to ensure the rename has completed.
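(For illustration, a minimal standalone sketch of the pattern described above; finalize_checkpoint, staging_output_dir, and is_main_process are illustrative names, not the Trainer's actual code.)

import os

def finalize_checkpoint(staging_output_dir: str, output_dir: str, is_main_process: bool) -> None:
    # Only the saving process performs the rename, avoiding the race condition.
    if is_main_process and os.path.exists(staging_output_dir):
        os.rename(staging_output_dir, output_dir)
        # Best-effort flush of the directory entry; fsync on a directory may have
        # no effect on some network filesystems (e.g. NFS).
        fd = os.open(output_dir, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)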

Fixes #27925

Before submitting

  • [No] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [Yes] Did you read the contributor guideline,
    Pull Request section?
  • [Discussed on Github issue] Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • [Yes] Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • [No] Did you write any new necessary tests?

Who can review?

@muellerzr
@pacman100

@tblattner (Contributor Author)

One thing to note, I attempted to reuse the existing with block:
with self.args.main_process_first(desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node):

and include the fsync. Unfortunately fsync did not flush buffers related to the staging directory, so it still failed on other processes. This raises some concerns as to the behavior of fsync on network attached storage using NFS.

@siddartha-RE (Contributor)

Oops, missed this. I looked yesterday :) but I guess you posted after I looked. This is my version:
#28373

I don't see why the existence check is required if it is happening only once per node.

@tblattner (Contributor Author)

> Oops, missed this. I looked yesterday :) but I guess you posted after I looked. This is my version: #28373
>
> I don't see why the existence check is required if it is happening only once per node.

There could be a race if the output directory for the checkpoint is used later in the code. If that is not the case, then it shouldn't be an issue.

@muellerzr (Contributor) left a comment:

Hi all, I think the best solution here is a mix of both. @siddartha-RE, your approach of using existing functions in the state is better than Accelerate here, both for readability and for keeping the focus on the Trainer state; and @tblattner, I think using fsync here is more robust and better.

I propose that @siddartha-RE make his PR just the modifications to trainer_callback.py, and we handle the OS issue in the Trainer in this PR.

Thank you both so much for your wonderful solutions and PRs.

src/transformers/trainer.py (outdated review thread)
Comment on lines 2402 to 2403
if self.args.should_save:
    self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
Contributor:

Let's also move this to be under this if/else, so that it is done on just a single process as needed.

Contributor:

(Above the wait_for_everyone)
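(A standalone sketch of the ordering suggested in this thread; rotate_checkpoints and barrier are illustrative stand-ins for the Trainer's _rotate_checkpoints and accelerator.wait_for_everyone, not the actual API.)

import os
from typing import Callable

def finalize_and_rotate(staging_dir: str, final_dir: str, should_save: bool,
                        rotate_checkpoints: Callable[[], None],
                        barrier: Callable[[], None]) -> None:
    if should_save:
        # Rename and rotate on the single saving process only.
        if os.path.exists(staging_dir):
            os.rename(staging_dir, final_dir)
        rotate_checkpoints()
    # All ranks synchronize only after the rename/rotation is done.
    barrier()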

Contributor:

I actually have a related question. I noticed that the current code has push_to_hub unguarded by any check for is_local / is_world zero. This seems incorrect. However, I don't use that option so I didn't want to touch it without understanding implications.

My guess is that it would have been best to do push_to_hub after wait_for_everyone from the final output_dir. Otherwise it seems like the push could end up shipping partially written state.

Contributor (Author):

I reworked the logic for this so that it includes the rotate_checkpoints call. I'm also curious about push_to_hub and the save_to_json call above line 2384.

Contributor:

Agreed; thinking more on it, it should come after, for the reasons you mentioned.

Contributor:

Also, push_to_hub checks is_world_process_zero(), so it's fine :)

tblattner and others added 5 commits January 9, 2024 00:10
Changed logic for renaming staging directory when saving checkpoint to only operate with the main process.

Added fsync functionality to attempt to flush the write changes in case os.rename is not atomic.
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Added with open usage to ensure better file closing as suggested from PR
Added rotate_checkpoints into main process logic
@tblattner force-pushed the fix-checkpoint-rename-race-condition branch from a9fe43c to eb58698 on January 9, 2024 05:10
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@muellerzr (Contributor) left a comment:

Thanks for the great fix!

@muellerzr requested a review from ArthurZucker on January 9, 2024 09:33
@ArthurZucker (Collaborator) left a comment:

LGTM. We no longer get a warning saying that the model checkpoint is renamed, but that's not an issue IMO.

@ArthurZucker merged commit cef2e40 into huggingface:main on Jan 10, 2024
21 checks passed
@ArthurZucker (Collaborator)

Thanks @tblattner 🤗

@xiaojunjie

  • if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
  • if self.args.should_save

May I ask what the difference is between the two lines above? Can line 1 be replaced with line 2?

staghado pushed a commit to staghado/transformers that referenced this pull request Jan 15, 2024
* Changed logic for renaming staging directory when saving checkpoint to only operate with the main process.
Added fsync functionality to attempt to flush the write changes in case os.rename is not atomic.

* Updated styling using make fixup

* Updated check for main process to use built-in versions from trainer

Co-authored-by: Zach Mueller <muellerzr@gmail.com>

* Fixed incorrect usage of trainer main process checks
Added with open usage to ensure better file closing as suggested from PR
Added rotate_checkpoints into main process logic

* Removed "with open" due to not working with directory. os.open seems to work for directories.

---------

Co-authored-by: Zach Mueller <muellerzr@gmail.com>
@siddartha-RE (Contributor)

> • if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
> • if self.args.should_save
>
> May I ask what the difference is between the two lines above? Can line 1 be replaced with line 2?

From the definition of should_save, the two look equivalent, and the second is a little clearer. It would also allow the extra should_save check below to be removed.
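(For reference, a rough paraphrase from memory of how should_save behaves; the actual TrainingArguments implementation may differ in details such as SageMaker model-parallel handling.)

from dataclasses import dataclass

@dataclass
class ArgsSketch:
    save_on_each_node: bool
    local_process_index: int  # rank within the current node
    process_index: int        # global rank across all nodes

    @property
    def should_save(self) -> bool:
        # True on each node's local main process when save_on_each_node is set,
        # otherwise only on the global main process.
        if self.save_on_each_node:
            return self.local_process_index == 0
        return self.process_index == 0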

MadElf1337 pushed a commit to MadElf1337/transformers that referenced this pull request Jan 15, 2024
@Vechtomov

Looks like fd = os.open(output_dir, os.O_RDONLY) doesn't work on Windows (error screenshot omitted).

wgifford pushed a commit to wgifford/transformers that referenced this pull request Jan 21, 2024
@tblattner (Contributor Author)

> Looks like fd = os.open(output_dir, os.O_RDONLY) doesn't work on Windows.

This is indeed an issue. Windows handles directories differently than Linux, and I'm not an expert at Windows development in Python, so I don't have a good solution, sorry!

@Vechtomov

I think this is critical, since training fails at the first checkpoint. Maybe as a workaround we can add a condition for non-Windows systems?

if platform.system() != 'Windows':
    fd = os.open(output_dir, os.O_RDONLY)
    os.fsync(fd)
    os.close(fd)

@muellerzr @siddartha-RE
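(An alternative sketch for platforms where directory file descriptors are unsupported: attempt the fsync and tolerate failure. best_effort_dir_fsync is a hypothetical helper, not part of this PR or of transformers.)

import os

def best_effort_dir_fsync(path: str) -> None:
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return  # opening a directory read-only is not supported here (e.g. Windows)
    try:
        os.fsync(fd)
    except OSError:
        pass  # fsync may be unsupported or a no-op on this filesystem
    finally:
        os.close(fd)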

@muellerzr (Contributor)

Yes, quick PR going in a moment.

AjayP13 pushed a commit to AjayP13/transformers that referenced this pull request Jan 22, 2024
@muellerzr (Contributor)

@Vechtomov @tblattner #28637 fixed it.

Unsure how it affects multi-node on Windows, but if a user has this situation and hits it, we can deal with it then, as there's not really a clean solution for doing so in Python :(

Successfully merging this pull request may close these issues: Save model checkpoint error when multi-gpu training.