
Silently ignore the FileNotFoundError exception when moving the staging output dir #28269

Closed

Conversation


@zeyugao commented Dec 28, 2023

What does this PR do?

Related to #28009

This PR tries to mitigate a filesystem-consistency problem in multi-node training: if one node renames the staging directory, the change may not immediately become visible to the other nodes in a shared-filesystem setup.

In other words, the filesystem is not a reliable synchronization mechanism, unlike CUDA-based synchronization.

As shown in the screenshot attached to the original PR, on node-0 os.path.exists(staging_output_dir) has already become False while on the other node it is still True.

In this PR, I catch the FileNotFoundError exception to mitigate the issue. However, we could instead perform the rename only in the local main (or main) process rather than in every process, which would avoid a try/except that can conceal genuinely unexpected errors; a sketch of the current approach follows.
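
For illustration, here is a minimal sketch of the approach described above. The helper name promote_staging_checkpoint is hypothetical and the code is simplified; it is not the actual diff, which presumably sits inside the Trainer's checkpoint-saving logic:

```python
import os


def promote_staging_checkpoint(staging_output_dir: str, output_dir: str) -> None:
    """Rename the finished staging checkpoint directory to its final name.

    On a shared filesystem, another rank (possibly on another node) may have
    already performed the rename, so the directory can disappear between the
    existence check and os.rename. Treat that race as benign.
    """
    try:
        if os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)
    except FileNotFoundError:
        # Another process renamed the directory first; the checkpoint already
        # lives under output_dir, so there is nothing left to do here.
        pass
```

The alternative mentioned above would instead perform the rename only in a single designated process (for example the local main process), so the race never occurs and there is no try/except that could conceal unexpected errors.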

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@muellerzr and @pacman100

@peter-sk (Contributor) commented:

I think we should opt for a solution that avoids the need for this PR, and one that works for multi-node setups both with and without a shared file system.

@nlpcat (Contributor) commented Jan 4, 2024

May I ask whether we can get a solution for this in the main branch? This issue also blocks us from using 4.36+ for multi-node multi-GPU training. Thanks. + @muellerzr

@ArthurZucker requested a review from muellerzr on January 5, 2024
@muellerzr (Contributor) commented Jan 5, 2024

Agreed with @peter-sk; we're looking at this ASAP, but just ignoring it is not the right solution.

@zeyugao closed this Jan 13, 2024