
Silently ignore the FileNotFoundError exception when moving the staging output dir #28269

Closed

Conversation


@zeyugao commented Dec 28, 2023

What does this PR do?

Related to #28009

This PR tries to mitigate a filesystem-consistency problem in multi-node training: if one node renames the staging directory, the change may not immediately become visible to the other nodes in a shared-filesystem setup.

In other words, the filesystem is not a reliable synchronization mechanism, unlike CUDA-based synchronization.

As shown in the screenshot attached to the original PR, on node-0 os.path.exists(staging_output_dir) has already become False while on the other node it is still True.

In this PR, I catch the FileNotFoundError exception to mitigate the issue. However, we could instead perform the rename only in the local main (or main) process rather than in every process, which would avoid a try/except that can conceal genuinely unexpected errors; a sketch of the current approach follows.
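
For illustration, here is a minimal sketch of the approach described above. The helper name promote_staging_checkpoint is hypothetical and the code is simplified; it is not the actual diff, which presumably sits inside the Trainer's checkpoint-saving logic:

```python
import os


def promote_staging_checkpoint(staging_output_dir: str, output_dir: str) -> None:
    """Rename the finished staging checkpoint directory to its final name.

    On a shared filesystem, another rank (possibly on another node) may have
    already performed the rename, so the directory can disappear between the
    existence check and os.rename. Treat that race as benign.
    """
    try:
        if os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)
    except FileNotFoundError:
        # Another process renamed the directory first; the checkpoint already
        # lives under output_dir, so there is nothing left to do here.
        pass
```

The alternative mentioned above would instead perform the rename only in a single designated process (for example the local main process), so the race never occurs and there is no try/except that could conceal unexpected errors.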

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@muellerzr and @pacman100

@peter-sk (Contributor) commented:

I think we should opt for a solution that avoids the need for this PR, and one that works for multi-node setups both with and without a shared file system.

@nlpcat (Contributor) commented Jan 4, 2024

May I ask whether we can get a solution for this in the main branch? This issue also blocks us from using 4.36+ for multi-node multi-GPU training. Thanks. + @muellerzr

@ArthurZucker requested a review from muellerzr on January 5, 2024
@muellerzr (Contributor) commented Jan 5, 2024

Agreed with @peter-sk; we're looking at this ASAP, but just ignoring it is not the right solution.

@zeyugao closed this Jan 13, 2024