Adapt timeout value of torch.dist.init_process_group depending on elapsed time #2422

Merged

@eunwoosh eunwoosh commented Aug 7, 2023

Summary

When a huge dataset is used, preparing it can take more than 30 seconds. This raises an error during multi-GPU training because the torch.dist.init_process_group timeout is currently set to 30 seconds.
So I implemented a feature that adapts the timeout value when the process start time is given.
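
As a rough illustration of the idea in this summary (not the code merged in this PR), the timeout can be derived from how long the process has already been running. The function name `adapt_timeout`, the `now` parameter, and the "twice the elapsed time" margin are all assumptions made for this sketch.

```python
from datetime import datetime, timedelta
from typing import Optional

# The 30-second default comes from the PR summary.
DEFAULT_TIMEOUT = timedelta(seconds=30)


def adapt_timeout(
    start_time: Optional[datetime],
    now: Optional[datetime] = None,
    default: timedelta = DEFAULT_TIMEOUT,
) -> timedelta:
    """Return a timeout that grows with the time elapsed since start_time.

    Hypothetical policy: use twice the elapsed time, but never less than
    the default. The factor of 2 is an assumed safety margin.
    """
    if start_time is None:
        # No start time given: keep the existing default behavior.
        return default
    if now is None:
        now = datetime.now()
    elapsed = now - start_time
    return max(default, elapsed * 2)
```

The returned `datetime.timedelta` could then be passed as the `timeout` argument of `torch.distributed.init_process_group`, which accepts a `timedelta`.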

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have added e2e tests for validation.
  • I have added the description of my changes into CHANGELOG in my target branch (e.g., CHANGELOG in develop).
  • I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
  • I have linked related issues.

License

  • I submit my code changes under the same Apache License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2023 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

@github-actions github-actions bot added CLI Any changes in OTE CLI ALGO Any changes in OTX Algo Tasks implementation TEST Any changes in tests DOC Improvements or additions to documentation labels Aug 7, 2023
@eunwoosh eunwoosh marked this pull request as ready for review August 7, 2023 07:58
@eunwoosh eunwoosh requested a review from a team as a code owner August 7, 2023 07:58

@goodsong81 goodsong81 left a comment

Sorry for the nitpick, but could you fix the changelog while resolving the conflict anyway? :)

@sungmanc sungmanc changed the title Adpat timeout vlaue of torch.dist.init_process_group depending on elapsed time Adpat timeout value of torch.dist.init_process_group depending on elapsed time Aug 8, 2023
sungmanc commented Aug 8, 2023

I have a question: are dataset preparation and dist group initialization conducted in parallel? Why does dataset preparation affect the dist group initialization?

goodsong81 previously approved these changes Aug 8, 2023
eunwoosh commented Aug 8, 2023

> I have a question: are dataset preparation and dist group initialization conducted in parallel? Why does dataset preparation affect the dist group initialization?

The child processes are spawned after the dataset is prepared. The main process then initializes the process group first and waits for the child processes to initialize, but each child process has to prepare the dataset before it can initialize. So if dataset preparation takes longer than the timeout, a timeout error is raised.
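
The failure mode described here can be reproduced in miniature with only the standard library. This is an analogy, not the OTX or PyTorch code path: `threading.Barrier` stands in for the process group rendezvous, `time.sleep` stands in for dataset preparation in the child, and the function name `rendezvous` is made up for this sketch.

```python
import threading
import time


def rendezvous(prep_seconds: float, timeout_seconds: float) -> str:
    """Return "ok" if the child arrives before the main side times out, else "timeout"."""
    barrier = threading.Barrier(2)

    def child() -> None:
        # Stand-in for the child preparing the dataset before initializing.
        time.sleep(prep_seconds)
        try:
            barrier.wait(timeout=timeout_seconds)
        except threading.BrokenBarrierError:
            # The main side already gave up, so the barrier is broken.
            pass

    worker = threading.Thread(target=child)
    worker.start()
    try:
        # The main side is ready immediately and waits for the child.
        barrier.wait(timeout=timeout_seconds)
        result = "ok"
    except threading.BrokenBarrierError:
        result = "timeout"
    worker.join()
    return result
```

With a preparation time longer than the timeout, for example `rendezvous(0.3, 0.05)`, the main side gives up before the child arrives. With a timeout longer than the preparation time, for example `rendezvous(0.01, 1.0)`, both sides meet.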

sungmanc commented Aug 8, 2023

> > I have a question: are dataset preparation and dist group initialization conducted in parallel? Why does dataset preparation affect the dist group initialization?
>
> The child processes are spawned after the dataset is prepared. The main process then initializes the process group first and waits for the child processes to initialize, but each child process has to prepare the dataset before it can initialize. So if dataset preparation takes longer than the timeout, a timeout error is raised.

Understood, thanks for the kind explanation.

sungmanc previously approved these changes Aug 8, 2023
@eunwoosh eunwoosh dismissed stale reviews from sungmanc and goodsong81 via 1bd56ea August 8, 2023 04:46
@eunwoosh eunwoosh force-pushed the dist_timeout_auto_alignment branch from 22c1926 to 1bd56ea Compare August 8, 2023 04:46
eunwoosh commented Aug 8, 2023

Sorry, I added a commit to resolve the conflict in CHANGELOG.md. Could you review the PR again, @sungmanc @goodsong81?

@sungmanc sungmanc left a comment

No problem :)

@eunwoosh eunwoosh merged commit 6f0aa7f into openvinotoolkit:develop Aug 8, 2023