Adapt timeout value of torch.distributed.init_process_group depending on elapsed time #2422

eunwoosh · 2023-08-07T06:09:03Z

Summary

When a huge dataset is used, preparing it takes more than 30 seconds. This raises an error during multi-GPU training, because the torch.distributed.init_process_group timeout is currently set to 30 seconds.
So I implemented a feature that adapts the timeout value when the process start time is given.
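
As a rough illustration of the idea (not the code in this PR), the timeout could be derived from the elapsed time as sketched below. The helper name `adapted_timeout`, the `now` parameter, and the rule "base timeout plus elapsed time" are all assumptions made for the example; the rule actually merged may differ.

```python
import time
from datetime import timedelta
from typing import Optional

DEFAULT_TIMEOUT_SEC = 30  # the fixed value mentioned in the summary


def adapted_timeout(
    start_time: Optional[float],
    now: Optional[float] = None,
    base_sec: float = DEFAULT_TIMEOUT_SEC,
) -> timedelta:
    """Hypothetical helper: extend the base timeout by the time already elapsed.

    start_time is a time.time() stamp taken when the process started.
    If it is not given, the base timeout is returned unchanged.
    """
    if start_time is None:
        return timedelta(seconds=base_sec)
    if now is None:
        now = time.time()
    # Assumed rule: a child needs about as long as the parent already spent
    # (for example on dataset preparation), plus the base allowance.
    elapsed = max(0.0, now - start_time)
    return timedelta(seconds=base_sec + elapsed)
```

The result is a `datetime.timedelta`, which is the type accepted by the `timeout` argument of `torch.distributed.init_process_group`, so it could be passed there directly.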

How to test

Checklist

I have added unit tests to cover my changes.
I have added integration tests to cover my changes.
I have added e2e tests for validation.
I have added the description of my changes into CHANGELOG in my target branch (e.g., CHANGELOG in develop).
I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
I have linked related issues.

License

I submit my code changes under the same Apache License that covers the project.
Feel free to contact the maintainers if that's a concern.
I have updated the license header for each file (see an example below).

# Copyright (C) 2023 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

goodsong81

Sorry for the nitpick, but could you fix the changelog while resolving the conflict anyway? :)

CHANGELOG.md

sungmanc · 2023-08-08T04:21:34Z

I have a question: are dataset preparation and dist group initialization conducted in parallel? Why does preparing the dataset affect the dist group initialization?

eunwoosh · 2023-08-08T04:35:41Z

> I have a question: are dataset preparation and dist group initialization conducted in parallel? Why does preparing the dataset affect the dist group initialization?

The child processes are spawned after the dataset is prepared. So when the main process initializes first and waits for the child processes to initialize, each child process has to prepare the dataset before it can initialize. If dataset preparation takes longer than the timeout, a timeout error is raised.
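
The ordering described above can be mimicked with a toy model. This is not OTX or PyTorch code: threads stand in for processes, `threading.Barrier` stands in for the process group rendezvous, and the function name `rendezvous` is made up for the example.

```python
import threading
import time


def rendezvous(prep_sec: float, timeout_sec: float) -> bool:
    """Return True if the child arrives before the main side times out."""
    barrier = threading.Barrier(2)

    def child() -> None:
        time.sleep(prep_sec)  # stands in for dataset preparation in the child
        try:
            barrier.wait()
        except threading.BrokenBarrierError:
            pass  # the main side already gave up waiting

    worker = threading.Thread(target=child)
    worker.start()
    try:
        # The main side is ready immediately and waits for the child.
        barrier.wait(timeout=timeout_sec)
        joined = True
    except threading.BrokenBarrierError:
        joined = False  # preparation outlasted the timeout
    worker.join()
    return joined
```

With a short preparation time the two sides meet; once preparation exceeds the timeout, the waiting side fails, which is the situation the longer timeout is meant to avoid.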

sungmanc · 2023-08-08T04:40:31Z

> > I have a question: are dataset preparation and dist group initialization conducted in parallel? Why does preparing the dataset affect the dist group initialization?

> The child processes are spawned after the dataset is prepared. So when the main process initializes first and waits for the child processes to initialize, each child process has to prepare the dataset before it can initialize. If dataset preparation takes longer than the timeout, a timeout error is raised.

Understood, thanks for the kind explanation

Co-authored-by: Songki Choi <songki.choi@intel.com>

eunwoosh · 2023-08-08T04:47:39Z

Sorry, I added a commit to resolve the CHANGELOG.md conflict. Could you review the PR again, @sungmanc @goodsong81?

sungmanc

No problem :)

github-actions bot added the CLI, ALGO, TEST, and DOC labels Aug 7, 2023

eunwoosh marked this pull request as ready for review August 7, 2023 07:58

eunwoosh requested a review from a team as a code owner August 7, 2023 07:58

goodsong81 suggested changes Aug 8, 2023

sungmanc changed the title ~~Adpat timeout vlaue of torch.dist.init_process_group depending on elapsed time~~ Adpat timeout value of torch.dist.init_process_group depending on elapsed time Aug 8, 2023

goodsong81 previously approved these changes Aug 8, 2023

sungmanc previously approved these changes Aug 8, 2023

eunwoosh and others added 5 commits August 8, 2023 13:45

implement auto dist timeout alignment (59f5edd)

update unit test (4046be7)

align with pre-commit (01a5f6b)

update changelog (a2c7d64)

Update CHANGELOG.md (1bd56ea)

Co-authored-by: Songki Choi <songki.choi@intel.com>

eunwoosh dismissed stale reviews from sungmanc and goodsong81 via 1bd56ea August 8, 2023 04:46

eunwoosh force-pushed the dist_timeout_auto_alignment branch from 22c1926 to 1bd56ea August 8, 2023 04:46

sungmanc approved these changes Aug 8, 2023

goodsong81 approved these changes Aug 8, 2023

jaegukhyun approved these changes Aug 8, 2023

eunwoosh merged commit 6f0aa7f into openvinotoolkit:develop Aug 8, 2023
