Adds timeout argument to training_args to avoid socket timeouts in DDP #18562
Conversation
The documentation is not available anymore as the PR was closed or merged.
Hey @gugarosa, thanks for your PR! I'm asking Sylvain to review it, as he's the maintainer of that part of the library. Thanks for your patience 🙏
No worries @LysandreJik! Thanks so much for the attention!
Thanks a lot for working on this! Left a few comments on the naming, and some documentation is missing.
src/transformers/training_args.py (outdated)
@@ -963,6 +964,19 @@ class TrainingArguments:
            )
        },
    )
    timeout: Optional[int] = field(
Suggested change:
-    timeout: Optional[int] = field(
+    ddp_timeout: Optional[int] = field(
Let's make it clear this is a DDP argument.
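For illustration, the renamed field could look roughly like this (a minimal sketch; the default of 1800 and the shortened help string are taken from suggestions further down in this review, not from the merged diff):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrainingArgumentsSketch:
    # Renamed from `timeout` so it clearly reads as a DDP-related argument.
    ddp_timeout: Optional[int] = field(
        default=1800,
        metadata={
            "help": "Overrides the default timeout for distributed training (value should be given in seconds)."
        },
    )
```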
src/transformers/training_args.py (outdated)
"Overrides the default timeout defined by PyTorch and" | ||
" introduces a way to prevent Socket Timeout when mapping large datasets." | ||
" Expects timeout in seconds. Used for timeout argument in" | ||
" torch.distributed.init_process_group calls. Please refer the PyTorch documentation" | ||
" https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group" | ||
" for more information." |
This is a bit too long here, and the docstring for this new argument is missing. I'd limit the help here to:

Suggested change:
-                "Overrides the default timeout defined by PyTorch and"
-                " introduces a way to prevent Socket Timeout when mapping large datasets."
-                " Expects timeout in seconds. Used for timeout argument in"
-                " torch.distributed.init_process_group calls. Please refer the PyTorch documentation"
-                " https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group"
-                " for more information."
+                "Overrides the default timeout for distributed training (value should be given in seconds)."
and you can add more info in the docstring. |
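A sketch of how that split could look once the detail moves into the class docstring (the wording here is adapted from the suggestions later in this review, so treat it as an approximation rather than the merged text):

```python
# The argparse `help` stays short; the longer explanation lives in the class
# docstring. Wording adapted from this review, not the exact merged docstring.
class TrainingArgumentsDocSketch:
    """
    Parameters:
        ddp_timeout (`int`, *optional*, defaults to 1800):
            The timeout for `torch.distributed.init_process_group` calls, used to avoid GPU socket timeouts when
            performing slow operations in distributed runs. Please refer to the PyTorch documentation
            (https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) for more
            information.
    """
```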
Sounds perfect! Thanks @sgugger. I will push the changes in a couple of minutes.
Thanks for iterating! I have two last small nits and we can merge this.
src/transformers/training_args.py (outdated)
@@ -481,6 +482,11 @@ class TrainingArguments:
            are also available. See the [Ray documentation](
            https://docs.ray.io/en/latest/tune/api_docs/analysis.html#ray.tune.ExperimentAnalysis.get_best_trial) for
            more options.
        ddp_timeout (`int`, *optional*, defaults to `1800`):
            The timeout for torch.distributed.init_process_group calls, used to avoid GPU socket timeouts when
Suggested change:
-            The timeout for torch.distributed.init_process_group calls, used to avoid GPU socket timeouts when
+            The timeout for `torch.distributed.init_process_group` calls, used to avoid GPU socket timeouts when
src/transformers/training_args.py (outdated)
@@ -481,6 +482,11 @@ class TrainingArguments:
            are also available. See the [Ray documentation](
            https://docs.ray.io/en/latest/tune/api_docs/analysis.html#ray.tune.ExperimentAnalysis.get_best_trial) for
            more options.
        ddp_timeout (`int`, *optional*, defaults to `1800`):
Suggested change:
-        ddp_timeout (`int`, *optional*, defaults to `1800`):
+        ddp_timeout (`int`, *optional*, defaults to 1800):
No code blocks for ints :-)
You just need to run `make style`.
My bad! I always forget to run it. Just squashed the previous commits and added the `make style` changes. Thanks for all the attention on this PR!
Squashed commits (huggingface#18562):
* chore(training_args): Adds support for timeout argument.
* fix(training_args): Passes make style through changes.
* fix(training_args): Removes wrong docstring sentence.
* fix(training_args): Fixes timeout not being JSON serializable.
* fix(training_args_sm): Also updates timeout to timeout_delta.
* fix(training_args): Fixes PR according to suggestions.
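The "timeout not being JSON serializable" and "timeout to timeout_delta" commits above can be illustrated with a small sketch: store the value as a plain `int` so the arguments can still be dumped to JSON, and convert to `datetime.timedelta` only where PyTorch expects one. The property name `ddp_timeout_delta` is illustrative and not necessarily the exact code in this PR:

```python
import json
from dataclasses import asdict, dataclass
from datetime import timedelta


@dataclass
class ArgsSketch:
    # Stored as a plain int so the arguments remain JSON serializable.
    ddp_timeout: int = 1800

    @property
    def ddp_timeout_delta(self) -> timedelta:
        # Converted to a timedelta only at the point where PyTorch needs one.
        return timedelta(seconds=self.ddp_timeout)


args = ArgsSketch(ddp_timeout=3600)
print(json.dumps(asdict(args)))  # {"ddp_timeout": 3600} -- serialization keeps working
print(args.ddp_timeout_delta)    # 1:00:00 -- ready to hand to init_process_group
```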
What does this PR do?
This PR follows the work done in #18081 and adds a `timeout` argument to `TrainingArguments` to avoid Socket Timeouts when using PyTorch's `torch.distributed.init_process_group`: https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group. The `timeout` argument has existed since PyTorch 1.0.0 (https://pytorch.org/docs/1.0.0/distributed.html), so this prevents any regression.

Fixes #18054 and #17106, and finishes the open PR #18081.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.