
Adds timeout argument to training_args to avoid socket timeouts in DDP #18562

Merged · 8 commits merged into huggingface:main on Sep 1, 2022
Conversation

@gugarosa (Contributor) commented on Aug 10, 2022

What does this PR do?

This PR follows the work done in #18081 and adds a timeout argument to TrainingArguments to avoid socket timeouts when using PyTorch's torch.distributed.init_process_group: https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group

The timeout argument has existed since PyTorch 1.0.0 (https://pytorch.org/docs/1.0.0/distributed.html), so this does not introduce any regression.

Fixes #18054 and #17106, and finishes the open PR #18081.
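
For context, a minimal sketch of the PyTorch call this argument ultimately feeds into (illustrative only, not the Trainer's actual initialization code; it would need to run under a distributed launcher such as torchrun, and the value 7200 is an arbitrary example):

```python
from datetime import timedelta

import torch.distributed as dist

# init_process_group accepts a `timeout` timedelta; with the default of
# 30 minutes, a rank that spends longer than that preprocessing/mapping data
# before its first collective call hits a socket timeout.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(seconds=7200),  # e.g. a value taken from TrainingArguments
)
```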

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev commented on Aug 10, 2022

The documentation is not available anymore as the PR was closed or merged.

@LysandreJik requested a review from sgugger on August 16, 2022
@LysandreJik (Member) commented:

Hey @gugarosa, thanks for your PR! I'm asking Sylvain to review it as he's the maintainer of the Trainer, but he's on leave for the next few weeks. He'll review your PR when he's back!

Thanks for your patience 🙏

@gugarosa (Contributor, Author) commented:

No worries @LysandreJik! Thanks so much for the attention!

@sgugger (Collaborator) left a review

Thanks a lot for working on this! Left a few comments on the naming, and some documentation is missing.

@@ -963,6 +964,19 @@ class TrainingArguments:
)
},
)
timeout: Optional[int] = field(

@sgugger (Collaborator) commented on this line:

Suggested change
- timeout: Optional[int] = field(
+ ddp_timeout: Optional[int] = field(

Let's make it clear this is a DDP argument.

Comment on lines 971 to 976

"Overrides the default timeout defined by PyTorch and"
" introduces a way to prevent Socket Timeout when mapping large datasets."
" Expects timeout in seconds. Used for timeout argument in"
" torch.distributed.init_process_group calls. Please refer the PyTorch documentation"
" https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group"
" for more information."

@sgugger (Collaborator) commented:

This is a bit too long here, and the docstring for this new argument is missing. I'd limit the help here to

Suggested change
"Overrides the default timeout for distributed training (value should be given in seconds)."

and you can add more info in the docstring.
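
Putting the rename and the shortened help together, the field could end up looking roughly like this (a sketch of the suggested shape, not necessarily the exact merged code; the 1800-second default matches the docstring discussed below):

```python
ddp_timeout: Optional[int] = field(
    default=1800,  # PyTorch's default timeout for init_process_group, in seconds
    metadata={
        "help": "Overrides the default timeout for distributed training (value should be given in seconds)."
    },
)
```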

@gugarosa (Contributor, Author) replied:

Sounds perfect! Thanks @sgugger. I will push the changes in a couple of minutes.

@sgugger (Collaborator) left a second review

Thanks for iterating! I have two last small nits and we can merge this.

@@ -481,6 +482,11 @@ class TrainingArguments:
are also available. See the [Ray documentation](
https://docs.ray.io/en/latest/tune/api_docs/analysis.html#ray.tune.ExperimentAnalysis.get_best_trial) for
more options.
ddp_timeout (`int`, *optional*, defaults to `1800`):
    The timeout for torch.distributed.init_process_group calls, used to avoid GPU socket timeouts when

@sgugger (Collaborator) commented on this line:

Suggested change
- The timeout for torch.distributed.init_process_group calls, used to avoid GPU socket timeouts when
+ The timeout for `torch.distributed.init_process_group` calls, used to avoid GPU socket timeouts when

On the line above in the same hunk:

ddp_timeout (`int`, *optional*, defaults to `1800`):

@sgugger (Collaborator) commented:

Suggested change
- ddp_timeout (`int`, *optional*, defaults to `1800`):
+ ddp_timeout (`int`, *optional*, defaults to 1800):

No code blocks for ints :-)

@sgugger (Collaborator) commented on Sep 1, 2022

You just need to run make style and we should be good!

@gugarosa (Contributor, Author) commented on Sep 1, 2022

> You just need to run make style and we should be good!

My bad! I always forget to run it. Just squashed the previous commits and applied make style. Hopefully, it will pass all tests in a couple of minutes!

Thanks for all the attention on this PR!

@sgugger merged commit fe58929 into huggingface:main on Sep 1, 2022
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request on Sep 26, 2022:

Adds timeout argument to training_args to avoid socket timeouts in DDP (huggingface#18562)

* chore(training_args): Adds support for timeout argument.
* fix(training_args): Passes make style through changes.
* fix(training_args): Removes wrong docstring sentence.
* fix(training_args): Fixes timeout not being JSON serializable.
* fix(training_args_sm): Also updates timeout to timeout_delta.
* fix(training_args): Fixes PR according to suggestions.
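
For reference, a brief usage sketch of the merged option (the argument name and 1800-second default follow the docstring discussed above; the rest of the setup, including the output directory and the 7200-second value, is illustrative):

```python
from transformers import TrainingArguments

# ddp_timeout is given in seconds and defaults to 1800 (PyTorch's default).
# It only matters for distributed (DDP) runs, where it is converted to a
# datetime.timedelta and passed to torch.distributed.init_process_group,
# per the timeout_delta change mentioned in the commits above.
args = TrainingArguments(
    output_dir="out",
    ddp_timeout=7200,  # allow up to two hours before a DDP socket timeout
)
```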
Successfully merging this pull request may close these issues.

Include timeout attribute (related to DDP) to TrainingArguments