Adds timeout argument to training_args to avoid socket timeouts in DDP #18562
Conversation
The documentation is not available anymore as the PR was closed or merged.
Hey @gugarosa, thanks for your PR! I'm asking Sylvain to review it, as he's the maintainer of that part of the library. Thanks for your patience 🙏
No worries @LysandreJik! Thanks so much for the attention!
Thanks a lot for working on this! Left a few comments on the naming, and some documentation is missing.
src/transformers/training_args.py (outdated)
@@ -963,6 +964,19 @@ class TrainingArguments:
            )
        },
    )
    timeout: Optional[int] = field(
Suggested change:
-    timeout: Optional[int] = field(
+    ddp_timeout: Optional[int] = field(
Let's make it clear this is a DDP argument.
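For illustration, the renamed field could look roughly like this (a minimal sketch; the default of 1800 and the shortened help string are taken from suggestions further down in this review, not from the merged diff):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrainingArgumentsSketch:
    # Renamed from `timeout` so it clearly reads as a DDP-related argument.
    ddp_timeout: Optional[int] = field(
        default=1800,
        metadata={
            "help": "Overrides the default timeout for distributed training (value should be given in seconds)."
        },
    )
```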
src/transformers/training_args.py (outdated)
"Overrides the default timeout defined by PyTorch and" | ||
" introduces a way to prevent Socket Timeout when mapping large datasets." | ||
" Expects timeout in seconds. Used for timeout argument in" | ||
" torch.distributed.init_process_group calls. Please refer the PyTorch documentation" | ||
" https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group" | ||
" for more information." |
This is a bit too long here, and the docstring for this new argument is missing. I'd limit the help here to:

Suggested change:
-                "Overrides the default timeout defined by PyTorch and"
-                " introduces a way to prevent Socket Timeout when mapping large datasets."
-                " Expects timeout in seconds. Used for timeout argument in"
-                " torch.distributed.init_process_group calls. Please refer the PyTorch documentation"
-                " https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group"
-                " for more information."
+                "Overrides the default timeout for distributed training (value should be given in seconds)."
and you can add more info in the docstring. |
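A sketch of how that split could look once the detail moves into the class docstring (the wording here is adapted from the suggestions later in this review, so treat it as an approximation rather than the merged text):

```python
# The argparse `help` stays short; the longer explanation lives in the class
# docstring. Wording adapted from this review, not the exact merged docstring.
class TrainingArgumentsDocSketch:
    """
    Parameters:
        ddp_timeout (`int`, *optional*, defaults to 1800):
            The timeout for `torch.distributed.init_process_group` calls, used to avoid GPU socket timeouts when
            performing slow operations in distributed runs. Please refer to the PyTorch documentation
            (https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) for more
            information.
    """
```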
Sounds perfect! Thanks @sgugger. I will push the changes in a couple of minutes.
Thanks for iterating! I have two last small nits and we can merge this.
src/transformers/training_args.py (outdated)
@@ -481,6 +482,11 @@ class TrainingArguments:
            are also available. See the [Ray documentation](
            https://docs.ray.io/en/latest/tune/api_docs/analysis.html#ray.tune.ExperimentAnalysis.get_best_trial) for
            more options.
        ddp_timeout (`int`, *optional*, defaults to `1800`):
            The timeout for torch.distributed.init_process_group calls, used to avoid GPU socket timeouts when
Suggested change:
-            The timeout for torch.distributed.init_process_group calls, used to avoid GPU socket timeouts when
+            The timeout for `torch.distributed.init_process_group` calls, used to avoid GPU socket timeouts when
src/transformers/training_args.py (outdated)
@@ -481,6 +482,11 @@ class TrainingArguments:
            are also available. See the [Ray documentation](
            https://docs.ray.io/en/latest/tune/api_docs/analysis.html#ray.tune.ExperimentAnalysis.get_best_trial) for
            more options.
        ddp_timeout (`int`, *optional*, defaults to `1800`):
Suggested change:
-        ddp_timeout (`int`, *optional*, defaults to `1800`):
+        ddp_timeout (`int`, *optional*, defaults to 1800):
No code blocks for ints :-)
You just need to run `make style`.
My bad! I always forget to run it. Just squashed the previous commits and added the `make style` changes. Thanks for all the attention on this PR!
Squashed commits (huggingface#18562):
* chore(training_args): Adds support for timeout argument.
* fix(training_args): Passes make style through changes.
* fix(training_args): Removes wrong docstring sentence.
* fix(training_args): Fixes timeout not being JSON serializable.
* fix(training_args_sm): Also updates timeout to timeout_delta.
* fix(training_args): Fixes PR according to suggestions.
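The "timeout not being JSON serializable" and "timeout to timeout_delta" commits above can be illustrated with a small sketch: store the value as a plain `int` so the arguments can still be dumped to JSON, and convert to `datetime.timedelta` only where PyTorch expects one. The property name `ddp_timeout_delta` is illustrative and not necessarily the exact code in this PR:

```python
import json
from dataclasses import asdict, dataclass
from datetime import timedelta


@dataclass
class ArgsSketch:
    # Stored as a plain int so the arguments remain JSON serializable.
    ddp_timeout: int = 1800

    @property
    def ddp_timeout_delta(self) -> timedelta:
        # Converted to a timedelta only at the point where PyTorch needs one.
        return timedelta(seconds=self.ddp_timeout)


args = ArgsSketch(ddp_timeout=3600)
print(json.dumps(asdict(args)))  # {"ddp_timeout": 3600} -- serialization keeps working
print(args.ddp_timeout_delta)    # 1:00:00 -- ready to hand to init_process_group
```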
What does this PR do?
This PR follows the work done in #18081 and adds a `timeout` argument to `TrainingArguments` to avoid Socket Timeouts when using PyTorch's `torch.distributed.init_process_group`: https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group. The `timeout` argument has existed since PyTorch 1.0.0 (https://pytorch.org/docs/1.0.0/distributed.html), so this prevents any regression.

Fixes #18054 and #17106, and finishes the open PR #18081.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.