This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[T5] Support Distributed Training #4434

Merged
klshuster merged 2 commits into main from t5_distribuetd
Mar 21, 2022


klshuster
Contributor

Patch description
Per #4430, T5 was hanging when run with distributed calls. The cause was the forced setting of the CUDA device, which is required for training with T5 model parallel. That device-setting call is now protected.
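
The patch itself is not shown in this conversation, so the following is only a minimal sketch of what "protecting" a forced device call could look like. The function name `maybe_set_device` and its parameters are hypothetical and are not taken from ParlAI; the assumption is that the device should be pinned only for model-parallel runs that are not also distributed.

```python
from typing import Callable


def maybe_set_device(
    model_parallel: bool,
    is_distributed: bool,
    set_device: Callable[[int], None],
    device_index: int = 0,
) -> bool:
    """Pin the current device only when it is safe to do so.

    Hypothetical helper: returns True if the device was set, False if the
    call was skipped.
    """
    # Assumption: under distributed training each worker already owns its
    # own device, so forcing a single device from every process is the
    # behaviour to avoid.
    if model_parallel and not is_distributed:
        set_device(device_index)
        return True
    return False
```

With PyTorch installed, `set_device` could be `torch.cuda.set_device`, and `is_distributed` could be computed as `torch.distributed.is_available() and torch.distributed.is_initialized()`. Passing the setter in as an argument keeps the guard easy to exercise without a GPU.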

Testing steps

  1. Tested with the command provided in "Fine-tuning T5 models with multiprocessing_train" (#4430).
  2. Added a new distributed test for T5 (TestT5Distributed::test_t5_distributed); full run of test_t5.py:
$ pytest test_t5.py
======test session starts ======
platform linux -- Python 3.7.9, pytest-5.3.2, py-1.10.0, pluggy-0.13.1
rootdir: /private/home/kshuster/ParlAI, inifile: pytest.ini
plugins: hydra-core-1.1.0, requests-mock-1.7.0, regressions-2.1.1, datadir-1.3.1
collected 9 items

test_t5.py .........                                                                                                                                                          [100%]

======slowest 10 test durations ======
97.49s call     tests/nightly/gpu/test_t5.py::TestT5Distributed::test_t5_distributed
60.84s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_t5_model_parallel
46.50s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_t5_ft
19.85s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_t5_gen
16.18s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_small
12.19s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_summarization
10.75s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_translation_en_to_fr
9.85s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_translation_en_to_de
9.79s call     tests/nightly/gpu/test_t5.py::TestT5Model::test_translation_en_to_ro

(0.00 durations hidden.  Use -vv to show these durations.)
======9 passed, 9 warnings in 290.46s (0:04:50) ======

Other information

Contributor

stephenroller left a comment

lordy house of cards

klshuster merged commit f036542 into main on Mar 21, 2022
klshuster deleted the t5_distribuetd branch on March 21, 2022 19:08