[docs] [testing] distributed training (#7993)
* distributed training

* fix

* fix formatting

* wording
stas00 authored Oct 26, 2020
1 parent c153bcc commit 101186b
Showing 1 changed file with 18 additions and 0 deletions.
docs/source/testing.rst
@@ -451,6 +451,24 @@ Inside tests:
Distributed training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``pytest`` can't deal with distributed training directly. If this is attempted, the sub-processes don't do the right thing: they end up thinking they are ``pytest`` and start running the test suite in loops. It works, however, if one spawns a normal process that then spawns off multiple workers and manages the IO pipes.

This is still under development, but you can study two different tests that perform this successfully:

* `test_seq2seq_examples_multi_gpu.py <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/test_seq2seq_examples_multi_gpu.py>`__ - a test that runs ``pytorch-lightning`` (it had to use PL's ``ddp`` spawning method, which is the default)
* `test_finetune_trainer.py <https://github.com/huggingface/transformers/blob/master/examples/seq2seq/test_finetune_trainer.py>`__ - a normal (non-PL) test

To jump right into the execution point, search for the ``execute_async_std`` function in those tests.
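
As an illustration of the general pattern only (not the actual helper those tests use), here is a minimal sketch: the test itself stays a single normal process, launches the distributed workers through a child process, and captures that child's IO pipes. The target script and its arguments are hypothetical:

.. code-block:: python

   # a minimal sketch of the pattern, not the helper used by the tests above;
   # the target script and its arguments are hypothetical
   import subprocess
   import sys

   def test_trainer_ddp():
       cmd = [
           sys.executable,
           "-m",
           "torch.distributed.launch",
           "--nproc_per_node=2",
           "examples/seq2seq/finetune_trainer.py",  # hypothetical target script
           "--output_dir", "/tmp/ddp_test",         # hypothetical arguments
       ]
       # the test remains a normal single process; the launcher child process
       # spawns the distributed workers, and we capture its IO pipes here
       result = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
       assert result.returncode == 0, result.stderr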

You will need at least 2 GPUs to see these tests in action:

.. code-block:: bash

   CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW=1 pytest -sv examples/seq2seq/test_finetune_trainer.py \
   examples/seq2seq/test_seq2seq_examples_multi_gpu.py
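
Here ``CUDA_VISIBLE_DEVICES="0,1"`` restricts the run to the first two GPUs, and ``RUN_SLOW=1`` enables these slow tests, which are skipped by default.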
