
Conversation

@vandanavk vandanavk commented Jan 10, 2020

Description of changes:

Training a model by splitting the dataset across multiple processes on the same machine is considered distributed training. When using SM Debugger with a distributed training script, the user can choose to save data from all workers or from just one worker (the `include_workers` option in the hook).
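For context, a minimal sketch of how this option is typically passed when constructing a hook manually; the module path, output path, and argument values here are assumptions for illustration, not part of this PR:

```python
# Sketch: constructing a PyTorch hook and choosing which workers save data.
# include_workers="all" saves from every worker, "one" from a single worker.
import smdebug.pytorch as smd

hook = smd.Hook(
    out_dir="/tmp/smdebug_run",   # hypothetical output path
    include_workers="all",        # the option discussed above; default is "one"
)
```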

Currently, this option is used in the following scenarios:

| Framework  | Supports                   |
|------------|----------------------------|
| PyTorch    | Horovod, torch.distributed |
| MXNet      | Horovod                    |
| TensorFlow | MirroredWorkerStrategy     |
| XGBoost    | Rabit                      |

The example in https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/dgl_kge uses Python multiprocessing for MXNet distributed training and torch.multiprocessing for PyTorch distributed training. Performing distributed training using multiprocessing is not yet handled by smdebug.

To handle this scenario, this PR introduces the environment variables SMDEBUG_NUM_WORKERS and SMDEBUG_WORKER_NAME. Users must set these when the training script uses the multiprocessing library.
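A hedged sketch of how a training script that spawns workers with Python multiprocessing might set these variables per process; only the variable names come from this PR, while the worker naming scheme and spawn pattern are illustrative assumptions:

```python
# Sketch: a 2-process multiprocessing run that sets the proposed variables
# before the smdebug hook is created in each worker.
import os
import multiprocessing as mp

def train(rank, num_workers):
    os.environ["SMDEBUG_NUM_WORKERS"] = str(num_workers)
    os.environ["SMDEBUG_WORKER_NAME"] = f"worker_{rank}"  # assumed naming scheme
    # ... create the smdebug hook and run the training loop here ...

if __name__ == "__main__":
    num_workers = 2
    procs = [mp.Process(target=train, args=(rank, num_workers)) for rank in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```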

The other alternatives considered were:

  1. Modify the DGL example for PyTorch to use torch.distributed. But, as noted in the referenced comment `# Race condition here where both workers attempt to move`, this may also lead to a race condition. It would also mean that training scripts that use multiprocessing directly would not work.
  2. Append the PID to the collection and event file names. The issue with this is that smdebug does not know how many processes exist, so cases such as trials waiting for all files to be present would break.

Style and formatting:

I have run `pre-commit install` to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@vandanavk vandanavk requested a review from Vikas-kum January 10, 2020 09:03
```python
@pytest.mark.slow
def test_run_net_distributed_multiproc_save_all_workers():
    size = 2
    os.environ["SMDEBUG_NUM_WORKERS"] = "2"
```
Contributor

Use monkeypatch instead; see here:

```python
monkeypatch.setenv("TF_CONFIG", json.dumps({}))
```

Comment on lines +93 to +94

```python
if hasattr(dist, "is_initialized") and dist.is_initialized():
    average_gradients(model)
```
Contributor

reason for these changes?

Contributor Author

@vandanavk vandanavk Jan 17, 2020

When using the multiprocessing approach, torch.distributed is not used (init_process_group is not called), so any reference to dist.get_rank() or dist.get_world_size() will error out.
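In other words, the guard keeps the multiprocessing path from touching uninitialized process-group state; roughly (sketch, the fallback value is an assumption):

```python
# Sketch: only query torch.distributed when init_process_group() was called.
import torch.distributed as dist

def safe_rank(default=0):
    if hasattr(dist, "is_initialized") and dist.is_initialized():
        return dist.get_rank()  # process group exists
    return default              # plain multiprocessing path: dist is unused
```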

Comment on lines +98 to +99

```python
if hasattr(dist, "is_initialized") and dist.is_initialized():
    assert hook._get_worker_name() == f"worker_{dist.get_rank()}"
```
Contributor

Same here?

Comment on lines +93 to +94

```python
if hasattr(dist, "is_initialized") and dist.is_initialized():
    average_gradients(model)
```
Contributor

reason?

@Vikas-kum
Contributor

@vandanavk updates?

@vandanavk vandanavk closed this Jan 28, 2020
@vandanavk
Contributor Author

In the last sprint meeting, we decided to modify the example to use torch.distributed instead of multiprocessing directly. The solution in this PR doesn't cover all scenarios, for example with `include_workers="one"`.
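For reference, a minimal sketch of what moving the example onto torch.distributed could look like; the backend, addressing, and world size are assumptions, not part of this PR:

```python
# Sketch: spawn workers with torch.multiprocessing and initialize a process
# group in each, so smdebug can read dist.get_rank()/dist.get_world_size().
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # assumed single-machine setup
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # ... training loop with the smdebug hook here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```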

leleamol added a commit that referenced this pull request Dec 8, 2020
* Bugfix: Invalid Worker (#139)

* smdistributed.dataparallel environment check

* addressed comments

* Modified check_smdataparallel_env logic

Co-authored-by: Nihal Harish <nihal42harish@gmail.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>