
Conversation

@vandanavk vandanavk commented Jan 10, 2020

Description of changes:

Training a model by splitting the dataset across multiple processes on the same machine is considered distributed training. When using SM Debugger with a distributed training script, the user can choose to save data from all workers or from just one worker (the `include_workers` option in the hook).
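For context, a minimal sketch of how this option is typically passed when constructing a hook manually; the module path, output path, and argument values here are assumptions for illustration, not part of this PR:

```python
# Sketch: constructing a PyTorch hook and choosing which workers save data.
# include_workers="all" saves from every worker, "one" from a single worker.
import smdebug.pytorch as smd

hook = smd.Hook(
    out_dir="/tmp/smdebug_run",   # hypothetical output path
    include_workers="all",        # the option discussed above; default is "one"
)
```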

Currently, this option is used in the following scenarios:

| Framework  | Supports                   |
|------------|----------------------------|
| PyTorch    | Horovod, torch.distributed |
| MXNet      | Horovod                    |
| TensorFlow | MirroredWorkerStrategy     |
| XGBoost    | Rabit                      |

The example in https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/dgl_kge uses Python multiprocessing for MXNet distributed training and torch.multiprocessing for PyTorch distributed training. Performing distributed training using multiprocessing is not yet handled by smdebug.

To handle this scenario, this PR introduces the environment variables SMDEBUG_NUM_WORKERS and SMDEBUG_WORKER_NAME. Users must set these when the training script uses the multiprocessing library.
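A hedged sketch of how a training script that spawns workers with Python multiprocessing might set these variables per process; only the variable names come from this PR, while the worker naming scheme and spawn pattern are illustrative assumptions:

```python
# Sketch: a 2-process multiprocessing run that sets the proposed variables
# before the smdebug hook is created in each worker.
import os
import multiprocessing as mp

def train(rank, num_workers):
    os.environ["SMDEBUG_NUM_WORKERS"] = str(num_workers)
    os.environ["SMDEBUG_WORKER_NAME"] = f"worker_{rank}"  # assumed naming scheme
    # ... create the smdebug hook and run the training loop here ...

if __name__ == "__main__":
    num_workers = 2
    procs = [mp.Process(target=train, args=(rank, num_workers)) for rank in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```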

The other alternatives considered were:

  1. Modify the DGL example for PyTorch to use torch.distributed. But, as noted in the referenced comment `# Race condition here where both workers attempt to move`, this may also lead to a race condition. It would also mean that training scripts that use multiprocessing directly would not work.
  2. Append the PID to the collection and event file names. The issue with this is that smdebug does not know how many processes exist, so cases such as trials waiting for all files to be present would break.

Style and formatting:

I have run `pre-commit install` to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@vandanavk vandanavk requested a review from Vikas-kum January 10, 2020 09:03
```python
@pytest.mark.slow
def test_run_net_distributed_multiproc_save_all_workers():
    size = 2
    os.environ["SMDEBUG_NUM_WORKERS"] = "2"
```
Contributor

Use monkeypatch instead; see here:

```python
monkeypatch.setenv("TF_CONFIG", json.dumps({}))
```

Comment on lines +93 to +94

```python
if hasattr(dist, "is_initialized") and dist.is_initialized():
    average_gradients(model)
```
Contributor

reason for these changes?

Contributor Author

@vandanavk vandanavk Jan 17, 2020

When using the multiprocessing approach, torch.distributed is not used (init_process_group is not called), so any reference to dist.get_rank() or dist.get_world_size() will error out.
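In other words, the guard keeps the multiprocessing path from touching uninitialized process-group state; roughly (sketch, the fallback value is an assumption):

```python
# Sketch: only query torch.distributed when init_process_group() was called.
import torch.distributed as dist

def safe_rank(default=0):
    if hasattr(dist, "is_initialized") and dist.is_initialized():
        return dist.get_rank()  # process group exists
    return default              # plain multiprocessing path: dist is unused
```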

Comment on lines +98 to +99

```python
if hasattr(dist, "is_initialized") and dist.is_initialized():
    assert hook._get_worker_name() == f"worker_{dist.get_rank()}"
```
Contributor

Same here?

Comment on lines +93 to +94

```python
if hasattr(dist, "is_initialized") and dist.is_initialized():
    average_gradients(model)
```
Contributor

reason?

@Vikas-kum
Contributor

@vandanavk updates?

@vandanavk vandanavk closed this Jan 28, 2020
@vandanavk
Contributor Author

In the last sprint meeting, we decided to modify the example to use torch.distributed instead of multiprocessing directly. The solution in this PR doesn't cover all scenarios, for example with `include_workers="one"`.
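For reference, a minimal sketch of what moving the example onto torch.distributed could look like; the backend, addressing, and world size are assumptions, not part of this PR:

```python
# Sketch: spawn workers with torch.multiprocessing and initialize a process
# group in each, so smdebug can read dist.get_rank()/dist.get_world_size().
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # assumed single-machine setup
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # ... training loop with the smdebug hook here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```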

leleamol added a commit that referenced this pull request Dec 8, 2020
* Bugfix: Invalid Worker (#139)

* smdistributed.dataparallel environment check

* addressed comments

* Modified check_smdataparallel_env logic

Co-authored-by: Nihal Harish <nihal42harish@gmail.com>
Co-authored-by: Karan Jariwala <karankjariwala@gmail.com>