
Added instructions and script for distributed training with Horovod #2

Merged — 3 commits merged into nshepperd:finetuning on Mar 19, 2019

Conversation

@tlkh commented Mar 18, 2019

Added a script to show how to use Horovod to do multi-GPU or distributed training across machines.

Also added instructions and sample command in README.md

In addition, fixed a small typo in the README training command (./train should be ./train.py).
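For reference, the core Horovod pattern a script like this follows is sketched below. This is a minimal illustration, not the PR's actual train-horovod.py; build_model stands in for the GPT-2 loss graph and the learning rate is arbitrary.

```python
# Minimal Horovod + TensorFlow 1.x training sketch (illustrative only).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                            # one process per GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())  # pin each rank to its GPU

loss = build_model()                                  # placeholder for the GPT-2 loss graph
opt = tf.train.AdamOptimizer(1e-4 * hvd.size())       # optionally scale LR with world size
opt = hvd.DistributedOptimizer(opt)                   # allreduce gradients across ranks
train_op = opt.minimize(loss)

bcast = hvd.broadcast_global_variables(0)             # start all ranks from rank 0's weights
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(bcast)
    for step in range(1000):
        sess.run(train_op)
```

Such a script is launched with mpirun, one process per GPU or host slot, as in the sample command added to the README (and the one quoted by @shamiul94 below).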

@nshepperd merged commit ef62678 into nshepperd:finetuning on Mar 19, 2019
@nshepperd (Owner) commented:

Awesome, thanks!

@XinyuDu commented Jun 21, 2019


Please consider using memory_saving_gradients for retraining 345M with Horovod. Thanks.
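For anyone attempting this, here is a hedged sketch of how the fork's memory_saving_gradients module might be wired into a Horovod run. It assumes `loss` is the language-model loss already built in the graph, that memory_saving_gradients.py is importable, and that checkpoint tensors are registered the way the fork's model code expects; the allreduce is done by hand because hvd.DistributedOptimizer's own gradient computation is bypassed.

```python
# Hedged sketch: gradient checkpointing combined with a manual Horovod allreduce.
import tensorflow as tf
import horovod.tensorflow as hvd
import memory_saving_gradients

train_vars = tf.trainable_variables()
# Drop-in replacement for tf.gradients that recomputes activations during the
# backward pass instead of storing them, trading compute for memory.
grads = memory_saving_gradients.gradients(loss, train_vars)
# Average the checkpointed gradients across all Horovod ranks.
grads = [hvd.allreduce(g) for g in grads]
opt = tf.train.AdamOptimizer(learning_rate=1e-4)
train_op = opt.apply_gradients(zip(grads, train_vars))
```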

@shamiul94 commented Jun 19, 2020

I tried to run train-horovod.py on a custom dataset using this command:

mpirun -np 4 -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH=src -mca pml ob1 -mca btl ^openib python /home/ec2-user/SageMaker/gpt-2/train-horovod.py --dataset /home/ec2-user/SageMaker/gpt-2/src/Dataset/data.npz --model_name /home/ec2-user/SageMaker/gpt-2/src/models/345M --batch_size 1

This produced an OOM error: it could not allocate tensors of shape [12240, 840] and similar. I was using four 16 GB V100 GPUs on AWS SageMaker (ml.p3.8xlarge). What is going wrong here? Am I not using enough GPUs, or is it not possible to train 345M on multiple GPUs?
I explained this problem in issue #53.
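One likely reason more GPUs do not help: Horovod is data-parallel, so every rank holds a full replica of the 345M model. A back-of-envelope estimate (assuming fp32 weights and the Adam optimizer) of the fixed per-GPU cost, before activations are counted:

```python
# Rough per-GPU memory for 345M parameters, fp32, Adam optimizer.
params = 345e6
bytes_per_value = 4                      # fp32
copies = 4                               # weights + gradients + 2 Adam moment buffers
static_bytes = params * bytes_per_value * copies
print(f"~{static_bytes / 2**30:.1f} GiB before activations")   # ~5.1 GiB
# Activations for a 1024-token context add a further multi-GiB cost per replica,
# so a 16 GiB V100 can run out even at batch size 1; gradient checkpointing
# (memory_saving_gradients, mentioned above) is the usual workaround.
```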

nshepperd added a commit that referenced this pull request Oct 31, 2022
Added instructions and script for distributed training with Horovod