
Added instructions and script for distributed training with Horovod #2

Merged — 3 commits merged into nshepperd:finetuning on Mar 19, 2019

Conversation

@tlkh commented Mar 18, 2019

Added a script to show how to use Horovod to do multi-GPU or distributed training across machines.

Also added instructions and sample command in README.md

In addition, fixed a small typo in the README training command (./train should be ./train.py).
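For reference, the core Horovod pattern a script like this follows is sketched below. This is a minimal illustration, not the PR's actual train-horovod.py; build_model stands in for the GPT-2 loss graph and the learning rate is arbitrary.

```python
# Minimal Horovod + TensorFlow 1.x training sketch (illustrative only).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                            # one process per GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())  # pin each rank to its GPU

loss = build_model()                                  # placeholder for the GPT-2 loss graph
opt = tf.train.AdamOptimizer(1e-4 * hvd.size())       # optionally scale LR with world size
opt = hvd.DistributedOptimizer(opt)                   # allreduce gradients across ranks
train_op = opt.minimize(loss)

bcast = hvd.broadcast_global_variables(0)             # start all ranks from rank 0's weights
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(bcast)
    for step in range(1000):
        sess.run(train_op)
```

Such a script is launched with mpirun, one process per GPU or host slot, as in the sample command added to the README (and the one quoted by @shamiul94 below).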

@nshepperd merged commit ef62678 into nshepperd:finetuning on Mar 19, 2019
@nshepperd (Owner) commented:

Awesome, thanks!

@XinyuDu commented Jun 21, 2019


Please consider using memory_saving_gradients for retraining 345M with Horovod. Thanks.
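For anyone attempting this, here is a hedged sketch of how the fork's memory_saving_gradients module might be wired into a Horovod run. It assumes `loss` is the language-model loss already built in the graph, that memory_saving_gradients.py is importable, and that checkpoint tensors are registered the way the fork's model code expects; the allreduce is done by hand because hvd.DistributedOptimizer's own gradient computation is bypassed.

```python
# Hedged sketch: gradient checkpointing combined with a manual Horovod allreduce.
import tensorflow as tf
import horovod.tensorflow as hvd
import memory_saving_gradients

train_vars = tf.trainable_variables()
# Drop-in replacement for tf.gradients that recomputes activations during the
# backward pass instead of storing them, trading compute for memory.
grads = memory_saving_gradients.gradients(loss, train_vars)
# Average the checkpointed gradients across all Horovod ranks.
grads = [hvd.allreduce(g) for g in grads]
opt = tf.train.AdamOptimizer(learning_rate=1e-4)
train_op = opt.apply_gradients(zip(grads, train_vars))
```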

@shamiul94 commented Jun 19, 2020

I tried to run train-horovod.py on a custom dataset using this command:

mpirun -np 4 -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH=src -mca pml ob1 -mca btl ^openib python /home/ec2-user/SageMaker/gpt-2/train-horovod.py --dataset /home/ec2-user/SageMaker/gpt-2/src/Dataset/data.npz --model_name /home/ec2-user/SageMaker/gpt-2/src/models/345M --batch_size 1

This produced an OOM error: it could not allocate tensors of shape [12240, 840] and similar. I was using four 16 GB V100 GPUs on AWS SageMaker (ml.p3.8xlarge). What is going wrong here? Am I not using enough GPUs, or is it not possible to train 345M on multiple GPUs?
I explained this problem in issue #53.
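One likely reason more GPUs do not help: Horovod is data-parallel, so every rank holds a full replica of the 345M model. A back-of-envelope estimate (assuming fp32 weights and the Adam optimizer) of the fixed per-GPU cost, before activations are counted:

```python
# Rough per-GPU memory for 345M parameters, fp32, Adam optimizer.
params = 345e6
bytes_per_value = 4                      # fp32
copies = 4                               # weights + gradients + 2 Adam moment buffers
static_bytes = params * bytes_per_value * copies
print(f"~{static_bytes / 2**30:.1f} GiB before activations")   # ~5.1 GiB
# Activations for a 1024-token context add a further multi-GiB cost per replica,
# so a 16 GiB V100 can run out even at batch size 1; gradient checkpointing
# (memory_saving_gradients, mentioned above) is the usual workaround.
```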

nshepperd added a commit that referenced this pull request Oct 31, 2022
Added instructions and script for distributed training with Horovod