
Training on distributed machine is slow. Using 8 Nvidia V100. #28

Open
dimeldo opened this issue Sep 2, 2019 · 8 comments

Comments

@dimeldo

dimeldo commented Sep 2, 2019

I'm using an AWS p3dn.24xlarge to train on my data with 8 Nvidia V100 GPUs, but training seems slower than on 1 GPU.

This is the config in train-horovod.py:

def train_main(dataset,
               model_name='345M',
               seed=None,
               batch_size=1,
               sample_length=1023,
               sample_num=1,
               sample_every=500,
               run_name='run1',
               restore_from='latest',
               save_every=1000,
               combine=50000):

Here is the output; as you can see, each step takes a long time. Trying to increase the batch size results in OOM.

[1 | 13.96] loss=3.12 avg=3.12
[2 | 16.30] loss=22.49 avg=12.85
[3 | 18.51] loss=8.58 avg=11.41
[4 | 20.70] loss=7.58 avg=10.44
[5 | 23.08] loss=7.59 avg=9.86
[6 | 25.48] loss=6.96 avg=9.36
[7 | 27.52] loss=6.34 avg=8.92
[8 | 29.85] loss=6.26 avg=8.58
[9 | 32.30] loss=5.86 avg=8.26
[10 | 34.31] loss=6.00 avg=8.02
[11 | 36.61] loss=5.78 avg=7.81
[12 | 38.94] loss=5.53 avg=7.61
[13 | 41.25] loss=5.32 avg=7.42
[14 | 43.69] loss=5.06 avg=7.24
[15 | 45.94] loss=6.06 avg=7.16
[16 | 48.34] loss=4.94 avg=7.01
[17 | 50.74] loss=5.16 avg=6.89
[18 | 53.10] loss=4.73 avg=6.76
[19 | 55.21] loss=4.54 avg=6.63
[20 | 57.56] loss=5.09 avg=6.55
[21 | 59.75] loss=4.66 avg=6.45
[22 | 62.22] loss=4.44 avg=6.35
[23 | 64.45] loss=4.40 avg=6.25
[24 | 66.68] loss=3.91 avg=6.14
[25 | 69.04] loss=3.79 avg=6.04
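
For context: train-horovod.py is data-parallel, so each of the 8 workers keeps a full copy of the 345M model, the effective batch size is batch_size × 8, and the per-step time above includes the gradient allreduce. Below is a minimal sketch of the Horovod pattern such a script is built on, not the repository's exact code, with a toy variable standing in for the model and assuming the TF 1.x Horovod API:

# Minimal sketch of data-parallel training with Horovod on TF 1.x.
# mpirun launches one process per GPU; a toy scalar stands in for GPT-2.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each rank to its own GPU so 8 processes use 8 distinct V100s.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

global_step = tf.train.get_or_create_global_step()
w = tf.get_variable("w", initializer=0.0)        # stand-in for the model weights
loss = tf.square(w - 3.0)                        # stand-in for the LM loss

opt = tf.train.AdamOptimizer(1e-4 * hvd.size())  # lr is often scaled by worker count
opt = hvd.DistributedOptimizer(opt)              # averages gradients across workers
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0),    # sync initial weights from rank 0
         tf.train.StopAtStepHook(last_step=100)]
with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)

So a per-step time somewhat above the single-GPU time can still mean roughly 8× the sequences per step; genuinely slower wall-clock progress usually points at the allreduce (NCCL, interconnect) or at ranks not being pinned to separate GPUs.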
@HansBambel

I think distributed fine-tuning is not possible currently. As you can see, it is labeled as "out-of-date". I think you are better off trying to fine-tune with the PyTorch implementation from PyTorch-Transformers.

@shamiul94

shamiul94 commented Jun 19, 2020

What command did you use to make it work? I am using an AWS ml.p3.8xlarge with four 16 GB V100 GPUs to train, but I am getting an OOM error. I am using this command:

mpirun -np 4 -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH=src -mca pml ob1 -mca btl ^openib python /home/ec2-user/SageMaker/gpt-2/train-horovod.py --dataset /home/ec2-user/SageMaker/gpt-2/src/Dataset/MB23.npz --model_name /home/ec2-user/SageMaker/gpt-2/src/models/345M --batch_size 1

I am assuming my GPUs are not enough to train this 345M model with even batch_size 1. Would using more GPUs help me, or is multi-GPU training just not possible?
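
One thing worth checking with that command is whether each of the four ranks actually ends up on its own GPU; if they all allocate on GPU 0, an OOM is expected even at batch_size 1. Here is a small diagnostic that can be launched with the same mpirun prefix (check_ranks.py is a hypothetical name; only standard Horovod/TF 1.x calls are used):

# check_ranks.py -- hypothetical helper: print which GPU each Horovod rank maps to.
# Launch with the same mpirun prefix, e.g.:
#   mpirun -np 4 -H localhost:4 ... python check_ranks.py
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
print("rank %d of %d -> local_rank %d (expected GPU %d)"
      % (hvd.rank(), hvd.size(), hvd.local_rank(), hvd.local_rank()))

# A training script normally pins each process like this; without it, every
# rank tries to allocate memory on the same device.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
config.gpu_options.allow_growth = True          # avoid grabbing all memory up front
with tf.Session(config=config) as sess:
    print("rank %d sees: %s" % (hvd.rank(), [d.name for d in sess.list_devices()]))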

@HansBambel

Would using more GPUs help me, or is multi-GPU training just not possible?

Are you talking about the PyTorch version? I was able to train the 345M version on a single V100.

@shamiul94

shamiul94 commented Jun 20, 2020

Are you talking about the PyTorch version?

No, I am using the code from this repository, which uses TensorFlow instead of PyTorch. Does the PyTorch version work better than the TensorFlow version? Also, which PyTorch version are you talking about? (Any link would be helpful.)

I was able to train the 345M version on a single V100.

Yes, I agree. I also tried to run the 345M model using train.py from this very repository, which also uses TensorFlow. It successfully ran on a single V100, but only with --batch_size 1; for a batch_size greater than 1, it failed. I am trying to find a way to increase the batch_size by using multiple GPUs.
I was surprised to see that although this model could be trained on a single V100, I got a ResourceExhausted error when trying it on multiple GPUs (4xV100). Shouldn't it be the opposite?
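
As a side note: batch_size in these scripts is per GPU, and data-parallel training keeps a full copy of the 345M model on every GPU, so adding GPUs does not free any memory; it only adds throughput. One way to get an effective batch size above 1 on a single V100 is to accumulate gradients over several micro-batches before applying them. A rough TF 1.x-style sketch of the idea (not code from this repository, with a toy variable standing in for the model):

# Rough sketch of gradient accumulation: run several size-1 micro-batches,
# then apply the averaged gradients once, for a larger effective batch size.
import tensorflow as tf

w = tf.get_variable("w", initializer=0.0)             # stand-in for the model weights
x = tf.placeholder(tf.float32, shape=[])              # stand-in for one micro-batch
loss = tf.square(w - x)

opt = tf.train.AdamOptimizer(1e-4)
grads_and_vars = opt.compute_gradients(loss)

# One accumulator per variable, plus ops to zero, add to, and apply them.
accums = [tf.Variable(tf.zeros_like(v), trainable=False) for _, v in grads_and_vars]
zero_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accums])
accum_op = tf.group(*[a.assign_add(g) for a, (g, _) in zip(accums, grads_and_vars)])
apply_op = opt.apply_gradients(
    [(a / 8.0, v) for a, (_, v) in zip(accums, grads_and_vars)])  # average of 8

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(zero_op)
    for i in range(8):                                # 8 micro-batches of size 1
        sess.run(accum_op, feed_dict={x: float(i)})
    sess.run(apply_op)                                # one optimizer step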

I have explained my issues in #53 and #52. It would be helpful if you could go through those two issues too. Thank you.

@HansBambel

I can recommend checking out Huggingface-transformers. When I was working with GPT-2 it was PyTorch-only, but they have since extended the repository to TensorFlow as well. There should be examples of people doing exactly what you are trying to do, too.
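
For reference, here is a rough sketch of what that can look like with their Trainer API (illustrative only: argument names vary across transformers versions, and train.txt stands in for your own data file):

# Rough sketch of fine-tuning GPT-2 medium with Huggingface transformers.
# Illustrative only; "train.txt" is a placeholder and API details vary by version.
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                          TrainingArguments, TextDataset,
                          DataCollatorForLanguageModeling)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=1024)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-medium-finetuned",
    per_device_train_batch_size=1,      # per GPU; Trainer uses all visible GPUs
    gradient_accumulation_steps=8,      # effective batch size of 8 per GPU
    num_train_epochs=1,
    fp16=True,                          # mixed precision helps a lot on V100
    save_steps=1000,
)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_dataset).train()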

Best of luck!

@dimeldo
Author

dimeldo commented Jun 20, 2020

Hi @shamiul94! It's written in my notes that I used this command:

mpirun --allow-run-as-root -np 8 -H localhost:8 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH=src -mca pml ob1 -mca btl ^openib train-horovod.py --dataset data/train.npz --val_dataset data/valid.npz

Also, as others said, a V100 can fit the gpt2-medium model with batch size 1.
I also recommend moving to Huggingface's code!

@shamiul94

Hi, @dimeldo! Huge thanks for your input!

  • Yes, but I would like to tweak the batch_size value too. That's why I am considering multi-GPU training. Is it possible to set batch_size higher than 1 if I use 8xV100? Or do I need more? Or is it not possible at all using nshepperd's codebase?
  • I will definitely look into Huggingface's code. Thanks for the suggestion!
  • I am quite new to this multi-GPU training arena. I was getting an OOM error while using 4xV100. Would it work if I use 8xV100? I am working on 345M. Which model were you working on?

@dimeldo
Author

dimeldo commented Jun 22, 2020

I can't quite remember. I think the 345M one. I can't remember if multi-GPU worked out alright in the end or not. Good luck in your research!
