
Training on distributed machine is slow. Using 8 Nvidia V100. #28

Open
dimeldo opened this issue Sep 2, 2019 · 8 comments

Comments

@dimeldo

dimeldo commented Sep 2, 2019

I'm using an AWS p3dn.24xlarge to train on my data with 8 Nvidia V100 GPUs, but training seems slower than on 1 GPU.

This is the config in train-horovod.py:

def train_main(dataset,
               model_name='345M',
               seed=None,
               batch_size=1,
               sample_length=1023,
               sample_num=1,
               sample_every=500,
               run_name='run1',
               restore_from='latest',
               save_every=1000,
               combine=50000):

Here is the output; as you can see, each step takes a long time. Trying to increase the batch size results in OOM.

[1 | 13.96] loss=3.12 avg=3.12
[2 | 16.30] loss=22.49 avg=12.85
[3 | 18.51] loss=8.58 avg=11.41
[4 | 20.70] loss=7.58 avg=10.44
[5 | 23.08] loss=7.59 avg=9.86
[6 | 25.48] loss=6.96 avg=9.36
[7 | 27.52] loss=6.34 avg=8.92
[8 | 29.85] loss=6.26 avg=8.58
[9 | 32.30] loss=5.86 avg=8.26
[10 | 34.31] loss=6.00 avg=8.02
[11 | 36.61] loss=5.78 avg=7.81
[12 | 38.94] loss=5.53 avg=7.61
[13 | 41.25] loss=5.32 avg=7.42
[14 | 43.69] loss=5.06 avg=7.24
[15 | 45.94] loss=6.06 avg=7.16
[16 | 48.34] loss=4.94 avg=7.01
[17 | 50.74] loss=5.16 avg=6.89
[18 | 53.10] loss=4.73 avg=6.76
[19 | 55.21] loss=4.54 avg=6.63
[20 | 57.56] loss=5.09 avg=6.55
[21 | 59.75] loss=4.66 avg=6.45
[22 | 62.22] loss=4.44 avg=6.35
[23 | 64.45] loss=4.40 avg=6.25
[24 | 66.68] loss=3.91 avg=6.14
[25 | 69.04] loss=3.79 avg=6.04
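
For context: train-horovod.py is data-parallel, so each of the 8 workers keeps a full copy of the 345M model, the effective batch size is batch_size × 8, and the per-step time above includes the gradient allreduce. Below is a minimal sketch of the Horovod pattern such a script is built on, not the repository's exact code, with a toy variable standing in for the model and assuming the TF 1.x Horovod API:

# Minimal sketch of data-parallel training with Horovod on TF 1.x.
# mpirun launches one process per GPU; a toy scalar stands in for GPT-2.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each rank to its own GPU so 8 processes use 8 distinct V100s.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

global_step = tf.train.get_or_create_global_step()
w = tf.get_variable("w", initializer=0.0)        # stand-in for the model weights
loss = tf.square(w - 3.0)                        # stand-in for the LM loss

opt = tf.train.AdamOptimizer(1e-4 * hvd.size())  # lr is often scaled by worker count
opt = hvd.DistributedOptimizer(opt)              # averages gradients across workers
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0),    # sync initial weights from rank 0
         tf.train.StopAtStepHook(last_step=100)]
with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)

So a per-step time somewhat above the single-GPU time can still mean roughly 8× the sequences per step; genuinely slower wall-clock progress usually points at the allreduce (NCCL, interconnect) or at ranks not being pinned to separate GPUs.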
@HansBambel

I think distributed fine-tuning is not possible currently. As you can see, it is labeled as "out-of-date". I think you are better off trying to fine-tune with the PyTorch implementation from PyTorch-Transformers.

@shamiul94

shamiul94 commented Jun 19, 2020

What command did you use to make it work? I am using an AWS ml.p3.8xlarge with four 16 GB V100 GPUs to train, but I am getting an OOM error. I am using this command:

mpirun -np 4 -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH=src -mca pml ob1 -mca btl ^openib python /home/ec2-user/SageMaker/gpt-2/train-horovod.py --dataset /home/ec2-user/SageMaker/gpt-2/src/Dataset/MB23.npz --model_name /home/ec2-user/SageMaker/gpt-2/src/models/345M --batch_size 1

I am assuming my GPUs are not enough to train this 345M model with even batch_size 1. Would using more GPUs help me, or is multi-GPU training just not possible?
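
One thing worth checking with that command is whether each of the four ranks actually ends up on its own GPU; if they all allocate on GPU 0, an OOM is expected even at batch_size 1. Here is a small diagnostic that can be launched with the same mpirun prefix (check_ranks.py is a hypothetical name; only standard Horovod/TF 1.x calls are used):

# check_ranks.py -- hypothetical helper: print which GPU each Horovod rank maps to.
# Launch with the same mpirun prefix, e.g.:
#   mpirun -np 4 -H localhost:4 ... python check_ranks.py
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
print("rank %d of %d -> local_rank %d (expected GPU %d)"
      % (hvd.rank(), hvd.size(), hvd.local_rank(), hvd.local_rank()))

# A training script normally pins each process like this; without it, every
# rank tries to allocate memory on the same device.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
config.gpu_options.allow_growth = True          # avoid grabbing all memory up front
with tf.Session(config=config) as sess:
    print("rank %d sees: %s" % (hvd.rank(), [d.name for d in sess.list_devices()]))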

@HansBambel

Would using more GPUs help me, or is multi-GPU training just not possible?

Are you talking about the PyTorch version? I was able to train the 345M version on a single V100.

@shamiul94

shamiul94 commented Jun 20, 2020

Are you talking about the PyTorch version?

No, I am using the code from this repository, which uses TensorFlow instead of PyTorch. Does the PyTorch version work better than the TensorFlow version? Also, which PyTorch version are you talking about? (Any link would be helpful.)

I was able to train the 345M version on a single V100.

Yes, I agree. I also tried to run the 345M model using train.py from this very repository, which also uses TensorFlow. It successfully ran on a single V100, but only with --batch_size 1; for a batch_size greater than 1, it failed. I am trying to find a way to increase the batch_size by using multiple GPUs.
I was surprised to see that although this model could be trained on a single V100, I got a ResourceExhausted error when trying it on multiple GPUs (4xV100). Shouldn't it be the opposite?
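
As a side note: batch_size in these scripts is per GPU, and data-parallel training keeps a full copy of the 345M model on every GPU, so adding GPUs does not free any memory; it only adds throughput. One way to get an effective batch size above 1 on a single V100 is to accumulate gradients over several micro-batches before applying them. A rough TF 1.x-style sketch of the idea (not code from this repository, with a toy variable standing in for the model):

# Rough sketch of gradient accumulation: run several size-1 micro-batches,
# then apply the averaged gradients once, for a larger effective batch size.
import tensorflow as tf

w = tf.get_variable("w", initializer=0.0)             # stand-in for the model weights
x = tf.placeholder(tf.float32, shape=[])              # stand-in for one micro-batch
loss = tf.square(w - x)

opt = tf.train.AdamOptimizer(1e-4)
grads_and_vars = opt.compute_gradients(loss)

# One accumulator per variable, plus ops to zero, add to, and apply them.
accums = [tf.Variable(tf.zeros_like(v), trainable=False) for _, v in grads_and_vars]
zero_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accums])
accum_op = tf.group(*[a.assign_add(g) for a, (g, _) in zip(accums, grads_and_vars)])
apply_op = opt.apply_gradients(
    [(a / 8.0, v) for a, (_, v) in zip(accums, grads_and_vars)])  # average of 8

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(zero_op)
    for i in range(8):                                # 8 micro-batches of size 1
        sess.run(accum_op, feed_dict={x: float(i)})
    sess.run(apply_op)                                # one optimizer step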

I have explained my issues in #53 and #52. It would be helpful if you could go through those two issues too. Thank you.

@HansBambel

I can recommend checking out Huggingface-transformers. When I was working with GPT-2 it was PyTorch-only, but they have since extended the repository to TensorFlow as well. There should be examples of people doing exactly what you are trying to do, too.
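
For reference, here is a rough sketch of what that can look like with their Trainer API (illustrative only: argument names vary across transformers versions, and train.txt stands in for your own data file):

# Rough sketch of fine-tuning GPT-2 medium with Huggingface transformers.
# Illustrative only; "train.txt" is a placeholder and API details vary by version.
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                          TrainingArguments, TextDataset,
                          DataCollatorForLanguageModeling)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=1024)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-medium-finetuned",
    per_device_train_batch_size=1,      # per GPU; Trainer uses all visible GPUs
    gradient_accumulation_steps=8,      # effective batch size of 8 per GPU
    num_train_epochs=1,
    fp16=True,                          # mixed precision helps a lot on V100
    save_steps=1000,
)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_dataset).train()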

Best of luck!

@dimeldo
Author

dimeldo commented Jun 20, 2020

Hi @shamiul94! It's written in my notes that I used this command:

mpirun --allow-run-as-root -np 8 -H localhost:8 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH=src -mca pml ob1 -mca btl ^openib train-horovod.py --dataset data/train.npz --val_dataset data/valid.npz

Also, as others said, a V100 can fit the gpt2-medium model with batch size 1.
I also recommend moving to Huggingface's code!

@shamiul94

Hi, @dimeldo! Huge thanks for your input!

  • Yes, but I would like to tweak the batch_size value too. That's why I am considering multi-GPU training. Is it possible to set batch_size higher than 1 if I use 8xV100? Or do I need more? Or is it not possible at all using nshepperd's codebase?
  • I will definitely look into Huggingface's code. Thanks for the suggestion!
  • I am quite new to this multi-GPU training arena. I was getting an OOM error while using 4xV100. Would it work if I use 8xV100? I am working on 345M. Which model were you working on?

@dimeldo
Author

dimeldo commented Jun 22, 2020

I can't quite remember. I think the 345M one. I can't remember if multi-GPU worked out alright in the end or not. Good luck in your research!
