T5-11b model parallelism #7047
Here it is:
As I don't have several GPUs at the moment, I tried to run it on CPU (see line 145 in the error stack).
@patrickvonplaten, the following should be interesting. I have been in contact with them; they were planning to release it as open source several months back but ran into some issues with Microsoft internals. I heard the author is planning to release the open source themselves. Can anyone work with them? Cheers.
That does look interesting. Thanks for sharing! I'm not sure if we are planning on working with the author - but feel free to reach out to him and maybe this can help resolve the T5 model parallelism.
Hello, guys. The point is: the transformer blocks (T5Block) are the largest parts of the network. The first step is to spread them evenly across all GPUs. In the second step we spread across the GPUs all the other blocks of the transformer, which are incomparably smaller than the main blocks. There are also some modifications to the original model code so that tensors move to the necessary GPU when an incoming tensor and a layer are on different devices - roughly as in the sketch below.
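This isn't the actual script (linked elsewhere in the thread), just a minimal sketch of the two steps under a plain PyTorch/transformers setup; the `MoveInputs` wrapper name is illustrative, and a real implementation needs more plumbing (attention masks, loss placement, generation):

```python
import torch
from torch import nn
from transformers import T5ForConditionalGeneration

class MoveInputs(nn.Module):
    """Wrap a block so incoming tensors are moved onto the block's device."""
    def __init__(self, block, device):
        super().__init__()
        self.block = block.to(device)
        self.device = device

    def forward(self, *args, **kwargs):
        args = tuple(a.to(self.device) if torch.is_tensor(a) else a for a in args)
        kwargs = {k: v.to(self.device) if torch.is_tensor(v) else v
                  for k, v in kwargs.items()}
        return self.block(*args, **kwargs)

model = T5ForConditionalGeneration.from_pretrained("t5-large")
n_gpus = torch.cuda.device_count()

# Step 1: spread the T5Blocks (the bulk of the parameters) across the GPUs
# in contiguous chunks, so activations cross devices as rarely as possible.
for stack in (model.encoder, model.decoder):
    n_blocks = len(stack.block)
    for i in range(n_blocks):
        stack.block[i] = MoveInputs(stack.block[i], f"cuda:{i * n_gpus // n_blocks}")

# Step 2: place the remaining, much smaller modules. Embeddings go with the
# first blocks; the final layer norms and LM head go with the last blocks.
model.shared.to("cuda:0")  # tied with encoder/decoder embed_tokens
model.encoder.final_layer_norm.to(f"cuda:{n_gpus - 1}")
model.decoder.final_layer_norm.to(f"cuda:{n_gpus - 1}")
model.lm_head.to(f"cuda:{n_gpus - 1}")
```

With this layout, `input_ids` should be placed on `cuda:0` and `labels` on the last device before calling the model.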
It seems at the beginning of our graph we have a large block whose size is comparable to a T5Block. The smarter way would be to split the layers according to their memory usage, but I don't know a simple way to find out how much memory every module uses. What do you think about this?
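On knowing how much memory each module uses: there is no single built-in answer, but parameter memory at least is easy to estimate by summing tensor sizes. A rough sketch (weights only - it ignores activations, gradients, and optimizer state):

```python
import torch
from transformers import T5ForConditionalGeneration

def param_bytes(module: torch.nn.Module) -> int:
    """Parameter memory of a module: number of elements times element size."""
    return sum(p.numel() * p.element_size() for p in module.parameters())

model = T5ForConditionalGeneration.from_pretrained("t5-large")
for name, child in model.named_children():
    print(f"{name}: {param_bytes(child) / 2**20:.1f} MiB")

# Per-block breakdown, useful for splitting by memory rather than block count:
for i, block in enumerate(model.encoder.block):
    print(f"encoder.block.{i}: {param_bytes(block) / 2**20:.1f} MiB")
```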
I tested this script on a machine with 8x32GB GPUs and have seen the same symptoms: the first GPU's memory gets fully loaded while the other GPUs consume around 5 gigabytes each.
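To see this imbalance from inside the script rather than via nvidia-smi, PyTorch's per-device memory counters can be printed after a forward pass - a small sketch:

```python
import torch

# Report what PyTorch has allocated and reserved on each visible GPU.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"cuda:{i}: {allocated:.1f} GiB allocated, {reserved:.1f} GiB reserved")
```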
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Hi @exelents, I also need model parallelism for T5 and your code should be very helpful. However, the link to your code seems invalid. Could you please share the code with me? Best,
Hello, @LostBenjamin. Also, you can try DeepSpeed: |
Hi @exelents, thanks for your help! I will try the MP in the transformers library.
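For later readers: the model parallelism referred to here shipped in transformers as an experimental `parallelize()` method on T5 (since deprecated in favor of loading with a device map). A minimal sketch, assuming 4 GPUs and the 24-block t5-3b:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-3b")

# Map GPU ids to the block indices they should hold; calling parallelize()
# with no arguments splits the blocks evenly instead.
device_map = {
    0: list(range(0, 6)),  # blocks 0-5 on cuda:0
    1: list(range(6, 12)),
    2: list(range(12, 18)),
    3: list(range(18, 24)),
}
model.parallelize(device_map)
```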
🚀 Feature request
I would like to fine-tune the t5-11b model on my dataset, but found that it doesn't fit in TPU or GPU memory - the Colab notebook just crashes when I run it.
I tried to find a ready-made model parallelism solution. First I found this PR:
#3578
but it seems it hasn't been released. I tried to merge it into the master branch locally and use it, but it crashed.
I also found the Eisen library, which proposes "model parallelism with one line of code", but it works only for models with a single input (T5 has 2 inputs - tokens and mask).
I need to distribute the model across several GPUs, and I see somebody has tried to do this. If this development (pull request 3578) is still in progress, can you tell me whether there are any plans to release it?