Make all Transformer models compatible with model parallelism #22561

Closed
41 tasks done
sgugger opened this issue Apr 4, 2023 · 27 comments · Fixed by #22703
Comments

@sgugger
Collaborator

sgugger commented Apr 4, 2023

Accelerate makes it easy to load a model on multiple GPUs with device_map="auto". This in turn allows users to train models with naive model parallelism if they have several GPUs.
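
For context, a minimal sketch of such a multi-GPU load (the checkpoint name is only an example; accelerate must be installed):

```python
# Sketch: shard a model across all available GPUs via Accelerate.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-large", device_map="auto")
print(model.hf_device_map)  # shows which device each submodule was placed on
```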

A problem that happens in Transformers with models with heads (so not XxxModel but, for instance, XxxModelForSequenceClassification) is that the labels end up on a different device than the logits, and there is a device mismatch error.

Thankfully, there is an easy fix for that! #22535 shows how to fix this for T5 by just moving the labels to the same device as the logits they are compared to. This is a no-op when the devices are the same, and fixes the issue when the devices are different.
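
For reference, a minimal sketch of the pattern (the compute_lm_loss helper is hypothetical; in the merged PRs the single labels.to(...) line is added inline in each model head's forward()):

```python
import torch
from torch.nn import CrossEntropyLoss

def compute_lm_loss(lm_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Move the labels to the same device as the logits they are compared to.
    # This is a no-op when both tensors already share a device, and avoids the
    # mismatch when naive model parallelism places the final layers on another GPU.
    labels = labels.to(lm_logits.device)
    loss_fct = CrossEntropyLoss()
    return loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))
```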

We would like help from the community to extend this to all models that support model parallelism, which are:

  • BART
  • BigBirdPegasus
  • BLIP2
  • BLOOM
  • BridgeTower
  • CamemBERT
  • CLIP
  • CLIPSeg
  • CodeGen
  • Data2Vec Text
  • Deit
  • ESM
  • GPT-2
  • GPT-Neo
  • GPT-NeoX
  • GPT-NeoX Japanese
  • GPT-J
  • GPT-San
  • JukeBox
  • Lilt
  • LLaMA (LlamaForSequenceClassification only)
  • Longformer
  • LongT5
  • Luke
  • M2M100
  • mBART
  • mT5
  • NLLB
  • OPT
  • Owl-ViT
  • Pix2Struct
  • PLBART
  • RoBERTa
  • RoBERTa PreLayerNorm
  • SwitchTransformer
  • T5
  • Vilt
  • ViT
  • ViT-Hybrid
  • Whisper
  • XLM-RoBERTa

If you would like to grab one of those models and apply the same fix as #22535 to all the models with heads, please leave a comment here!

@muaid-mughrabi

I think I can help with this Issue :)

@iamarunbrahma
Contributor

iamarunbrahma commented Apr 4, 2023

I would like to work on this issue - BART model :)

@kausmeows
Contributor

Hi, I can take this up 🙌🏻

@zsc

zsc commented Apr 5, 2023

Indeed, this fix is required for BLOOM: main...zsc:transformers:main (my fix is hacky and not PR-ready, just FYI).

@TerryCM

TerryCM commented Apr 5, 2023

Just to make sure, does LlamaForCausalLM support this feature already? (#22546) It seems that there are still some errors when using device_map="auto" for this task.

@pmollerus23
Contributor

Hi, I'd like to pick up the GPT-2 model!

@xssChauhan
Contributor

Hi! I am taking this up for LlamaForSequenceClassification.

@kooshi
Contributor

kooshi commented Apr 6, 2023

Just to make sure, does LlamaForCausalLM support this feature already? (#22546) It seems that there are still some errors when using device_map="auto" for this task.

It does (#22329). I have started seeing errors similar to #22546, but only after updating my drivers from 525 to 530, similar to #22546 (comment).

(which is good news to me, I had no idea why that gpu started disappearing occasionally. It seems it can happen when that gpu is under any load, not just during training)

Edit: seems like the errors I was getting were actually caused by GPU sag. I haven't yet reproduced that exact error, but it has been reported elsewhere. It is certainly not consistent though.

@innat

innat commented Apr 7, 2023

@younesbelkada @sgugger
Is this fix (moving the labels to the same device as the logits) supposed to make model parallelism work for all the models listed above, or is it just a crucial step toward it? Also, is this fix only for PyTorch models and not for JAX or TF?

@younesbelkada
Contributor

I think it is supposed to work for all the models listed above, as long as you are loading your model with device_map=xxx. And yes, this should be for PyTorch only, though I am not really aware of how model parallelism works on TF & JAX.

@innat

innat commented Apr 7, 2023

I think it is supposed to work for all the models listed above, as long as you are loading your model with device_map=xxx.

I tried such a fix here #22591 (comment) but sadly it didn't work out. Any catch?

@innat

innat commented Apr 8, 2023

@sgugger
As the goal of this ticket is to enable model parallelism with an easy fix, have the merged PR(s) been checked on multi-GPU? I couldn't find any test script regarding that in #22663.

@shahad-mahmud
Contributor

I would love to work on BridgeTower.

@trantuantdt

Hi. I would like to try with "Whisper"

Asugawara added a commit to Asugawara/transformers that referenced this issue Apr 10, 2023
@pmollerus23
Contributor

I'd like to claim OPT model if no one else has picked it up.

@mayankagarwals
Contributor

Taking this up for the remaining GPT models

@jprivera44
Contributor

jprivera44 commented Apr 11, 2023

Hello, I just completed the GPT-J code. Just filling in the PR now.

@oscar-garzon
Contributor

Hello! I'd like to work on the Whisper model.

@sgugger sgugger reopened this Apr 14, 2023
@abhigyan631

Hi, is there any model on which I can work, please? Thanks.

@Tanmaypatil123
Contributor

Is there any remaining model on which I can work? Thanks.

@JuheonChu
Contributor

@sgugger Hello, can I work on the JukeBox?

@elabongaatuo
Contributor

elabongaatuo commented Apr 18, 2023

Hello @sgugger , I'd like to work on m2m100

@Batese2001
Contributor

@sgugger I would love to work on CodeGen if it is unclaimed

@katiele47
Contributor

Hi @sgugger I can work on Luke if it has not been taken

@VomV

VomV commented Apr 23, 2023

@sgugger I would like to work on SwitchTransformer, if not taken.

@sushmanthreddy
Contributor

sushmanthreddy commented Apr 25, 2023

@sgugger I think all the models are covered; I have checked the others as well. For example, SwitchTransformer already has parallelism implemented. I think we can close this issue. The only pending models are CLIP, JukeBox, Owl-ViT, and NLLB; maybe model parallelism is not applicable to some of these models.

@sgugger
Collaborator Author

sgugger commented Apr 25, 2023

Indeed, all models have been covered. Thanks a lot everyone!

@sgugger sgugger closed this as completed Apr 25, 2023
novice03 pushed a commit to novice03/transformers that referenced this issue Jun 23, 2023
* add GPTNeoXForSequenceClassification

* move the labels to logits.device (ref: huggingface#22561)

* fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment