
[QUESTION] Splitting big models over multiple GPUs #207

Open
zouharvi opened this issue Mar 5, 2024 · 6 comments
Labels
question Further information is requested

Comments


zouharvi commented Mar 5, 2024

When specifying the number of GPUs during inference, is this only for data parallelism, or is the model loaded piece-wise across multiple GPUs when it is bigger than a single GPU's memory? For example, I'd like to use XCOMET-XXL, and our cluster has many 12 GB GPUs.

At first I thought that the model parts would be loaded onto all GPUs, e.g.:

comet-score -s data/xcomet_ennl.src -t data/xcomet_ennl_T1.tgt --gpus 5 --model "Unbabel/XCOMET-XL"

However, I'm getting a GPU OOM error on the first GPU:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB. GPU 0 has a total capacity of 10.75 GiB of which 11.62 MiB is free. ...
  1. Is it correct that in the above setting the model is being loaded in full 5 times on all 5 GPUs?
  2. Is there a way to split the model over multiple GPUs?

Thank you!

  • unbabel-comet 2.2.1
  • pytorch-lightning 2.2.0.post0
  • torch 2.2.1
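
For context, "splitting the model over multiple GPUs" would look roughly like the sketch below when done outside the comet CLI, using Hugging Face Accelerate's device_map="auto" on the bare underlying encoder (facebook/xlm-roberta-xl). This is only a hypothetical illustration: it is not what comet-score --gpus N does, it leaves out COMET's estimator head, and it assumes the installed transformers/accelerate versions can auto-shard this architecture.

# Hypothetical sketch: shard only the underlying XLM-R XL encoder over all
# visible GPUs with device_map="auto" (requires `accelerate` to be installed).
# This is NOT the comet-score behaviour and does not include COMET's head.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xl")
model = AutoModel.from_pretrained(
    "facebook/xlm-roberta-xl",
    device_map="auto",          # place different layers on different GPUs
    torch_dtype=torch.float16,  # halve the per-GPU memory footprint
)
print(model.hf_device_map)      # shows which layers landed on which GPU

inputs = tokenizer("Hello world", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state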
zouharvi added the question label Mar 5, 2024

zwhe99 commented Mar 14, 2024

same question here

ricardorei (Collaborator) commented

Last time I checked, this was not very easy to do with pytorch-lightning.

We actually used a custom-made FSDP implementation to train these larger models (without pytorch-lightning). I have to double-check whether the new versions support FSDP better than the currently used pytorch-lightning version (2.2.0.post0).

But the short answer is: model parallelism is not something we support in the current codebase.
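
For anyone who wants to experiment in the meantime, a generic plain-PyTorch FSDP sketch (not COMET's internal custom implementation, which isn't shown in this thread) could look like the following; the checkpoint name and script name are placeholders.

# Generic FSDP sketch, launched with e.g.:  torchrun --nproc_per_node=4 fsdp_sketch.py
# Parameters are sharded across ranks so no single GPU holds the full model.
import functools

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
from transformers import AutoModel

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = AutoModel.from_pretrained("xlm-roberta-large")  # placeholder checkpoint
    model = FSDP(
        model,
        device_id=rank,
        # Wrap submodules above ~100M parameters so shards are gathered
        # layer by layer instead of all at once.
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=100_000_000
        ),
    )
    # ... training or scoring loop would go here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()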

vince62s commented

An idea here: CTranslate2 just integrated tensor parallelism. It also supports XLM-RoBERTa, so I'm wondering if we could adapt the converter a bit so that the model could run within CT2, which is very fast.
How different is it from XLM-RoBERTa at inference?
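
Roughly what that route could look like, as a sketch only: it assumes CTranslate2's Transformers converter and Encoder API cover this architecture, uses the base xlm-roberta checkpoint as a stand-in for the XL/XXL encoders, and leaves COMET's regression head (which would still have to run elsewhere) out of the picture. Tensor parallelism would additionally require a multi-process MPI-style launch.

# Sketch of the CTranslate2 route for the bare encoder only (no COMET head).
# Checkpoint and directory names are placeholders.
import numpy as np
import ctranslate2
import transformers

hf_name = "xlm-roberta-base"   # stand-in; XCOMET builds on the XL/XXL variants
ct2_dir = "xlm-roberta-ct2"

# 1) Convert the Hugging Face checkpoint into a CTranslate2 model directory.
ctranslate2.converters.TransformersConverter(hf_name).convert(ct2_dir, force=True)

# 2) Run the converted encoder.
tokenizer = transformers.AutoTokenizer.from_pretrained(hf_name)
encoder = ctranslate2.Encoder(ct2_dir, device="cpu")

tokens = [tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world"))]
output = encoder.forward_batch(tokens)
last_hidden = np.array(output.last_hidden_state)  # (batch, tokens, hidden) on CPU
print(last_hidden.shape)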

ricardorei (Collaborator) commented

Does it support XLM-R XL? The architecture also differs from XLM-R.

ricardorei (Collaborator) commented

It seems like they have actually improved the documentation a lot: https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/fsdp.html
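
If that route is taken, enabling it through Lightning would presumably boil down to something like the sketch below (the module and datamodule names are placeholders, not the actual COMET classes):

# Sketch: turning on FSDP through the Lightning Trainer so model shards are
# spread over the available GPUs. Class names below are placeholders.
import pytorch_lightning as pl
from pytorch_lightning.strategies import FSDPStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                 # number of GPUs to shard over
    strategy=FSDPStrategy(),   # or simply strategy="fsdp"
    precision="16-mixed",
)
# trainer.fit(MyCometModule(), datamodule=my_datamodule)   # placeholders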

vince62s commented

> Does it support XLM-R XL? The architecture also differs from XLM-R.

We can adapt it if we have a detailed description somewhere.
cc @minhthuc2502
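
For reference, the XL/XXL encoders ship under their own model classes in transformers (separate from XLMRobertaModel), so a first step for a converter adaptation could be diffing the two configurations; a minimal sketch using the published Hub checkpoints:

# Compare the base and XL configurations straight from the Hub (config-only
# download, no weights). Useful as a starting point for a converter adaptation.
from transformers import AutoConfig

base = AutoConfig.from_pretrained("xlm-roberta-large")      # XLMRobertaConfig
xl = AutoConfig.from_pretrained("facebook/xlm-roberta-xl")  # XLMRobertaXLConfig

for cfg in (base, xl):
    print(cfg.model_type, cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)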
