Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support tensor parallel using Pytorch 2.0 #3173

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kmehant
Copy link

@kmehant kmehant commented Oct 16, 2024

What does this PR do?

  1. Implements TorchTensorParallelPlugin to support TP with Pytorch 2.0. This work should be seen along with the PR feat: add support for tensor parallel using Pytorch 2.0 transformers#34194.
  2. Modifies dataloader to support passing same samples across TP ranks

Please review in conjunction with huggingface/transformers#34194

Results

See significant improvement in both memory and throughput compared against single gpu training, and FSDP across different settings (checkpointing on/off) and context lengths.

Done on two models

  1. ibm-granite/granite-8b-code-base-128k
  2. codellama/CodeLlama-7b-hf

Tables below show the max cuda memory and throughput for various configurations showing the potential of TP contributed in this PR. There is gains in both memory and throughput.

Model Method # of GPUs Context Length Batch Size Grad Checkpointing Cuda Max Mem (GiB) Tokens/Sec/GPU
ibm-granite/granite-8b-code-base-128k Single GPU non-distributed 1 8192 1 FALSE OOM NA
ibm-granite/granite-8b-code-base-128k FSDP 4 8192 1 FALSE OOM NA
ibm-granite/granite-8b-code-base-128k TP (This PR) 4 8192 1 FALSE 52.4 7675.4
Model Method # of GPUs Context Length Batch Size Grad Checkpointing Cuda Max Mem (GiB) Tokens/Sec/GPU
ibm-granite/granite-8b-code-base-128k Single GPU non-distributed 1 8192 1 TRUE OOM NA
ibm-granite/granite-8b-code-base-128k FSDP 4 8192 1 TRUE 29.975586 2256.896
ibm-granite/granite-8b-code-base-128k TP (This PR) 4 8192 1 TRUE 26.5 5935.5
Model Method # of GPUs Context Length Batch Size Grad Checkpointing Cuda Max Mem (GiB) Tokens/Sec/GPU
ibm-granite/granite-8b-code-base-128k Single GPU non-distributed 1 16384 1 FALSE OOM NA
ibm-granite/granite-8b-code-base-128k FSDP 4 16384 1 FALSE OOM NA
ibm-granite/granite-8b-code-base-128k TP (This PR) 4 16384 1 FALSE OOM NA
Model Method # of GPUs Context Length Batch Size Grad Checkpointing Cuda Max Mem (GiB) Tokens/Sec/GPU
ibm-granite/granite-8b-code-base-128k Single GPU non-distributed 1 16384 1 TRUE OOM NA
ibm-granite/granite-8b-code-base-128k FSDP 4 16384 1 TRUE 36.8 2084.864
ibm-granite/granite-8b-code-base-128k TP (This PR) 4 16384 1 TRUE 33.5 5692.5
Model Method # of GPUs Context Length Batch Size Grad Checkpointing Cuda Max Mem (GiB) Tokens/Sec/GPU
codellama/CodeLlama-7b-hf Single GPU non-distributed 1 8192 1 FALSE OOM NA
codellama/CodeLlama-7b-hf FSDP 4 8192 1 FALSE 70.7 3560
codellama/CodeLlama-7b-hf TP (This PR) 4 8192 1 FALSE 42.8 9216
Model Method # of GPUs Context Length Batch Size Grad Checkpointing Cuda Max Mem (GiB) Tokens/Sec/GPU
codellama/CodeLlama-7b-hf Single GPU non-distributed 1 8192 1 TRUE 75.3 2849
codellama/CodeLlama-7b-hf FSDP 4 8192 1 TRUE 26.4 5957
codellama/CodeLlama-7b-hf TP (This PR) 4 8192 1 TRUE 21.4 7125
Model Method # of GPUs Context Length Batch Size Grad Checkpointing Cuda Max Mem (GiB) Tokens/Sec/GPU
codellama/CodeLlama-7b-hf Single GPU non-distributed 1 16384 1 FALSE OOM NA
codellama/CodeLlama-7b-hf FSDP 4 16384 1 FALSE OOM NA
codellama/CodeLlama-7b-hf TP (This PR) 4 16384 1 FALSE OOM NA
Model Method # of GPUs Context Length Batch Size Grad Checkpointing Cuda Max Mem (GiB) Tokens/Sec/GPU
codellama/CodeLlama-7b-hf Single GPU non-distributed 1 16384 1 TRUE 75.3 2599
codellama/CodeLlama-7b-hf FSDP 4 16384 1 TRUE 30.1 2433
codellama/CodeLlama-7b-hf TP (This PR) 4 16384 1 TRUE 26.6 6873

Fixes # (issue)
huggingface/transformers#32470

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

I have cycles to bring in more improvements over this PR to bring in Pytorch TP support to HF. Looking forward. Thank you

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant