feat: support tensor parallel & Data loader #3173

kmehant · 2024-10-16T10:34:17Z

What does this PR do?

Implements TorchTensorParallelPlugin to support tensor parallelism (TP) with PyTorch 2.0. This work should be read together with these PRs: feat: add support for tensor parallel using Pytorch transformers#34194 and
Simplify Tensor Parallel implementation with PyTorch TP transformers#34184.
Modifies the dataloader so that the same samples are passed to all TP ranks.

Please review in conjunction with huggingface/transformers#34194

Results

We see significant improvements in both memory and throughput compared with single-GPU training and with FSDP, across different settings (gradient checkpointing on/off) and context lengths.

The benchmarks were run on two models:

ibm-granite/granite-8b-code-base-128k
codellama/CodeLlama-7b-hf

The tables below show the max CUDA memory and throughput for various configurations, illustrating the potential of the TP support contributed in this PR. There are gains in both memory and throughput.

Note: The effective tokens/sec for FSDP is the per-GPU figure multiplied by the parallel factor (the number of GPUs/devices engaged in distributed training), whereas that is not the case with TP. Therefore, when effective throughput is considered, FSDP can come out ahead of TP. However, that may be compensated for by increasing the batch size, using the memory that TP frees up.
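
To make the note above concrete, here is a small sketch of the arithmetic. It assumes, as the note states, that FSDP ranks each consume different samples (so the per-GPU figure multiplies by the number of GPUs), while TP ranks all work on the same batch (so the per-GPU figure is already the effective one). The function name and the `"fsdp"`/`"tp"` labels are illustrative and not part of accelerate.

```python
def effective_tokens_per_sec(method: str, tokens_per_sec_per_gpu: float, num_gpus: int) -> float:
    """Return the whole-job throughput for a given parallelism method."""
    if method == "fsdp":
        # Data-parallel style: every GPU consumes different samples.
        return tokens_per_sec_per_gpu * num_gpus
    if method == "tp":
        # Tensor parallel: all GPUs cooperate on the same samples.
        return tokens_per_sec_per_gpu
    raise ValueError(f"unknown method: {method!r}")


# Figures taken from the granite-8b, 8192-context, grad-checkpointing rows
# of the results tables in this PR description.
fsdp = effective_tokens_per_sec("fsdp", 2256.896, num_gpus=4)
tp = effective_tokens_per_sec("tp", 5935.5, num_gpus=4)
print(f"FSDP effective: {fsdp:.1f} tokens/sec, TP effective: {tp:.1f} tokens/sec")
```

Under these assumptions FSDP comes out ahead on effective throughput for that row, which is the trade-off the note describes.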

Model	Method	# of GPUs	Context Length	Batch Size	Grad Checkpointing	Cuda Max Mem (GiB)	Tokens/Sec/GPU
ibm-granite/granite-8b-code-base-128k	Single GPU non-distributed	1	8192	1	FALSE	OOM	NA
ibm-granite/granite-8b-code-base-128k	FSDP	4	8192	1	FALSE	OOM	NA
ibm-granite/granite-8b-code-base-128k	TP (This PR)	4	8192	1	FALSE	52.4	7675.4

Model	Method	# of GPUs	Context Length	Batch Size	Grad Checkpointing	Cuda Max Mem (GiB)	Tokens/Sec/GPU
ibm-granite/granite-8b-code-base-128k	Single GPU non-distributed	1	8192	1	TRUE	OOM	NA
ibm-granite/granite-8b-code-base-128k	FSDP	4	8192	1	TRUE	29.975586	2256.896
ibm-granite/granite-8b-code-base-128k	TP (This PR)	4	8192	1	TRUE	26.5	5935.5

Model	Method	# of GPUs	Context Length	Batch Size	Grad Checkpointing	Cuda Max Mem (GiB)	Tokens/Sec/GPU
ibm-granite/granite-8b-code-base-128k	Single GPU non-distributed	1	16384	1	FALSE	OOM	NA
ibm-granite/granite-8b-code-base-128k	FSDP	4	16384	1	FALSE	OOM	NA
ibm-granite/granite-8b-code-base-128k	TP (This PR)	4	16384	1	FALSE	OOM	NA

Model	Method	# of GPUs	Context Length	Batch Size	Grad Checkpointing	Cuda Max Mem (GiB)	Tokens/Sec/GPU
ibm-granite/granite-8b-code-base-128k	Single GPU non-distributed	1	16384	1	TRUE	OOM	NA
ibm-granite/granite-8b-code-base-128k	FSDP	4	16384	1	TRUE	36.8	2084.864
ibm-granite/granite-8b-code-base-128k	TP (This PR)	4	16384	1	TRUE	33.5	5692.5

Model	Method	# of GPUs	Context Length	Batch Size	Grad Checkpointing	Cuda Max Mem (GiB)	Tokens/Sec/GPU
codellama/CodeLlama-7b-hf	Single GPU non-distributed	1	8192	1	FALSE	OOM	NA
codellama/CodeLlama-7b-hf	FSDP	4	8192	1	FALSE	70.7	3560
codellama/CodeLlama-7b-hf	TP (This PR)	4	8192	1	FALSE	42.8	9216

Model	Method	# of GPUs	Context Length	Batch Size	Grad Checkpointing	Cuda Max Mem (GiB)	Tokens/Sec/GPU
codellama/CodeLlama-7b-hf	Single GPU non-distributed	1	8192	1	TRUE	75.3	2849
codellama/CodeLlama-7b-hf	FSDP	4	8192	1	TRUE	26.4	5957
codellama/CodeLlama-7b-hf	TP (This PR)	4	8192	1	TRUE	21.4	7125

Model	Method	# of GPUs	Context Length	Batch Size	Grad Checkpointing	Cuda Max Mem (GiB)	Tokens/Sec/GPU
codellama/CodeLlama-7b-hf	Single GPU non-distributed	1	16384	1	FALSE	OOM	NA
codellama/CodeLlama-7b-hf	FSDP	4	16384	1	FALSE	OOM	NA
codellama/CodeLlama-7b-hf	TP (This PR)	4	16384	1	FALSE	OOM	NA

Model	Method	# of GPUs	Context Length	Batch Size	Grad Checkpointing	Cuda Max Mem (GiB)	Tokens/Sec/GPU
codellama/CodeLlama-7b-hf	Single GPU non-distributed	1	16384	1	TRUE	75.3	2599
codellama/CodeLlama-7b-hf	FSDP	4	16384	1	TRUE	30.1	2433
codellama/CodeLlama-7b-hf	TP (This PR)	4	16384	1	TRUE	26.6	6873

Fixes # (issue)
huggingface/transformers#32470

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

I have cycles to build on this PR with further improvements that bring PyTorch TP support to HF. Looking forward to it. Thank you.

muellerzr

Thanks! This looks great to me. However, we do still need to update this to work with accelerate config, which happens in commands/config and commands/launch. Would you like to do so?

HuggingFaceDocBuilderDev · 2024-10-29T13:01:11Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

muellerzr · 2024-10-29T13:05:44Z

@kmehant if you rebase from main this should fix the failures (tl;dr we had py 3.8 EOL)

kmehant · 2024-10-29T16:04:31Z

@muellerzr Appreciate your response. I would like to bring to your notice the below two points.

This dataloader is written for the paradigm (call it paradigm 1) in which the master process fetches the data and distributes it to all worker processes. The more general paradigm (call it paradigm 2), in which every process fetches its own data samples (in the TP case this has to be the same batch across processes), is not covered in this PR.
This PR has a soft dependency for applying the TP plan to the model, since it consists of two parts: the TP workflow through the accelerate plugin, plus the dataloader.
1. The first part of the PR applies tensor parallelism to the model, as shown here - https://github.com/huggingface/accelerate/pull/3173/files#diff-2d7515874eaecac2687c7fc1a9c720be53f802bf14b4c3dcebe14ad443d075dcR1467 creating a soft dependency on feat: add support for tensor parallel using Pytorch transformers#34194 (part of this would be superseded by Simplify Tensor Parallel implementation with PyTorch TP transformers#34184, which carries a different interface for applying the TP plan to the model).
2. The second part is the dataloader.

For point (1), I can keep this PR simple by supporting only paradigm 1, and address paradigm 2 in another PR.
For point (2), I can remove the application of TP from this PR, keeping it simple and independent. The removed part can be added in a separate PR once point (2)(i) is completed.
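
The two dataloading paradigms can be sketched with a toy, single-process simulation. This is not the dataloader in this PR: the function names are hypothetical, and the hand-off that a real job would do with a collective (for example a broadcast) is simulated here by copying lists. The point is only that, under TP, every rank must end up with the same batches either way.

```python
from typing import Iterator, List, Sequence


def batches(dataset: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive batches from an in-memory dataset."""
    for start in range(0, len(dataset), batch_size):
        yield list(dataset[start:start + batch_size])


def paradigm_1(dataset: Sequence[int], batch_size: int, world_size: int) -> List[List[List[int]]]:
    """Master process fetches each batch and hands a copy to every rank."""
    per_rank: List[List[List[int]]] = [[] for _ in range(world_size)]
    for batch in batches(dataset, batch_size):  # only "rank 0" reads the data
        for rank in range(world_size):
            per_rank[rank].append(list(batch))  # stands in for a broadcast
    return per_rank


def paradigm_2(dataset: Sequence[int], batch_size: int, world_size: int) -> List[List[List[int]]]:
    """Every rank reads the dataset itself, in the same order, with no hand-off."""
    return [list(batches(dataset, batch_size)) for _ in range(world_size)]


data = list(range(8))
p1 = paradigm_1(data, batch_size=2, world_size=4)
p2 = paradigm_2(data, batch_size=2, world_size=4)
# All TP ranks see identical batches under both paradigms.
print(all(rank_batches == p1[0] for rank_batches in p1), p1 == p2)
```

Paradigm 1 trades extra communication for a single reader; paradigm 2 avoids the hand-off but relies on every rank iterating the data in exactly the same order.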

WDYT?

BenjaminBossan

Thanks for this PR, this looks nice. I have a few smaller comments, please take a look.

Also, please ensure that make quality passes.

BenjaminBossan · 2024-10-29T16:55:22Z

src/accelerate/accelerator.py

@@ -1457,6 +1463,8 @@ def prepare_model(self, model: torch.nn.Module, device_placement: bool = None, e
                    )
                    if self.ddp_handler is not None:
                        self.ddp_handler.register_comm_hook(model)
+            elif self.distributed_type == DistributedType.TP:
+                model.apply_tensor_parallel(self.state.torch_tp_plugin.torch_device_mesh["tp"])


apply_tensor_parallel will be implemented in huggingface/transformers#34194 but only for select model architectures, right? Should we check this and if not present, raise an appropriate error?

Hi @BenjaminBossan

The tensor_parallel() interface will be implemented here - https://github.com/huggingface/transformers/pull/34184/files#diff-6b72b98c4c2dcfc6cc606843917733f5d858374fbc22a735ff483bbc0c1e63eaR5017

I have raised a comment on providing a way to know if tensor_parallel succeeded or not. Once that PR is ready, we can handle it here. WDYT?

Okay, let's see what the final result will be. But we could also check hasattr(model, "apply_tensor_parallel") or would that not work?

@BenjaminBossan
The function tensor_parallel is being added to the parent class PreTrainedModel, so all model classes would have this function regardless of whether TP is actually available for a given model.

Ah I see, in that case it is crucial to add a method or attribute to check the support for TP.

@BenjaminBossan

A has_tp_plan property has been added, so I updated the code here to fail when the model has no TP support. Thank you.
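
The point made in this thread can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins, not transformers or accelerate code; only the names `tensor_parallel` and `has_tp_plan` come from the discussion. It shows why `hasattr` cannot distinguish supported from unsupported models when the method lives on a shared parent class, and how an explicit flag lets the caller fail early.

```python
class BaseModel:
    """Stand-in for a shared parent class (hypothetical)."""

    has_tp_plan = False  # subclasses that ship a TP plan override this

    def tensor_parallel(self, device_mesh):
        # Inherited by every subclass, whether or not it has a plan.
        self.mesh = device_mesh


class ModelWithPlan(BaseModel):
    has_tp_plan = True


class ModelWithoutPlan(BaseModel):
    pass


def prepare_for_tp(model, device_mesh):
    """Fail early with a clear message when the model has no TP support."""
    if not getattr(model, "has_tp_plan", False):
        raise NotImplementedError(
            f"{type(model).__name__} does not support tensor parallelism"
        )
    model.tensor_parallel(device_mesh)
    return model


# hasattr cannot tell the two apart: both inherit the method.
print(hasattr(ModelWithPlan(), "tensor_parallel"), hasattr(ModelWithoutPlan(), "tensor_parallel"))
# prints: True True
```

With the explicit flag, `prepare_for_tp(ModelWithoutPlan(), mesh)` raises `NotImplementedError` instead of applying nothing without any error.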

src/accelerate/data_loader.py

src/accelerate/utils/dataclasses.py

BenjaminBossan · 2024-10-29T17:04:03Z

src/accelerate/utils/dataclasses.py

+    )
+
+    def __post_init__(self):
+        from torch.distributed.device_mesh import init_device_mesh


Should we check the minimum PyTorch and transformers versions? Not sure whether this is the best place for it or somewhere else, Zach?

I'm not 100% sure there, because ideally we'd have this API work with custom models and transformer ones. If we decide just transformers, yes we should guard

I see, good point. Still, torch could be checked, right?

Added a torch version check, thanks.
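
For readers who want to see what such a guard can look like, here is a self-contained sketch. It is not the check added in this PR: the helper names are hypothetical, and the minimum version `"2.3.0"` is an assumed placeholder rather than a value taken from the diff. Versions are compared as integer tuples so that, for example, `2.10.0` correctly sorts after `2.3.0`.

```python
import re
from typing import Tuple

MIN_TORCH_FOR_TP = "2.3.0"  # assumed placeholder, not taken from the PR


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.5.1+cu121' or '2.3.0a0' into a tuple of ints such as (2, 5, 1)."""
    public = version.split("+", 1)[0]  # drop a local build tag like '+cu121'
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        # Pre-release suffixes ('a0', 'rc1') are ignored in this simple sketch.
        parts.append(int(match.group()))
    return tuple(parts)


def require_min_version(installed: str, minimum: str, package: str = "torch") -> None:
    """Raise ImportError if the installed version is older than the minimum."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(
            f"{package}>={minimum} is required for tensor parallelism, found {installed}"
        )


require_min_version("2.5.1+cu121", MIN_TORCH_FOR_TP)  # passes silently
```

In a real plugin the installed version would come from `torch.__version__`, and a project's existing version-comparison utilities would likely be preferred over a hand-rolled parser like this one.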

kmehant · 2024-11-04T15:28:54Z

@muellerzr can I work on this #3173 (review) in a separate PR?

I have fetched and rebased my PR and addressed all the review comments. Thank you.

HoangCongDuc · 2024-11-17T01:34:06Z

This feature is really useful, thank you @kmehant. I wonder if it is possible to combine tensor parallel with data parallel after this PR, say, TP for same-node parallelism and DP for multi-node parallelism.

kmehant · 2024-11-17T01:58:50Z

Hi @HoangCongDuc, support for that is in my TODOs but not covered in this PR, should be coming soon after discussing with HF. Thank you.

BenjaminBossan · 2024-11-18T16:03:52Z

src/accelerate/accelerator.py

@@ -1461,6 +1467,10 @@ def prepare_model(self, model: torch.nn.Module, device_placement: bool = None, e
                    )
                    if self.ddp_handler is not None:
                        self.ddp_handler.register_comm_hook(model)
+            elif self.distributed_type == DistributedType.TP:
+                if not model.has_tp_plan:


It appears that the attribute was renamed to supports_tp_plan? Maybe let's wait until that other PR is merged so that this one does not need to be adapted constantly.

@BenjaminBossan
Yes, it got modified. I have updated this PR again and also that PR to transformers is now merged :)

muellerzr

Thanks! Overall the code looks sound, what I'd appreciate however is if we could bring this the last 10% of the way through:

Actually implementing this in the CLI and setting the env variable up properly
Writing some tests (src/accelerate/test_utils/scripts/test_tensor_parallel.py IMO)

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

kmehant · 2024-12-14T08:42:15Z

@muellerzr @BenjaminBossan

I have implemented the CLI part:
1. accelerate launch usage
2. accelerate config usage
I have also added a test that runs TP through the CLI using the existing scripts.

Let me know if I have missed out something. Thank you.

kmehant mentioned this pull request Oct 16, 2024

feat: add support for tensor parallel using Pytorch huggingface/transformers#34194

Open

5 tasks

kmehant changed the title ~~feat: support tensor parallel using Pytorch 2.0~~ feat: support tensor parallel using Pytorch 2.0 & Data loader Oct 24, 2024

kwen2501 mentioned this pull request Oct 25, 2024

Simplify Tensor Parallel implementation with PyTorch TP huggingface/transformers#34184

Merged

5 tasks

muellerzr approved these changes Oct 29, 2024

muellerzr requested a review from BenjaminBossan October 29, 2024 12:59

BenjaminBossan requested changes Oct 29, 2024

kmehant force-pushed the tp branch 5 times, most recently from da67cba to c096d40 Compare November 4, 2024 09:57

kmehant force-pushed the tp branch from c096d40 to 189e202 Compare November 15, 2024 17:10

BenjaminBossan reviewed Nov 18, 2024

kmehant force-pushed the tp branch from 189e202 to 9e86e7f Compare November 19, 2024 13:52

muellerzr reviewed Nov 20, 2024

kmehant force-pushed the tp branch from 9e86e7f to c7f1a3e Compare December 13, 2024 10:47

kmehant changed the title ~~feat: support tensor parallel using Pytorch 2.0 & Data loader~~ feat: support tensor parallel & Data loader Dec 13, 2024

kmehant force-pushed the tp branch 2 times, most recently from cf95d34 to c80c030 Compare December 13, 2024 10:49

feat: add dataloader for TP and n-dim parallel in non-dispatch mode

0991429

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

kmehant force-pushed the tp branch from c80c030 to 0991429 Compare December 13, 2024 10:55

feat: add support for CLI usage

780ae7b

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

kmehant force-pushed the tp branch 2 times, most recently from e08c364 to 0fe41dd Compare December 13, 2024 16:14

kmehant force-pushed the tp branch 7 times, most recently from 8f1071b to d7ed517 Compare December 13, 2024 19:42

feat: add tests for cli usage of TP and plugin

4923481

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

kmehant force-pushed the tp branch 3 times, most recently from 3d8eeca to d5cc290 Compare December 13, 2024 20:15

fix: add pad token when not present

6f016f5

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

kmehant force-pushed the tp branch from d5cc290 to 6f016f5 Compare December 13, 2024 21:08
