
Simplify Tensor Parallel implementation with PyTorch TP #34184

Merged
merged 26 commits into huggingface:main on Nov 18, 2024

Conversation

kwen2501
Contributor

@kwen2501 kwen2501 commented Oct 15, 2024

What does this PR do?

This PR uses the torch.distributed.tensor.parallel subpackage to implement Tensor Parallel for Llama (as an example).

The motivation is multi-fold:

  1. to make the modeling code as simple as the single-worker case:
    all manual TP implementations under if self.config.pretraining_tp > 1 can be removed.

  2. to make tensor parallelism easily accessible to users:
    added a model.tensor_parallel(device_mesh) method that lets users turn a single-process model into a parallel model (a minimal calling sketch follows below). Please point me to a better place for this function/method if PreTrainedModel is not the preferred home.
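
For illustration, here is a minimal sketch of how that method might be called; the launch setup mirrors the user script below, and the mesh shape and checkpoint are assumptions:

import os
import torch
import torch.distributed as dist

from transformers import AutoModelForCausalLM

# Sketch only: assumes a torchrun launch so that RANK and WORLD_SIZE are set.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
device = torch.device(f"cuda:{rank}")
dist.init_process_group("nccl", device_id=device)
device_mesh = dist.init_device_mesh("cuda", (world_size,))

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model.tensor_parallel(device_mesh)  # shards supported submodules across the mesh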

Note:

This is just a demo. Removing the if self.config.pretraining_tp > 1 branches may break workflows elsewhere; it is only meant to show how much code can be saved, and hopefully those workflows can be redirected to the mechanism in this PR.

User script:

import os
import torch
import torch.distributed as dist

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

def main(rank, world_size) -> None:
    device = torch.device(f"cuda:{rank}")
    dist.init_process_group("nccl", device_id=device)

    # tp_plan="auto" loads and shards the model across the ranks of the default process group
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        tp_plan="auto",
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    prompt = "Can I help"

    inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    outputs = model(inputs)

    # Greedy pick of the next token from the last position's logits
    next_token_logits = outputs[0][:, -1, :]
    next_token = torch.argmax(next_token_logits, dim=-1)
    response = tokenizer.batch_decode(next_token)
    print(f"{rank=}: {response}")


if __name__ == "__main__":
    main(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))

Launch command:
torchrun --standalone --nproc-per-node 4 tp_hf.py

Output:

Can I help
rank=3: [' you']
rank=1: [' you']
rank=2: [' you']
rank=0: [' you']

Test

CUDA_VISIBLE_DEVICES=0,1 pytest -sv tests/tp/test_tp.py

Doc

perf_infer_gpu_multi.md

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Models:

Library:

Integrations:

HF projects:

Cc PyTorch Distributed Team:
@gnadathur @wanchaol @fduwjj @wz337 @wconstab @tianyu-l

@kwen2501
Contributor Author

kwen2501 commented Oct 15, 2024

Cc: @lessw2020 @HamidShojanazeri
Cc: @jerryzh168

Collaborator

@ArthurZucker ArthurZucker left a comment

Looks really really nice! I think the only thing I would change is to have the tp_plan loadable and potentially defined in the config. Given that it does not hold any weights, it would allow other frameworks to rely on it as well! 🤗

@ArthurZucker
Collaborator

Also we would place

    device = torch.device(f"cuda:{rank}")
    dist.init_process_group("nccl", device_id=device)
    device_mesh = dist.init_device_mesh("cuda", (world_size,))

    with device:
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
        )

potentially inside from_pretrained, this way we automatically set things for users, with them only having to change the way they call their script (with torchrun or not)

@ArthurZucker
Collaborator

WDYT?

@ArthurZucker
Collaborator

We will also add this to the configuration_llama.py:

DEFAULT_TP_PLAN = {
    "model.layer.*.q_proj":"column_parallel"
}

this way we can init the llama config with it!

@ArthurZucker
Collaborator

cc @SunMarc as per our discussion!

@kwen2501
Contributor Author

kwen2501 commented Oct 23, 2024

Thanks @ArthurZucker @SunMarc for your review.

Good idea to move the tp_plan to configuration_llama.py. I made that change in the current version. Indeed, modeling_llama.py becomes cleaner now, and it no longer depends on torch.distributed....

Since the top-level model can have a different base_model_prefix ("model", "transformer", etc.), I could not directly put FQNs like "model.layers.*" in configuration_llama.py. So I prepared a _base_model_tp_plan whose keys start with "layers.*", i.e. relative to the base model LlamaModel. This works because model.tensor_parallel() searches for and applies _tp_plan recursively. That way we can serve all top-level models regardless of what prefix name they give to LlamaModel, and we can also serve a bare LlamaModel.

I also made the TP styles "type neutral" in the config, as you suggested. Translation to torch TP types happens only when applying TP.
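
For reference, a sketch of what such a neutral-typed plan in configuration_llama.py could look like; the exact module names and style assignments here are illustrative, not necessarily the PR's final plan:

# Illustrative sketch; keys face the base LlamaModel, hence no "model." prefix.
_base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}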

@kwen2501
Contributor Author

kwen2501 commented Oct 23, 2024

Re moving distributed / device_mesh init code inside from_pretrained:

It is a good direction in making the workflow even simpler for users. As in, wouldn't it be nice if the model is distributed after it is created? I think a UI design would be helpful as to how users express the intention of "please create the model in a distributed manner".

For example, from_pretrained could accept a related argument, like:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_mesh=device_mesh,
)

If that kwarg is given, I think the intention is clear enough, and we can call tensor_parallel() for the user. Or even better, initialize partial weights only -- this is supported by the proposed torch TP too, but it will require slightly more work to integrate with the weight-loading part of from_pretrained. I can try making a demo in a follow-up PR if you are interested.

Re launcher: I might slightly prefer the above "explicit expression" to an "implicit dependency" on torchrun, to keep HF transformers "launcher neutral", since some users may use SLURM, mp.spawn (sketched below), etc. Another reason is that init_process_group and init_device_mesh can take kwargs (such as timeout, options, etc.), which might be a bit much for from_pretrained to absorb if it calls them internally.

Regardless, I agree that it would be very nice if we could create distributed models out of from_pretrained directly, and I am really excited to help with that :)
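
To illustrate the launcher-neutral point, here is a sketch of driving the same workflow with mp.spawn instead of torchrun; the rendezvous address and single-node assumption are illustrative:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    # Without torchrun, rendezvous info must be passed explicitly (address is an assumption).
    dist.init_process_group(
        "nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
        device_id=torch.device(f"cuda:{rank}"),
    )
    # ... build the model and run inference as in the user script above ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)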

Collaborator

@ArthurZucker ArthurZucker left a comment

Looks a lot better!
I think we want to have some small tests to make sure we properly serialize the TP-Plan, and also properly dispatch the model (slow tests, we can probably add them ourselves!)

  • would need some documentation as well, we can do that in a follow up PR for sure

One of the most important questions now is: what's the approach regarding ForCausalLM?

This might be for @SunMarc as it's transformers internal:

_base_model_tp_plan will be consumed by the PreTrainedModel ➡ this way we can enforce some checks:

  • check that all models that inherit from PreTrainedModel have a TP plan
  • check nesting (ForCausalLM has a module that has a TP plan; if it has other modules, they should have a TP plan too)

Comment on lines 336 to 354
def translate_to_torch_parallel_style(style: str):
    """
    In model configurations, we use a neutral type (string) to specify parallel
    styles, here we translate them into torch.distributed tensor-parallel
    types.
    """
    if not isinstance(style, str):
        raise ValueError(
            f"Unsupported parallel style type {type(style)}, expected str"
        )

    if style == "colwise":
        return ColwiseParallel()
    elif style == "rowwise":
        return RowwiseParallel()
    elif style == "colwise_rep":
        return ColwiseParallel(output_layouts=Replicate())
    else:
        raise ValueError(f"Unsupported parallel style value: {style}")
Collaborator

A mapping from tp style to the correct function might be better.
We are also thinking of defining a TensorParallelConfig, your feedback is welcome here, as I don't know the variety of classes / args that might be used!

Contributor Author

Thanks for the comment! Indeed a mapping style would look better.
The only caveat is that the value returned here is an object rather than a constant or class (note the () after ColwiseParallel). With a map we would always return the same object -- the parallelize_module API may be able to support that, I am just not sure there is a contract guaranteeing it today.


Pitching in :)

We should be able to use the same object since it applies required parallel operation to the module and returns a new copy - https://github.com/pytorch/pytorch/blob/86d4b7d60b264cae5a04a1b20719bcd7a5752a4c/torch/distributed/tensor/parallel/api.py#L95

Have also tested it empirically while benchmarking (#34194)

Thanks!
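
For concreteness, a sketch of the mapping-based variant discussed in this thread, assuming (per the API linked above) that the same style object can safely be reused across modules:

from torch.distributed.tensor import Replicate
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

# Pre-built style objects, shared across modules.
_TP_STYLE_MAP = {
    "colwise": ColwiseParallel(),
    "rowwise": RowwiseParallel(),
    "colwise_rep": ColwiseParallel(output_layouts=Replicate()),
}

def translate_to_torch_parallel_style(style: str):
    if not isinstance(style, str):
        raise ValueError(f"Unsupported parallel style type {type(style)}, expected str")
    try:
        return _TP_STYLE_MAP[style]
    except KeyError:
        raise ValueError(f"Unsupported parallel style value: {style}") from None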

Collaborator

Sounds good!

@@ -141,6 +141,16 @@ class LlamaConfig(PretrainedConfig):

model_type = "llama"
keys_to_ignore_at_inference = ["past_key_values"]
# Default tensor parallel plan for base model `LlamaModel`
_base_model_tp_plan = {

Do we want to allow this for external use by removing the leading _, so that users can define TP plan tweaks from config.json?

Given that, shall we also allow providing a custom TP plan as input to LlamaConfig() that overrides the default?

Contributor Author

Yeah, good idea. We can make this public once we prove things work.

Collaborator

Yep, base_model_tp_plan should be supported as input to the PreTrainedConfig!


So, this variable base_model_tp_plan has to be added to PretrainedConfig

class PretrainedConfig(PushToHubMixin):

with a default value of an empty dict {}, which I believe is the best possible default for any config subclass inheriting from it.
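
A small sketch of the user-facing effect being proposed, assuming the kwarg is wired through PretrainedConfig with an empty-dict default as described; the plan contents are illustrative:

from transformers import LlamaConfig

# Override the default plan from user code (sketch; assumes kwarg support lands as proposed).
custom_plan = {"layers.*.self_attn.q_proj": "colwise"}
config = LlamaConfig(base_model_tp_plan=custom_plan)
print(config.base_model_tp_plan)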

Contributor Author

Thanks @kmehant @ArthurZucker for the suggestion. I moved base_model_tp_plan to PretrainedConfig in the latest commit.

@kwen2501
Contributor Author

Summarizing my discussion with @kmehant re collaboration plan between this PR and #34194:

  • Both PRs agree on the interest in having PyTorch TP in transformers :)
  • We agree that the integration should happen at a broad base so that it can scale to a set of models; thus we agree on picking PreTrainedModel as the integration point.
  • #34194 has changes on the trainer side, plus a corresponding PR on the Accelerate side, so those two PRs focus on workflow-level integration. Hopefully this PR can serve as a base for them and, once landed, unblock the other two PRs.

@ArthurZucker
Collaborator

Sounds good, let's focus on this one first then!

@kwen2501
Contributor Author

Dumb question: is there a way to bypass the check_repository_consistency test? Thanks! @ArthurZucker @SunMarc

@kwen2501
Contributor Author

Okay, I just temporarily disabled the "Copied from" comments in some derivative models. Will add them back once we extend TP plans to those models.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment

Hey! So basically, in the places where you are disabling it, you just need to do something like:
# Copied from transformers.models.llama.modeling_llama.LlamaModel with Llama->Olmoe,lm_head->new_name for example

and before anything else, run make fix-copies, which will apply your changes!

@kwen2501
Contributor Author

Thanks @ArthurZucker. I tried that method, but it doesn't seem to work.
Specifically, the test checks for consistency between modeling_gemma.py and modular_gemma.py. I edited neither, but the CI complains that they are not consistent. Here is the link.

@ArthurZucker
Collaborator

Ah, that is because you probably modified the modeling file directly, when these two have a modular file, which is the source of truth!

@kwen2501
Contributor Author

kwen2501 commented Nov 15, 2024

@ArthurZucker Thanks again for the suggestions.

  1. A test has been added at tests/tp/test_tp.py; run command:
     CUDA_VISIBLE_DEVICES=0,1 pytest -sv tests/tp/test_tp.py
  2. A new doc has been added: perf_infer_gpu_multi.md, titled "Multi-GPU inference".
    A usage example and the benchmark figure have been added to the doc.
    @ArthurZucker can you please help merge this image upload for the doc? Thanks!
  3. @kmehant The .has_tp_plan property has been added (a minimal sketch follows below).
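
For reference, a minimal sketch of what such a property could look like; the class name here is a stand-in, and the real implementation on PreTrainedModel may differ:

from typing import Optional

class _TPPlanMixin:  # illustrative stand-in for PreTrainedModel
    _tp_plan: Optional[dict] = None

    @property
    def has_tp_plan(self) -> bool:
        # True once a tensor parallel plan has been attached (e.g. in post_init from the config).
        return self._tp_plan is not None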

@ArthurZucker
Collaborator

Merged! Let me review one last time 🤗

Collaborator

@ArthurZucker ArthurZucker left a comment

Great work! 🔥 Some small things left to do, but super close to merge!

docs/source/en/perf_infer_gpu_multi.md (outdated; resolved)
docs/source/en/performance.md (resolved)
@@ -1442,6 +1449,9 @@ def post_init(self):
"""
self.init_weights()
self._backward_compatibility_gradient_checkpointing()
# If current model is a base model, attach `base_model_tp_plan` from config
if self.base_model is self:
Collaborator

Yeah, but self._tp_plan should be None only for the base model no?
We could maybe add something like: if self.base_model is not self and self._tp_plan is None and self.supports_tp_plan, raise an error in the future, to force people to add the TP plan.
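
A hedged sketch of that suggestion; attribute names follow the discussion above, and whether this check belongs in post_init (and whether it warns or raises at first) is left open:

# Inside PreTrainedModel.post_init (sketch of the suggested check, not the merged code)
if self.base_model is self:
    # Base model: pick up the plan declared on the config.
    self._tp_plan = getattr(self.config, "base_model_tp_plan", None)
elif self._tp_plan is None and getattr(self, "supports_tp_plan", False):
    # Derived model (e.g. *ForCausalLM) that claims TP support but declares no plan.
    raise ValueError(f"{self.__class__.__name__} supports TP but does not define a _tp_plan.")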

src/transformers/modeling_utils.py (outdated; resolved)
src/transformers/modeling_utils.py (resolved)
src/transformers/modeling_utils.py (outdated; resolved)
tests/tp/test_tp.py (outdated; resolved)
Collaborator

LGTM, it would be nice to have a generic way for all models that have supports_tp_plan=True instead of just Llama, setting everyone up for success in adding support for other models!

Contributor Author

Yeah, both @kmehant and I are interested in extending this mechanism to other models. We can think of a way to "copy" the tp_plan to other models with similar architectures (just like how HF copies modeling code between similar models). Maybe covering all AutoModelForCausalLM models would be a good next step. A follow-up PR will come soon :)

Contributor Author

@kwen2501 kwen2501 left a comment

Thanks @ArthurZucker for the close look! I fixed the places you pointed out.

docs/source/en/performance.md (resolved)
tests/tp/test_tp.py (outdated; resolved)
src/transformers/modeling_utils.py (outdated; resolved)
src/transformers/modeling_utils.py (outdated; resolved)
docs/source/en/perf_infer_gpu_multi.md (outdated; resolved)

@kwen2501
Contributor Author

@ArthurZucker CI is all green now (the previous failure seems to have come from infra instability). Is there anything else I should add before merge? Thanks!

@ArthurZucker
Collaborator

Nope merging! 🤗

@ArthurZucker ArthurZucker merged commit 20142ab into huggingface:main Nov 18, 2024
22 checks passed
@ArthurZucker
Collaborator

I wanted to check if torch._C.get_device_accelerator had been available for a while or not!

@ArthurZucker
Collaborator

Super kudos for the PR! 🚀

@loadams
Contributor

loadams commented Nov 18, 2024

Hi @ArthurZucker and @kwen2501 - it seems this PR now requires that users have torch 2.5+, is that correct? When trying to run the DeepSpeed unit tests with the latest transformers and torch 2.3 I'm getting errors around not being able to import from torch.distributed.tensor:

from torch.distributed.tensor import Replicate

The specific error:

E   RuntimeError: Failed to import transformers.models.vision_encoder_decoder.modeling_vision_encoder_decoder because of the following error (look up to see its traceback):
E   Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
E   cannot import name 'Replicate' from 'torch.distributed.tensor' (/usr/local/lib/python3.10/dist-packages/torch/distributed/tensor/__init__.py)

Opened #34795 to track this, in case that gives it better visibility.

@kmehant

kmehant commented Nov 19, 2024

@ArthurZucker @kwen2501 Planning an HF blog? I'm in!

@kwen2501
Contributor Author

@loadams Thanks for reporting it, and sorry about the break.
I think that's because Replicate was previously in torch.distributed._tensor.
Let me open a PR to add backward-compat support.
Cc: @wz337
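
For reference, a minimal sketch of the kind of backward-compatible import that could address this; the actual fix may differ:

try:
    # torch >= 2.5: DTensor placement types are public here.
    from torch.distributed.tensor import Replicate
except ImportError:
    # Older torch: Replicate lives in the private module.
    from torch.distributed._tensor import Replicate

from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel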

@ArthurZucker
Collaborator

ArthurZucker commented Nov 25, 2024

Sure! Won't have bandwidth to start a blog but feel free to do it! 🤗

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
…34184)

* Simplify Tensor Parallel implementation with PyTorch TP

* Move tp_plan to config

* Lint

* Format and warning

* Disable copy-from check

* Conditionally get attr from config

* make fix-copies

* Move base_model_tp_plan to PretrainedConfig

* Move TP into from_pretrained

* Add device context for load

* Do not serialize

* Move _tp_plan setting to post_init

* Add has_tp_plan

* Add test_tp

* Add 'Multi-gpu inference' doc

* Add backward support for device type identification

* Auto-detect accelerator

* supports_tp_plan

* copyright year

* Fix copy
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024