Enhancing Hugging Face Models with Tensor Parallelism for Large-Scale Model Support 🚀 #32470

SeungyounShin · 2024-08-06T14:47:44Z

Feature request

Description

This feature proposal aims to update Hugging Face's support for tensor parallelism (TP) to accommodate the increasing size and complexity of models such as LLaMA 3.1, Nemotron-4-340B-Instruct, and others, which have surpassed the capabilities of current training frameworks like TRL + DeepSpeed.

Currently, the Hugging Face codebase is outdated concerning these advancements. Although parallelism requires careful customization based on hardware setup, dataset size, sequence length, and model size, implementing TP across many Hugging Face models is crucial.

Proposal

With the introduction of tensor parallelism in PyTorch 2.0, the previous method of creating processes per device and model in the Megatron style is no longer efficient.

Key Changes:

Refactoring Code for TP:
- Remove the use of kwargs in favor of more straightforward TP implementations, as PyTorch parallel plans do not accommodate kwargs.
- Refactor PyTorch models to incorporate TP effectively.

Current Limitations:
- Existing implementations, such as in modeling_llama, are not trainable and are incompatible with torch.compile for inference optimization.

Future Integration:
- As models scale to large sizes, 8-way Tensor Parallel is becoming standard.
- This change would enable Accelerate to later support TP + FSDP (Fully Sharded Data Parallel), which many users could benefit from.
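
For readers unfamiliar with what a TP plan actually shards, here is a minimal pure-Python sketch of the two basic styles. It is not the PyTorch implementation (PyTorch exposes `ColwiseParallel` and `RowwiseParallel` in `torch.distributed.tensor.parallel` for this); the helper names below are made up for illustration, and the "ranks" are simulated in a single process. Column-wise sharding splits the output dimension and concatenates per-rank outputs, while row-wise sharding splits the input dimension and sums partial outputs.

```python
# Pure-Python illustration of column-wise and row-wise sharding of y = x @ w.
# Matrices are lists of rows; all helper names here are hypothetical.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def column_parallel(x, w, parts):
    """Each simulated rank holds a column block of w.
    Per-rank outputs are concatenated along the feature axis (an all-gather)."""
    partial = [matmul(x, w_shard) for w_shard in split_columns(w, parts)]
    return [sum((p[r] for p in partial), []) for r in range(len(x))]

def row_parallel(x, w, parts):
    """Each simulated rank holds a row block of w and the matching column block of x.
    Partial outputs are summed element-wise (an all-reduce)."""
    x_shards = split_columns(x, parts)
    w_shards = split_rows(w, parts)
    partial = [matmul(xs, ws) for xs, ws in zip(x_shards, w_shards)]
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partial)]

x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 2, 1, 0]]
full = matmul(x, w)  # unsharded reference result
```

Both `column_parallel(x, w, 2)` and `row_parallel(x, w, 2)` reproduce `full`; the difference is which collective is needed afterwards and how much of `w` each rank has to hold.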

Personal Contribution

I have personally developed code that allows LLaMA to run entirely with TP, observing that it handles longer token sequences with less memory than FSDP. However, I have not submitted a pull request due to the need for comprehensive code refactoring.

Call to Action

If Hugging Face acknowledges this need, I am willing to contribute further if there is an overarching plan for abstraction and integration.

Motivation

The motivation behind this proposal is to address the limitations and frustrations experienced when using Hugging Face with the current parallelism approaches, especially for large-scale models like LLaMA 3.1 and Nemotron-4-340B-Instruct. As models grow in complexity, existing frameworks struggle to support efficient training and inference.

Current Issues with Existing Solutions:

NVIDIA Megatron-LM: Lacks compiler-level optimization and is somewhat outdated.
Tensor Parallel by BlackSamorez: Also lacks compiler-level optimization and is outdated.
DeepSpeed: Primarily uses data parallelism (DP); ZeRO is closer to model parallelism (MP) than to tensor parallelism (TP). It also has issues with ZeRO Stage 3.
AWS Neuron Distributed: Potentially supports TP in distributed settings, though not tested extensively.
PyTorch Lightning: Implements TP but is not applicable to Hugging Face models.
NVIDIA NeMo: Uses PyTorch Lightning, underscoring the need for Hugging Face to adopt TP, including coding styles like avoiding kwargs.

Implementing tensor parallelism (TP) in Hugging Face models is crucial to keep up with the trend towards larger models and to enhance compatibility with modern optimization techniques like torch.compile.

Your contribution

Contribution

I am willing to contribute to implementing tensor parallelism (TP) within the Hugging Face ecosystem. To facilitate this, I would appreciate guidance on the following aspects:

Integration Approach: Whether TP should be applied during model initialization, such as with AutoModelForCausalLM, or managed externally using torchrun.
Automatic Initialization: Whether the implementation should initialize torch.distributed automatically, without requiring explicit commands from users.
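
To make the second question concrete: torchrun exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to every worker, so a loader could detect a distributed launch from the environment before deciding whether to initialize anything. The sketch below only does the detection; the function names are hypothetical and nothing here is existing transformers behavior.

```python
import os

def read_launcher_env(environ=None):
    """Return (rank, local_rank, world_size) when torchrun-style variables are set,
    otherwise None. `environ` defaults to os.environ and is injectable for testing."""
    env = os.environ if environ is None else environ
    keys = ("RANK", "LOCAL_RANK", "WORLD_SIZE")
    if not all(key in env for key in keys):
        return None
    return tuple(int(env[key]) for key in keys)

def should_init_distributed(environ=None):
    """Hypothetical policy: initialize only for a multi-process launch."""
    info = read_launcher_env(environ)
    return info is not None and info[2] > 1

# In a real integration this is the point where torch.distributed.init_process_group()
# would be called; that call is intentionally left out of this sketch.
```

With this kind of check, a plain `python script.py` run stays single-process, and the same script under torchrun opts into distributed setup without extra user code.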

With a defined plan or abstraction level, I can work on refactoring the necessary code and submit a pull request to integrate TP effectively. My experience with TP, particularly with LLaMA, has demonstrated its efficiency in handling large models with reduced memory usage compared to current methods.


amyeroberts · 2024-08-07T10:08:00Z

Hi @SeungyounShin, thanks for opening up this feature request and for writing up such a detailed report!

TP is definitely something we'd like to support. How easy that is will depend a bit on the necessary changes to modeling code as we need to be able to maintain backwards compatibility with our models and certain standards of readability etc. From the description above it sounds do-able!

Would you like to open a PR showing the updates for one model, and we can iterate from there?

cc @ArthurZucker @LysandreJik @muellerzr @SunMarc

SeungyounShin · 2024-08-07T14:23:29Z

Sounds good. However, as I'm currently busy with other commitments, I'll be progressing slowly. My goal is to wrap the Hugging Face models using PyTorch's core features to ensure compatibility with FSDP and torch.compile, allowing lowering at the compiler level while maintaining backward compatibility. I will start with LLaMA 3.1 first.

Additionally, I plan to allow users to easily apply tensor parallelism with a pre-configured plan. For example:

# Apply pre-configured tensor parallel plan
model = AutoModelForCausalLM.from_pretrained("model name", tensor_parallel='auto')

# Compiler-level optimization (I noticed that this saves almost 10GB for `LLaMA 3.1 8B`)
model = torch.compile(model)

# Backward compatibility (enables full fine-tuning of 16K context for `LLaMA 3.1 8B`)
output = model(prompt_ids)
loss_fn(output, label).backward()

# Generation (Utilizing available GPU)
model.generate(prompts)

# Saving (Priority: low)
## Since the model weight type is DTensor (Distributed), we need an all-gather operation before saving to disk
model.save_pretrained(...)

Both training and generation should work seamlessly with this setup.

cc @amyeroberts

muellerzr · 2024-08-08T12:50:33Z

Looking forward to it @SeungyounShin !

ArthurZucker · 2024-08-08T13:01:38Z

btw modeling_llama does support compile.
Regarding tensor parallel, a first thought is to potentially store a dictionary that maps each layer's name to "row" or "column", which will be used on the fly when loading the checkpoint for TP.

Abstracting a little bit to prevent us from having to add code for new models, apart from the mapping!

{ "model.layers.1.mlp.down_proj": "column", ....}

something like that, which could go in the config.json directly
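
A rough sketch of how such a mapping could drive sharding at load time, in plain Python. Everything here is an assumption for illustration: the wildcard key format, the `TP_PLAN` entries, and the function names are not an existing transformers API. The weight layout follows `torch.nn.Linear`, i.e. `(out_features, in_features)`, so "column" slices rows of the stored weight (output features) and "row" slices its columns (input features).

```python
from fnmatch import fnmatchcase

# Hypothetical plan; "*" stands for any layer index. The styles chosen are illustrative.
TP_PLAN = {
    "model.layers.*.self_attn.q_proj": "column",
    "model.layers.*.self_attn.o_proj": "row",
}

def lookup_style(param_name, plan):
    """Return the sharding style for a parameter name, or None to replicate it."""
    for pattern, style in plan.items():
        if fnmatchcase(param_name, pattern):
            return style
    return None

def shard_weight(weight, style, rank, world_size):
    """Slice a (out_features, in_features) weight, given as a list of rows, for one rank."""
    if style == "column":
        # Column parallel: split output features, i.e. dim 0 of the stored weight.
        step = len(weight) // world_size
        return weight[rank * step:(rank + 1) * step]
    if style == "row":
        # Row parallel: split input features, i.e. dim 1 of the stored weight.
        step = len(weight[0]) // world_size
        return [row[rank * step:(rank + 1) * step] for row in weight]
    return weight  # no entry in the plan: keep a full replica

weight = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
style = lookup_style("model.layers.1.self_attn.q_proj", TP_PLAN)
local = shard_weight(weight, style, rank=1, world_size=2)
```

Because the plan is plain data, it could be serialized next to the other config fields, and new models would only need new entries rather than new modeling code.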

skyshine102 · 2024-09-04T06:16:12Z

Looking forward to this PR too!
TP-style training requires every rank within a TP group to see the same input data, while different DP ranks see different data.
Do datasets & accelerator currently support this out of the box?
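
The requirement can be sketched as index arithmetic. This assumes consecutive global ranks form one TP group, which is a layout choice made for this sketch and not something the thread specifies; the function names are hypothetical.

```python
def mesh_coords(rank, tp_size):
    """Map a global rank to (dp_rank, tp_rank), assuming consecutive ranks share a TP group."""
    return rank // tp_size, rank % tp_size

def sample_indices_for_rank(rank, tp_size, world_size, num_samples):
    """Shard the dataset by DP rank only, so all ranks in a TP group read identical samples."""
    dp_rank, _ = mesh_coords(rank, tp_size)
    dp_size = world_size // tp_size
    return list(range(dp_rank, num_samples, dp_size))

# 4 GPUs with 2-way TP: ranks 0 and 1 form one TP group, ranks 2 and 3 the other.
per_rank = [sample_indices_for_rank(r, tp_size=2, world_size=4, num_samples=8) for r in range(4)]
```

The key point is that the sampler is parameterized by the DP rank and DP world size, not by the global rank and global world size.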

ArthurZucker · 2024-09-06T11:15:20Z

BTW there is some work here: huggingface/optimum#2011 .
I am not sure it's out of the box, you probably need a custom collator

zhenglongjiepheonix · 2024-09-06T22:33:09Z

I have been working on automatic TP from fx graph modules traced by torch.compile over the last three months; see https://github.com/huggingface/optimum/tree/main/optimum/fx/parallelization for details. The goal is to start from a transformers model like LlamaForCausalLM, trace it using torch dynamo, analyze the traced graph to generate a TP plan, and then replace the layers with their parallelized counterparts. I think it's close to what you have proposed here; contributions are also welcome!

kmehant · 2024-09-09T06:03:29Z

I am thinking, the use of device mesh has to be adopted and exposed to some level to extend support for 2D (TP + DP/FSDP) and 3D (TP + FSDP + DP) parallelism with control.

Attaching an old discussion thread for similar support - #13690 (comment) #10321
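
To illustrate what a 2D mesh exposes, here is a plain-Python sketch that lays out ranks in a `(dp, tp)` grid in row-major order and derives the process groups along each axis. PyTorch provides `init_device_mesh` in `torch.distributed.device_mesh` for the real thing; the helpers below are hypothetical stand-ins that only compute rank lists.

```python
def build_mesh(dp_size, tp_size):
    """Row-major grid of global ranks: one row per DP index, one column per TP index."""
    return [[dp * tp_size + tp for tp in range(tp_size)] for dp in range(dp_size)]

def tp_groups(mesh):
    """Ranks that shard the same layers together: the rows of the mesh."""
    return [list(row) for row in mesh]

def dp_groups(mesh):
    """Ranks that hold the same shard and differ only in data: the columns of the mesh."""
    return [list(col) for col in zip(*mesh)]

mesh = build_mesh(dp_size=2, tp_size=4)  # 8 GPUs: 2-way DP/FSDP x 4-way TP
```

Here `tp_groups(mesh)` gives the groups over which TP collectives run, and `dp_groups(mesh)` gives the groups over which gradients (or FSDP shards) are synchronized. A 3D setup would add one more axis to the grid.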

zero-to-agi · 2024-10-04T08:34:07Z

I am looking forward to using TP in the transformers library. Is this task currently in progress?

ArthurZucker · 2024-10-05T14:08:46Z

No PRs are linked. We are also debating this internally, given that tools like nanotron, accelerate, deepspeed etc. work great. Our aim would probably be MP / TP at inference time. The debate is about whether or not this should live in transformers! 🤗 Feedback from the community is most welcome!

winglian · 2024-12-07T00:07:56Z

@kmehant I have an old PR here for device mesh support in transformers and a corresponding accelerate PR that we can re-open to get device mesh support in. I probably won't be able to get back to working on it until the new year though.

SeungyounShin added the Feature request label Aug 6, 2024

This was referenced Oct 16, 2024

feat: support tensor parallel & Data loader huggingface/accelerate#3173

Open

feat: add support for tensor parallel using Pytorch #34194

Open

kmehant mentioned this issue Oct 30, 2024

Simplify Tensor Parallel implementation with PyTorch TP #34184

Merged

