3outeille/transformers backend (Dense model only) #2048
base: main
Conversation
Hi @3outeille! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
wwwjn left a comment:
Thanks for the great work again, left some comments.
    num_layers: int,
    input_weight: int = 1,
    output_weight: int = 1,
    include_rotary_emb: bool = False,
This change is not included in https://github.com/huggingface/torchtitan/pull/1/files. Can you quickly remind me why we need to include the rotary embedding when PP is applied?
And in torchtitan models we make rotary_emb a function, not a module, but it looks like for HF models rotary_emb is a module; is that why this module needs to be included in PP?
That was to address the issue you mentioned here: huggingface#1 (comment)
Can we modify the function signature and add a parameter in pytorch/torchtitan@main/torchtitan/distributed/pipeline_parallel.py#L41 instead of keeping 2 copies? I feel it's very easy for the copies to diverge in the future.
> And in torchtitan models we make rotary_emb a function, not a module, but it looks like for HF models rotary_emb is a module; is that why this module needs to be included in PP?

Yes, exactly!
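For context, a minimal sketch of why such a flag matters, assuming a hypothetical per-stage module-selection helper (the function name `module_names_for_stage` and the module names `tok_embeddings`, `norm`, `output` are illustrative, not torchtitan's or the PR's actual code): when rotary_emb is an nn.Module (HF style) instead of a precomputed-frequencies function (torchtitan style), every pipeline stage that runs attention has to keep it, not just the first stage.

```python
# Minimal sketch; names are illustrative, not the actual torchtitan API.
def module_names_for_stage(
    stage_idx: int,
    num_stages: int,
    num_layers: int,
    include_rotary_emb: bool = False,
) -> list[str]:
    """Top-level module names kept on one pipeline stage."""
    layers_per_stage = num_layers // num_stages
    start = stage_idx * layers_per_stage
    stop = num_layers if stage_idx == num_stages - 1 else start + layers_per_stage
    names = [f"layers.{i}" for i in range(start, stop)]
    if stage_idx == 0:
        names.append("tok_embeddings")   # embeddings only on the first stage
    if stage_idx == num_stages - 1:
        names += ["norm", "output"]      # final norm / lm head only on the last stage
    if include_rotary_emb:
        # HF models compute position embeddings via a rotary_emb nn.Module,
        # so it must live on every stage; torchtitan's functional rotary
        # embedding does not need this.
        names.append("rotary_emb")
    return names

# Example: 4 layers over 2 stages, both stages keep "rotary_emb".
# module_names_for_stage(1, 2, 4, include_rotary_emb=True)
# -> ['layers.2', 'layers.3', 'norm', 'output', 'rotary_emb']
```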
    setattr(model, module_name, None)
    # Replace with Identity or None based on configuration
    replacement = (
        nn.Identity() if use_identity_for_missing_modules else None
Could you quickly remind me why we need to use Identity() here?
I think it's because HF defines their models without guards like if tok_embeddings is None.
I still worry that such identities break DCP and could be the source of PP numerics issues. The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?
cc @fegin if you know this definitively.
> The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?

It seems like the PP ranks are restored perfectly, because we have a perfect match with Qwen but not with Llama, for example (cf. the screenshot at huggingface#4).
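To make the trade-off in this thread concrete, here is a minimal sketch of the replacement logic being discussed, assuming a hypothetical pruning helper (the name `prune_modules_for_stage`, the `owned` set, and the flag are illustrative, not the PR's exact code): HF forward() code calls submodules unconditionally, so a stage that does not own, say, the embedding cannot simply hold None for it.

```python
import torch.nn as nn

# Sketch only; helper name and flag are illustrative, not the PR's exact code.
def prune_modules_for_stage(
    model: nn.Module,
    owned: set[str],
    use_identity_for_missing_modules: bool = True,
) -> nn.Module:
    """Drop top-level modules that a pipeline stage does not own."""
    for module_name, _ in list(model.named_children()):
        if module_name in owned:
            continue
        # nn.Identity() keeps the attribute present and callable, so HF
        # forward() code does not break; None would be cleaner for DCP /
        # state dicts but needs `if module is None` guards that HF models
        # do not have.
        replacement = nn.Identity() if use_identity_for_missing_modules else None
        setattr(model, module_name, replacement)
    return model
```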
    )

def apply_fsdp(
Reading this function, it is the same as the apply_fsdp function in llama4/parallelize (I know we will keep MoE capability for the next PR). Can we reuse the apply_fsdp function from llama4 and avoid keeping multiple copies?
Oh, I see the difference. The only difference is moe_block = transformer_block.mlp at line 337: in transformers models, the MoE module is named mlp instead of moe. Can we use the same getter/setter approach to rename it in model.py, so we can reuse the apply_fsdp function from llama4?
I don't have a strong opinion on this, but I'm a little concerned that if we have several copies, they will diverge easily in the future.
Valid concern. I'll reuse apply_fsdp from llama3 for now, as this PR handles only dense models. It will make more sense to handle the getter/setter in the MoE PR.
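For the record, a toy sketch of the getter/setter idea for the future MoE PR (the TransformerBlock class below is illustrative, not HF's or the PR's code; only the mlp-vs-moe naming comes from the discussion above): a read-only property lets the same module answer to both names, so a llama4-style apply_fsdp that reads `transformer_block.moe` could be reused unchanged.

```python
import torch.nn as nn

# Illustrative sketch, not the PR's code.
class TransformerBlock(nn.Module):
    def __init__(self, dim: int = 16, hidden: int = 64):
        super().__init__()
        # HF naming: the (MoE) feed-forward module lives under `mlp`.
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    @property
    def moe(self) -> nn.Module:
        # Alias read by a llama4-style apply_fsdp (`transformer_block.moe`).
        # Kept read-only on purpose: a matching setter would be bypassed by
        # nn.Module.__setattr__ for Module values, and fully_shard wraps
        # modules in place, so reassignment should not be needed.
        return self.mlp

    def forward(self, x):
        return x + self.mlp(x)

block = TransformerBlock()
assert block.moe is block.mlp  # both names resolve to the same module
```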
torchtitan/experiments/transformers_backend/tests/integration_tests.py (outdated; resolved)
tianyu-l left a comment:
Please address final comments.
torchtitan/experiments/transformers_backend/tests/integration_tests.py (outdated; resolved)
    setattr(model, module_name, None)
    # Replace with Identity or None based on configuration
    replacement = (
        nn.Identity() if use_identity_for_missing_modules else None
I think it's because HF defines their models without guards like if tok_embeddings is None.
I still worry that such identities break DCP and could be the source of PP numerics issues. The concrete question is: when loading from a seed checkpoint, are all the PP ranks restored perfectly?
cc @fegin if you know this definitively.
It sounds like the changes are caused by the specific way transformers defines its models. Then let's fork the changed functions into experiments/transformers_backend/. I apologize for the back & forth.
But isn't the compromise good enough? Copy-pasting means not noticing changes in pipeline parallel later on.
Context
Reference PR: huggingface#1
This PR enables:
- meta-llama/Llama-3.2-1B
- microsoft/phi-2
- Qwen/Qwen2.5-7B
- mistralai/Mistral-7B-v0.1
- ByteDance-Seed/Seed-Coder-8B-Instruct
- Qwen/Qwen3-4B-Instruct-2507
- arcee-ai/AFM-4.5B
- ibm-granite/granite-3b-code-base-2k
- baidu/ERNIE-4.5-0.3B-Base-PT
- kyutai/helium-1-preview-2b
- allenai/OLMo-7B-hf
- mistralai/Ministral-8B-Instruct-2410 (loss and grad_norm start very high)
Usage
- transformers==4.57.1
- torchtitan/torchtitan/experiments/transformers_backend/configs/qwen3.toml
- LOG_RANK=7 CONFIG_FILE=<YOUR_PATH>/torchtitan/experiments/transformers_backend/configs/qwen3.toml ./run_train.sh --job.custom_config_module=torchtitan.experiments.transformers_backend.job_config --compile.enable
Testing methodology
- Baseline: FSDP=2 vs FSDP=2 & <other //-ism>
- test_hf_integration.py is going to do the following (each nd-parallelism run is diffed against the baseline, cf. diff_baseline_vs_nd_parallelism.log; a sketch of such a diff is below):

    results/
      |_ meta-llama
         |_ Llama-3.2-1B
            |_ debugmodel/
               |_ seed_checkpoint/
                  |_ config.toml
                  |_ seed.slurm
                  |_ step-0/
                     |_ ....
               |_ fsdp2_tp1_cp1_pp1/
                  |_ config.toml
                  |_ nd_parallelism.slurm
                  |_ nd_parallelism.log
               |_ fsdp2_tp2_cp1_pp1/
                  |_ config.toml
                  |_ nd_parallelism.slurm
                  |_ nd_parallelism.log
                  |_ diff_baseline_vs_nd_parallelism.log
               |_ fsdp2_tp1_cp1_pp2/
                  |_ config.toml
                  |_ nd_parallelism.slurm
                  |_ nd_parallelism.log
                  |_ diff_baseline_vs_nd_parallelism.log
               |_ fsdp2_tp1_cp2_pp1/
                  |_ config.toml
                  |_ nd_parallelism.slurm
                  |_ nd_parallelism.log
                  |_ diff_baseline_vs_nd_parallelism.log
               |_ fsdp2_tp1_cp2_pp2/
                  |_ config.toml
                  |_ nd_parallelism.slurm
                  |_ nd_parallelism.log
                  |_ diff_baseline_vs_nd_parallelism.log
            |_ full/
               ...
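For reference, a small sketch of how a diff_baseline_vs_nd_parallelism.log comparison could be computed (the metric-line regex and the log format are assumptions, not the actual test_hf_integration.py logic): pull loss/grad_norm per step from two training logs and report the steps whose values are not bitwise identical as strings.

```python
import re

# Assumed metric line format, e.g. "step: 10  loss: 7.1234  grad_norm: 1.2345";
# the real torchtitan log format may differ.
METRIC_RE = re.compile(r"step:\s*(\d+).*?loss:\s*([\d.eE+-]+).*?grad_norm:\s*([\d.eE+-]+)")

def parse_metrics(log_path: str) -> dict[int, tuple[str, str]]:
    """Map step -> (loss, grad_norm), kept as strings so equality is bitwise."""
    metrics = {}
    with open(log_path) as f:
        for line in f:
            m = METRIC_RE.search(line)
            if m:
                step, loss, grad_norm = m.groups()
                metrics[int(step)] = (loss, grad_norm)
    return metrics

def diff_logs(baseline_log: str, candidate_log: str) -> dict[int, tuple]:
    """Steps where the nd-parallelism run diverges from the FSDP-only baseline."""
    base, cand = parse_metrics(baseline_log), parse_metrics(candidate_log)
    return {s: (base[s], cand[s]) for s in base if s in cand and base[s] != cand[s]}
```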
Further tasks
- build_optimizers_with_moe_load_balancing support for MoE
- FSDP=2 vs FSDP=2 + PP=2: the loss and grad_norm are not bitwise matching (but converging), while they are with the Torchtitan modeling (issue tracked in "Fix pp convergence to be bitwise", huggingface/torchtitan#4)
- import torch._dynamo.config; torch._dynamo.config.cache_size_limit = 128 to avoid graph recomputation when using torch.compile and activation checkpointing (snippet below)
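The last item above can be applied as a two-line snippet early in the training entrypoint (where to place it is a suggestion; the setting itself is quoted from the task list):

```python
# Raise the dynamo graph-cache limit so torch.compile + activation
# checkpointing does not keep recompiling graphs (value from the task above).
import torch._dynamo.config

torch._dynamo.config.cache_size_limit = 128
```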