added model definition conversion for llama3 #1441
Conversation
great work! left some initial comments
Please share a summary of all the great tests you did.
Also, this is a great time to revamp https://github.com/pytorch/torchtitan/blob/main/docs/checkpoint.md
torchtitan/train.py (outdated)
```diff
  # build model (using meta init)
- model_args = self.train_spec.model_args[job_config.model.flavor]
+ self.model_args = self.train_spec.model_args[job_config.model.flavor]
```
This is a hacky way to let CheckpointManager depend on model_args, indirectly via train_states.
Instead, let's make StateDictAdapter consume the model_args during init, and change the static methods to instance methods.
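A minimal sketch of the suggested shape, assuming an abstract `StateDictAdapter` base class (names here are illustrative, not the exact torchtitan API):

```python
class StateDictAdapter:
    """Base class; concrete adapters implement the HF <-> torchtitan mapping."""

    def __init__(self, model_args):
        # Keep model_args on the instance so conversion methods can consult it
        # directly (e.g. head counts for the RoPE permutation), instead of
        # smuggling it in through train_states.
        self.model_args = model_args

    def to_hf(self, state_dict: dict) -> dict:
        raise NotImplementedError

    def from_hf(self, hf_state_dict: dict) -> dict:
        raise NotImplementedError
```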
```python
abstract_key = re.sub(r"(\d+)", "{}", key, count=1)
layer_num = re.search(r"\d+", key).group(0)
new_key = Llama3StateDictAdapter.from_hf_map[abstract_key]
print(f"{new_key} in layer {layer_num}")
```
remove print
```python
    "model.norm.weight": "norm.weight",
    "lm_head.weight": "output.weight",
}
to_hf_map = {v: k for k, v in from_hf_map.items()}
```
what happens to `"model.layers.{}.self_attn.rotary_emb.inv_freq": None`?
we can do this in the to_hf method, since it's small, similar to https://github.com/pytorch/torchtune/blob/main/torchtune/models/llama4/_convert_weights.py#L223
> what happens to `"model.layers.{}.self_attn.rotary_emb.inv_freq": None`?

This `None` mapping is just for reference but effectively does nothing. The RoPE weights in torchtitan get dropped when mapping to HuggingFace, due to the differences in RoPE implementation.
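For illustration, one way the `from_hf` path can handle such `None` entries, reusing the regex-based key abstraction shown in the diff above (a sketch under those assumptions, not the exact merged implementation):

```python
import re

def from_hf(self, hf_state_dict: dict) -> dict:
    """Rename HF-format keys to torchtitan names; None mappings are dropped."""
    state_dict = {}
    for key, value in hf_state_dict.items():
        if "layers" in key:
            # Replace the first layer index with "{}" so it can be looked up
            # in the abstract from_hf_map, then re-insert the index afterwards.
            abstract_key = re.sub(r"(\d+)", "{}", key, count=1)
            layer_num = re.search(r"\d+", key).group(0)
            new_key = self.from_hf_map[abstract_key]
            if new_key is None:
                # e.g. "model.layers.{}.self_attn.rotary_emb.inv_freq":
                # torchtitan recomputes RoPE frequencies, so this buffer is dropped.
                continue
            new_key = new_key.format(layer_num)
        else:
            new_key = self.from_hf_map[key]
            if new_key is None:
                continue
        state_dict[new_key] = value
    return state_dict
```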
```python
class Llama3StateDictAdapter(StateDictAdapter):
    from_hf_map = {
```
We can set this as a constant in this file, similar to https://github.com/pytorch/torchtune/blob/main/torchtune/models/llama4/_convert_weights.py#L63
instead of a class variable.
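A small sketch of that layout (only a couple of the entries from the diff are repeated here; the constant names are illustrative):

```python
# Module-level constants instead of class attributes on Llama3StateDictAdapter.
_FROM_HF_MAP = {
    # ... remaining entries from the mapping above ...
    "model.norm.weight": "norm.weight",
    "lm_head.weight": "output.weight",
}
_TO_HF_MAP = {v: k for k, v in _FROM_HF_MAP.items()}
```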
scripts/convert_from_hf.py (outdated)
```python
        "./assets/tokenizer/Llama-3.1-8B",
    ]
)
tokenizer = build_hf_tokenizer(config)
```
you no longer need to do this after rebasing
docs/checkpoint.md (outdated)
```diff
@@ -1,19 +1,9 @@
-## How to convert a Llama 3 checkpoint for use in torchtitan
+# How to use checkpoints in TorchTitan
```
let's use `torchtitan` (not "TorchTitan") within the repo
docs/checkpoint.md (outdated)
> ### PyTorch Meta Llama
>
> If you want to continue training from an existing model checkpoint, the checkpoint must be in the DCP format expected by the checkpoint manager.
> An example script for converting the original Llama3 checkpoints into the expected DCP format can be found in `scripts/convert_llama_to_dcp.py`.
>
> The script expects a path to the original checkpoint files, and a path to an output directory:
> ```bash
> python -m scripts.convert_from_llama <input_dir> <output_dir>
> ```
let's move this to the bottom of this section -- from now on the recommended way would be to/from HF weights
docs/checkpoint.md (outdated)
> `python -m torch.distributed.checkpoint.format_utils dcp_to_torch torchtitan/outputs/checkpoint/step-1000 checkpoint.pt`
>
> ### PyTorch Meta Llama
>
> If you want to continue training from an existing model checkpoint, the checkpoint must be in the DCP format expected by the checkpoint manager.
this is not true anymore
docs/checkpoint.md (outdated)
> In some cases, you may want to partially load from a previous-trained checkpoint and modify certain settings, such as the number of GPUs or the current step. To achieve this, you can use the `exclude_from_loading` parameter to specify which keys should be excluded from loading.
> This parameter takes a list of string that should be excluded from loading.
>
> ### Torchtune
I think we can remove this section in general, as torchtune also supports HF checkpoints.
> 2. SAVE THE FINAL CHECKPOINT\
>    Once the above have been set, the final checkpoint at the end of the training step will consist of model only with the desired export dtype. However, if the final step has not been reached yet, full checkpoints will still be saved so that training can be resumed.
>
> 3. CONVERT SHARDED CHECKPOINTS TO A SINGLE FILE\
we can keep this part.
Basically in this section you'd have
- HF conversion (conversion scripts or during training save/load)
- the instruction here to convert DCP to torch (section title should be torch instead of torchtune)
- alternative way of converting llama to DCP, which is to be deprecated
torchtitan/train.py (outdated)
```diff
  checkpoint_config=job_config.checkpoint,
- sd_adapter=self.train_spec.state_dict_adapter,
+ sd_adapter=(
+     self.train_spec.state_dict_adapter(self.model_args)
```
No need to make `model_args` a class variable.

Suggested change:
```diff
- self.train_spec.state_dict_adapter(self.model_args)
+ self.train_spec.state_dict_adapter(model_args)
```

Could you revert all the changes in this file, except for this section?
```python
config_json = {
    "architectures": ["LlamaForCausalLM"],
    "hidde": "silu",
```
Suggested change:
```diff
- "hidde": "silu",
+ "hidden_act": "silu",
```
```python
config_json = {
    "architectures": ["LlamaForCausalLM"],
    "hidde": "silu",
    "hidden_size": self.model_args.dim,
    "intermediate_size": ffn_hidden_dim,
    "model_type": "llama",
    "num_attention_heads": self.model_args.n_heads,
    "num_hidden_layers": self.model_args.n_layers,
    "num_key_value_heads": self.model_args.n_kv_heads,
    "vocab_size": self.model_args.vocab_size,
}
```
This seems to be only a partial set of https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/config.json
I think the value of such a dict is little compared with the complexity. I suggest we remove this from the output and only keep the hf_state_dict.
torchtitan/components/checkpoint.py (outdated)
```python
if to_hf:
    config_path = Path(checkpoint_id) / "config.json"
    with config_path.open("w") as f:
        json.dump(config_json, f, indent=4)
```
In SPMD, multiple processes would run this, I think; not sure what will happen.
But anyway, let's not produce the config.json, for the reason I mentioned in the other file.
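If the file were kept, a minimal sketch of restricting the write to one process (assuming the default process group; `to_hf`, `checkpoint_id`, and `config_json` are as in the snippet above):

```python
import json
from pathlib import Path

import torch.distributed as dist

# Only rank 0 writes the shared config.json, so concurrent SPMD processes
# don't race on the same file.
is_rank_zero = not dist.is_initialized() or dist.get_rank() == 0
if to_hf and is_rank_zero:
    config_path = Path(checkpoint_id) / "config.json"
    with config_path.open("w") as f:
        json.dump(config_json, f, indent=4)
```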
docs/checkpoint.md (outdated)
```diff
@@ -1,19 +1,9 @@
-## How to convert a Llama 3 checkpoint for use in torchtitan
+# How to use checkpoints in TorchTitan
```
Suggested change:
```diff
- # How to use checkpoints in TorchTitan
+ # How to use checkpointing in `torchtitan`
```
tianyu-l left a comment:
Have you tested the correctness of save/load during training? It can be tricky because the values would be DTensors instead of full plain tensors.
Also, could you test whether save/load during training is correct when TP / PP is used?
PP is tricky because some modules may not exist on certain PP ranks.
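A hedged sketch of the kind of check this implies, assuming a recent PyTorch where `DTensor` is exposed under `torch.distributed.tensor` and both state dicts are available on every participating rank (under PP, compare only the keys a rank actually owns):

```python
import torch
from torch.distributed.tensor import DTensor


def assert_state_dicts_match(before: dict, after: dict) -> None:
    """Compare a model state dict before save and after load, shard-aware."""
    assert before.keys() == after.keys(), "key sets differ after reload"
    for name, a in before.items():
        b = after[name]
        # DTensor shards must be gathered into full tensors before comparing.
        if isinstance(a, DTensor):
            a = a.full_tensor()
        if isinstance(b, DTensor):
            b = b.full_tensor()
        torch.testing.assert_close(a, b, msg=f"mismatch in {name}")
```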
…onfig and configmanager. Additionally, it changes the state dict adapter from a static class to an instance-type class that consumes the model args in its init, to eliminate guesswork during state dict conversion. It also adds support for building the config.json when converting to HF, since this file is required by HF for important tasks such as inference. It also moves model_args to a separate file from train_spec to solve a circular import with state_dict_adapter.
…reased overhead and complexity. Updates the checkpoint.md.
tianyu-l left a comment:
LGTM! Great work!
This PR adds model state dict conversion between torchtitan (TT) and HuggingFace (HF).
It supports conversion both to and from HuggingFace and, importantly, performs a permutation on the q and k attention matrices to address the differences in RoPE implementation between native Llama and HuggingFace. Thanks to @rlrs and @vwxyzjn for finding and helping to fix this issue: #335, #1291 (comment)
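For reference, a sketch of the kind of permutation involved, following the convention in Hugging Face's Llama conversion script (the helper name and call sites are illustrative, not the exact code in this PR):

```python
import torch


def permute_for_hf(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    """Reorder q/k projection rows so HF's half-split RoPE (rotate_half)
    matches the interleaved RoPE used by the native Llama implementation."""
    return (
        w.view(n_heads, dim1 // n_heads // 2, 2, dim2)
        .transpose(1, 2)
        .reshape(dim1, dim2)
    )


# Hypothetical usage when exporting torchtitan weights to HF format:
#   wq_hf = permute_for_hf(wq, n_heads, n_heads * head_dim, dim)
#   wk_hf = permute_for_hf(wk, n_kv_heads, n_kv_heads * head_dim, dim)
```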
### Testing
I tested the correctness of the model conversion using two methods, greedy decoding and KL divergence, for a thorough comparison.
To test the from_hf script, I downloaded a model from the HF hub, converted it using the script, and ran forward passes in torchtitan. To test the to_hf script, I obtained the original Llama weights and used the verified llama->dcp script, then used the convert_to_hf script to convert those weights to a safetensors checkpoint.
For KL divergence, I tested both to_hf and from_hf against the baseline HF model, and compared these to the to_hf and from_hf weights produced without the permutation.

| permuted wq and wk | kl_div (hf->tt) | kl_div (tt->hf) |
| --- | --- | --- |
| ✅ | -3.8356e-15 | -1.4431e-14 |
| ❌ | 3.0658e-06 | 9.6463e-06 |

The KL divergence is many orders of magnitude higher when the permutation is skipped, showing that those probability distributions do not accurately reproduce the baseline HF model's. However, because only a small fraction of the weights needs to be permuted, the divergence in the incorrect case is still not very large and can be deceiving if used as the only evaluation metric. Therefore we also use greedy decoding with long generated sequences, scoring the exact-match ratio of generated tokens, to sanity check the results.
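A hedged sketch of the KL comparison itself (model loading and tokenization omitted; assumes both models produce logits over the same vocabulary for identical input IDs):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def kl_to_baseline(baseline_logits: torch.Tensor, converted_logits: torch.Tensor) -> float:
    """KL(baseline || converted), averaged over the batch dimension."""
    log_p = F.log_softmax(baseline_logits, dim=-1)   # baseline HF model
    log_q = F.log_softmax(converted_logits, dim=-1)  # converted model
    # F.kl_div expects the input in log space; with log_target=True the target
    # is in log space as well, and it computes KL(target || input).
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean").item()
```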
### Usage
The model conversion can be done in two ways. The first, direct way is to use the new convert_from_hf.py or convert_to_hf.py script, which requires loading the entire model weights into CPU memory. The second way is to use the training config options to load/save in HF format during training.
This should bring us one step closer to #1210.