Conversation


@wesleytruong wesleytruong commented Jul 22, 2025

This PR adds model state dict conversion between torchtitan (TT) and HuggingFace (HF).

It supports conversion both to and from HuggingFace and, importantly, applies a permutation to the q and k attention matrices to account for the difference in RoPE implementation between native Llama and HuggingFace. Thanks to @rlrs and @vwxyzjn for finding and helping to fix this issue: #335, #1291 (comment)
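For context, here is a minimal sketch of the kind of permutation involved, mirroring the `permute` helper in HF's Llama weight conversion script; the helper names and exact signature in this PR may differ (for wq, dim1 is n_heads * head_dim, while wk uses the kv-head count):

```python
import torch

def permute_to_hf(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    # Regroup the per-head rotary dimensions from the native interleaved-pair
    # layout into HF's "rotate-half" layout.
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

def permute_from_hf(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    # Inverse reordering, used when converting HF weights back to the native layout.
    return w.view(n_heads, 2, dim1 // n_heads // 2, dim2).transpose(1, 2).reshape(dim1, dim2)
```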

### Testing

I tested the correctness of the model conversion using two methods, greedy decoding and KL divergence, for a thorough comparison.

To test the from_hf script, I downloaded a model from the HF Hub, converted it with the script, and ran forward passes in torchtitan. To test the to_hf script, I obtained the original Llama weights, converted them with the verified llama->dcp script, and then used the convert_to_hf script to convert those weights to a safetensors checkpoint.

For KL divergence, I tested both the to_hf and from_hf conversions against the baseline HF model, and compared the results with the to_hf and from_hf weights produced without the permutation.

| permuted wq and wk | kl_div (hf->tt) | kl_div (tt->hf) |
| --- | --- | --- |
| ✅ | -3.8356e-15 | -1.4431e-14 |
| ❌ | 3.0658e-06 | 9.6463e-06 |

Comparing the two, the KL divergence is many orders of magnitude higher when the permutation is skipped, showing that those probability distributions do not accurately match the baseline HF model's. However, because only a small fraction of the weights needs to be permuted in this case, the divergence in the incorrect case is still not very large in absolute terms and can be deceiving if used as the only evaluation metric. Therefore we also use greedy decoding over long generated sequences, measuring the exact match ratio of generated tokens, to sanity check the results.
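As a reference for how such a comparison can be computed, a minimal sketch (the function and tensor names are illustrative, not the actual test code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_to_baseline(baseline_logits: torch.Tensor, converted_logits: torch.Tensor) -> float:
    # KL(baseline || converted) between next-token distributions, where both
    # inputs are [batch, seq_len, vocab] logits produced from identical input ids.
    log_p = F.log_softmax(baseline_logits.float(), dim=-1)
    log_q = F.log_softmax(converted_logits.float(), dim=-1)
    # kl_div expects log-probabilities as input; log_target=True treats the
    # target as log-probabilities as well.
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean").item()
```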

### Usage

The model conversion can be done in two ways. The first, direct way is to use the new convert_from_hf.py or convert_to_hf.py script, which requires loading the entire model's weights into CPU memory. The second way is to use the training config options to load/save checkpoints in HF format during training.

This should bring us one step closer to #1210

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 22, 2025
@wesleytruong wesleytruong force-pushed the llama_model_def_conversion branch from 7c0c955 to 9500978 Compare July 22, 2025 22:39

@tianyu-l tianyu-l left a comment


great work! left some initial comments

Please share a summary of all the great tests you did.

Also, this is a great time to revamp https://github.com/pytorch/torchtitan/blob/main/docs/checkpoint.md


# build model (using meta init)
model_args = self.train_spec.model_args[job_config.model.flavor]
self.model_args = self.train_spec.model_args[job_config.model.flavor]

This is a hacky way to let CheckpointManager depend on model_args, indirectly via train_states.

Instead, let's make StateDictAdapter consume the model_args during init, and change the static methods to instance methods.
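i.e., roughly this shape (a sketch of the suggested interface, not the final code):

```python
from abc import ABC, abstractmethod
from typing import Any

class StateDictAdapter(ABC):
    def __init__(self, model_args) -> None:
        # The adapter owns the model args it needs for conversion decisions,
        # e.g. head counts for the q/k permutation.
        self.model_args = model_args

    @abstractmethod
    def to_hf(self, state_dict: dict[str, Any]) -> dict[str, Any]: ...

    @abstractmethod
    def from_hf(self, hf_state_dict: dict[str, Any]) -> dict[str, Any]: ...
```

CheckpointManager would then receive an already constructed instance, e.g. `train_spec.state_dict_adapter(model_args)`.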

abstract_key = re.sub(r"(\d+)", "{}", key, count=1)
layer_num = re.search(r"\d+", key).group(0)
new_key = Llama3StateDictAdapter.from_hf_map[abstract_key]
print(f"{new_key} in layer {layer_num}")

remove print

"model.norm.weight": "norm.weight",
"lm_head.weight": "output.weight",
}
to_hf_map = {v: k for k, v in from_hf_map.items()}

what happens to "model.layers.{}.self_attn.rotary_emb.inv_freq": None,?


what happens to "model.layers.{}.self_attn.rotary_emb.inv_freq": None,?

This None mapping is just for reference and effectively does nothing. The RoPE weights in torchtitan get dropped when mapping to HuggingFace because of the RoPE implementation differences.



class Llama3StateDictAdapter(StateDictAdapter):
from_hf_map = {

We can set this as a constant in this file, similar to https://github.com/pytorch/torchtune/blob/main/torchtune/models/llama4/_convert_weights.py#L63
instead of a class variable.
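i.e., something along these lines (illustrative; the full key map lives in the adapter, only a few keys are shown here):

```python
# Module-level constants instead of class variables on the adapter.
FROM_HF_MAP = {
    "model.embed_tokens.weight": "tok_embeddings.weight",
    "model.layers.{}.self_attn.q_proj.weight": "layers.{}.attention.wq.weight",
    # ... remaining per-layer keys ...
    "model.norm.weight": "norm.weight",
    "lm_head.weight": "output.weight",
}
TO_HF_MAP = {v: k for k, v in FROM_HF_MAP.items()}
```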

"./assets/tokenizer/Llama-3.1-8B",
]
)
tokenizer = build_hf_tokenizer(config)

you no longer need to do this after rebasing

@wesleytruong wesleytruong force-pushed the llama_model_def_conversion branch 2 times, most recently from fb1ef04 to 5765d19 Compare July 23, 2025 23:41
@@ -1,19 +1,9 @@
## How to convert a Llama 3 checkpoint for use in torchtitan
# How to use checkpoints in TorchTitan

let's use `torchtitan` (aka torchtitan) within the repo

Comment on lines 57 to 64
### PyTorch Meta Llama

If you want to continue training from an existing model checkpoint, the checkpoint must be in the DCP format expected by the checkpoint manager.
An example script for converting the original Llama3 checkpoints into the expected DCP format can be found in `scripts/convert_llama_to_dcp.py`.

The script expects a path to the original checkpoint files, and a path to an output directory:
```bash
python -m scripts.convert_from_llama <input_dir> <output_dir>

let's move this to the bottom of this section -- from now on the recommended way would be to/from HF weights

python -m torch.distributed.checkpoint.format_utils dcp_to_torch torchtitan/outputs/checkpoint/step-1000 checkpoint.pt
### PyTorch Meta Llama

If you want to continue training from an existing model checkpoint, the checkpoint must be in the DCP format expected by the checkpoint manager.

this is not true anymore

In some cases, you may want to partially load from a previous-trained checkpoint and modify certain settings, such as the number of GPUs or the current step. To achieve this, you can use the `exclude_from_loading` parameter to specify which keys should be excluded from loading.
This parameter takes a list of string that should be excluded from loading.

### Torchtune

I think we can remove this section in general, as torchtune also supports HF checkpoints.

2. SAVE THE FINAL CHECKPOINT\
Once the above have been set, the final checkpoint at the end of the training step will consist of model only with the desired export dtype. However, if the final step has not been reached yet, full checkpoints will still be saved so that training can be resumed.

3. CONVERT SHARDED CHECKPOINTS TO A SINGLE FILE\

we can keep this part.
Basically in this section you'd have

  1. HF conversion (conversion scripts or during training save/load)
  2. the instruction here to convert DCP to torch (section title should be torch instead of torchtune)
  3. alternative way of converting llama to DCP, which is to be deprecated

checkpoint_config=job_config.checkpoint,
sd_adapter=self.train_spec.state_dict_adapter,
sd_adapter=(
self.train_spec.state_dict_adapter(self.model_args)

No need to make model_args a class variable

Suggested change
self.train_spec.state_dict_adapter(self.model_args)
self.train_spec.state_dict_adapter(model_args)

could you revert all the changes in this file, except for this section?


config_json = {
"architectures": ["LlamaForCausalLM"],
"hidde": "silu",

Suggested change
"hidde": "silu",
"hidden_act": "silu",

Comment on lines 102 to 112
config_json = {
"architectures": ["LlamaForCausalLM"],
"hidde": "silu",
"hidden_size": self.model_args.dim,
"intermediate_size": ffn_hidden_dim,
"model_type": "llama",
"num_attention_heads": self.model_args.n_heads,
"num_hidden_layers": self.model_args.n_layers,
"num_key_value_heads": self.model_args.n_kv_heads,
"vocab_size": self.model_args.vocab_size,
}

This seems to be only a partial set of https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/config.json
I think the value of such a dict is little compared with the complexity. I suggest we remove this from the output and only keep the hf_state_dict.

Comment on lines 408 to 412
if to_hf:
config_path = Path(checkpoint_id) / "config.json"
with config_path.open("w") as f:
json.dump(config_json, f, indent=4)


In SPMD, multiple processes would run this, I think; not sure what would happen.
But anyway, let's not produce the config.json, for the reason I mentioned in the other file.
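(For reference, if the config.json were kept, one common pattern is to have only rank 0 write it; a sketch under that assumption, not the PR's code:)

```python
import json
from pathlib import Path

import torch.distributed as dist

def maybe_write_config(checkpoint_id: str, config_json: dict) -> None:
    # Under SPMD, every rank reaches this point; let a single process write
    # the file to avoid several ranks racing on the same path.
    if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
        return
    config_path = Path(checkpoint_id) / "config.json"
    with config_path.open("w") as f:
        json.dump(config_json, f, indent=4)
```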

@@ -1,19 +1,9 @@
## How to convert a Llama 3 checkpoint for use in torchtitan
# How to use checkpoints in TorchTitan

Suggested change
# How to use checkpoints in TorchTitan
# How to use checkpointing in `torchtitan`

@tianyu-l tianyu-l left a comment


Have you tested the correctness of save/load during training? It can be tricky because the values would be DTensors instead of full plain tensors.
Also, could you test that save/load during training is correct when TP / PP is used?
PP is tricky because some modules may not exist on certain PP ranks.
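(When comparing values in such a test, one option is to materialize DTensors into plain tensors first; a sketch, assuming a PyTorch version where `torch.distributed.tensor.DTensor` and `DTensor.full_tensor()` are available:)

```python
from torch.distributed.tensor import DTensor

def to_plain_state_dict(state_dict: dict) -> dict:
    # Gather any sharded DTensors into full plain tensors so the converted
    # checkpoint can be compared value-for-value against a baseline.
    return {
        k: v.full_tensor() if isinstance(v, DTensor) else v
        for k, v in state_dict.items()
    }
```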

…onfig and configmanager.

Additionally, it changes the state dict adapter from a static class to an instance class that consumes the model args in its init, to eliminate guesswork during state dict conversion.

It also adds support for building config.json when converting to HF, since this file is required by HF for important tasks such as inference.

It also moves model_args to a separate file from train_spec to solve a circular import with state_dict_adapter.
@wesleytruong wesleytruong force-pushed the llama_model_def_conversion branch from fb05185 to e7b98c9 Compare July 24, 2025 08:50
@wesleytruong wesleytruong changed the title added model definition converison for llama3 added model definition conversis\on for llama3 Jul 24, 2025
@wesleytruong wesleytruong changed the title added model definition conversis\on for llama3 added model definition conversion for llama3 Jul 24, 2025
@wesleytruong wesleytruong force-pushed the llama_model_def_conversion branch from e7b98c9 to 63c3fc5 Compare July 24, 2025 08:52
…reased overhead and complexity

Updates the checkpoint.md

@tianyu-l tianyu-l left a comment


LGTM! Great work!

@tianyu-l tianyu-l merged commit 70592cb into main Jul 24, 2025
8 of 9 checks passed
@tianyu-l tianyu-l deleted the llama_model_def_conversion branch July 24, 2025 22:51
idoh pushed a commit to idoh/torchtitan that referenced this pull request Jul 28, 2025
allenwang28 added a commit that referenced this pull request Jul 29, 2025
Copied from github.com//pull/1441, tested manually via
forge

---------

Co-authored-by: Allen Wang <allencwang@fb.com>
bentherien pushed a commit to bentherien/torchtitan_ that referenced this pull request Aug 5, 2025
bentherien pushed a commit to bentherien/torchtitan_ that referenced this pull request Aug 5, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025