added model definition conversion for llama3 #1441
Conversation
great work! left some initial comments
Please share a summary of all the great tests you did.
Also, this is a great time to revamp https://github.com/pytorch/torchtitan/blob/main/docs/checkpoint.md
torchtitan/train.py (outdated)
```diff
  # build model (using meta init)
- model_args = self.train_spec.model_args[job_config.model.flavor]
+ self.model_args = self.train_spec.model_args[job_config.model.flavor]
```
This is a hacky way to let CheckpointManager depend on model_args, indirectly via train_states.
Instead, let's make StateDictAdapter consume the model_args during init, and change the static methods to instance methods.
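A minimal sketch of the suggested shape, assuming an abstract `StateDictAdapter` base class (names here are illustrative, not the exact torchtitan API):

```python
class StateDictAdapter:
    """Base class; concrete adapters implement the HF <-> torchtitan mapping."""

    def __init__(self, model_args):
        # Keep model_args on the instance so conversion methods can consult it
        # directly (e.g. head counts for the RoPE permutation), instead of
        # smuggling it in through train_states.
        self.model_args = model_args

    def to_hf(self, state_dict: dict) -> dict:
        raise NotImplementedError

    def from_hf(self, hf_state_dict: dict) -> dict:
        raise NotImplementedError
```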
```python
abstract_key = re.sub(r"(\d+)", "{}", key, count=1)
layer_num = re.search(r"\d+", key).group(0)
new_key = Llama3StateDictAdapter.from_hf_map[abstract_key]
print(f"{new_key} in layer {layer_num}")
```
remove print
```python
    "model.norm.weight": "norm.weight",
    "lm_head.weight": "output.weight",
}
to_hf_map = {v: k for k, v in from_hf_map.items()}
```
what happens to `"model.layers.{}.self_attn.rotary_emb.inv_freq": None`?
we can do this in the to_hf method, since it's small, similar to https://github.com/pytorch/torchtune/blob/main/torchtune/models/llama4/_convert_weights.py#L223
> what happens to `"model.layers.{}.self_attn.rotary_emb.inv_freq": None`?

This `None` mapping is just for reference but effectively does nothing. The RoPE weights in torchtitan get dropped when mapping to HuggingFace, due to the differences in RoPE implementation.
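For illustration, one way the `from_hf` path can handle such `None` entries, reusing the regex-based key abstraction shown in the diff above (a sketch under those assumptions, not the exact merged implementation):

```python
import re

def from_hf(self, hf_state_dict: dict) -> dict:
    """Rename HF-format keys to torchtitan names; None mappings are dropped."""
    state_dict = {}
    for key, value in hf_state_dict.items():
        if "layers" in key:
            # Replace the first layer index with "{}" so it can be looked up
            # in the abstract from_hf_map, then re-insert the index afterwards.
            abstract_key = re.sub(r"(\d+)", "{}", key, count=1)
            layer_num = re.search(r"\d+", key).group(0)
            new_key = self.from_hf_map[abstract_key]
            if new_key is None:
                # e.g. "model.layers.{}.self_attn.rotary_emb.inv_freq":
                # torchtitan recomputes RoPE frequencies, so this buffer is dropped.
                continue
            new_key = new_key.format(layer_num)
        else:
            new_key = self.from_hf_map[key]
            if new_key is None:
                continue
        state_dict[new_key] = value
    return state_dict
```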
```python
class Llama3StateDictAdapter(StateDictAdapter):
    from_hf_map = {
```
We can set this as a constant in this file, similar to https://github.com/pytorch/torchtune/blob/main/torchtune/models/llama4/_convert_weights.py#L63
instead of a class variable.
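A small sketch of that layout (only a couple of the entries from the diff are repeated here; the constant names are illustrative):

```python
# Module-level constants instead of class attributes on Llama3StateDictAdapter.
_FROM_HF_MAP = {
    # ... remaining entries from the mapping above ...
    "model.norm.weight": "norm.weight",
    "lm_head.weight": "output.weight",
}
_TO_HF_MAP = {v: k for k, v in _FROM_HF_MAP.items()}
```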
scripts/convert_from_hf.py (outdated)
```python
        "./assets/tokenizer/Llama-3.1-8B",
    ]
)
tokenizer = build_hf_tokenizer(config)
```
you no longer need to do this after rebasing
docs/checkpoint.md (outdated)
```diff
@@ -1,19 +1,9 @@
-## How to convert a Llama 3 checkpoint for use in torchtitan
+# How to use checkpoints in TorchTitan
```
let's use `torchtitan` (not "TorchTitan") within the repo
docs/checkpoint.md (outdated)
> ### PyTorch Meta Llama
>
> If you want to continue training from an existing model checkpoint, the checkpoint must be in the DCP format expected by the checkpoint manager.
> An example script for converting the original Llama3 checkpoints into the expected DCP format can be found in `scripts/convert_llama_to_dcp.py`.
>
> The script expects a path to the original checkpoint files, and a path to an output directory:
> ```bash
> python -m scripts.convert_from_llama <input_dir> <output_dir>
> ```
let's move this to the bottom of this section -- from now on the recommended way would be to/from HF weights
docs/checkpoint.md (outdated)
> `python -m torch.distributed.checkpoint.format_utils dcp_to_torch torchtitan/outputs/checkpoint/step-1000 checkpoint.pt`
>
> ### PyTorch Meta Llama
>
> If you want to continue training from an existing model checkpoint, the checkpoint must be in the DCP format expected by the checkpoint manager.
this is not true anymore
docs/checkpoint.md (outdated)
> In some cases, you may want to partially load from a previous-trained checkpoint and modify certain settings, such as the number of GPUs or the current step. To achieve this, you can use the `exclude_from_loading` parameter to specify which keys should be excluded from loading.
> This parameter takes a list of string that should be excluded from loading.
>
> ### Torchtune
I think we can remove this section in general, as torchtune also supports HF checkpoints.
> 2. SAVE THE FINAL CHECKPOINT\
>    Once the above have been set, the final checkpoint at the end of the training step will consist of model only with the desired export dtype. However, if the final step has not been reached yet, full checkpoints will still be saved so that training can be resumed.
>
> 3. CONVERT SHARDED CHECKPOINTS TO A SINGLE FILE\
we can keep this part.
Basically in this section you'd have
- HF conversion (conversion scripts or during training save/load)
- the instruction here to convert DCP to torch (section title should be torch instead of torchtune)
- alternative way of converting llama to DCP, which is to be deprecated
torchtitan/train.py (outdated)
```diff
  checkpoint_config=job_config.checkpoint,
- sd_adapter=self.train_spec.state_dict_adapter,
+ sd_adapter=(
+     self.train_spec.state_dict_adapter(self.model_args)
```
No need to make `model_args` a class variable.

Suggested change:
```diff
- self.train_spec.state_dict_adapter(self.model_args)
+ self.train_spec.state_dict_adapter(model_args)
```

Could you revert all the changes in this file, except for this section?
```python
config_json = {
    "architectures": ["LlamaForCausalLM"],
    "hidde": "silu",
```
Suggested change:
```diff
- "hidde": "silu",
+ "hidden_act": "silu",
```
```python
config_json = {
    "architectures": ["LlamaForCausalLM"],
    "hidde": "silu",
    "hidden_size": self.model_args.dim,
    "intermediate_size": ffn_hidden_dim,
    "model_type": "llama",
    "num_attention_heads": self.model_args.n_heads,
    "num_hidden_layers": self.model_args.n_layers,
    "num_key_value_heads": self.model_args.n_kv_heads,
    "vocab_size": self.model_args.vocab_size,
}
```
This seems to be only a partial set of https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/config.json
I think the value of such a dict is little compared with the complexity. I suggest we remove this from the output and only keep the hf_state_dict.
torchtitan/components/checkpoint.py (outdated)
```python
if to_hf:
    config_path = Path(checkpoint_id) / "config.json"
    with config_path.open("w") as f:
        json.dump(config_json, f, indent=4)
```
In SPMD, multiple processes would run this, I think; not sure what will happen.
But anyway, let's not produce the config.json, for the reason I mentioned in the other file.
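If the file were kept, a minimal sketch of restricting the write to one process (assuming the default process group; `to_hf`, `checkpoint_id`, and `config_json` are as in the snippet above):

```python
import json
from pathlib import Path

import torch.distributed as dist

# Only rank 0 writes the shared config.json, so concurrent SPMD processes
# don't race on the same file.
is_rank_zero = not dist.is_initialized() or dist.get_rank() == 0
if to_hf and is_rank_zero:
    config_path = Path(checkpoint_id) / "config.json"
    with config_path.open("w") as f:
        json.dump(config_json, f, indent=4)
```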
docs/checkpoint.md (outdated)
```diff
@@ -1,19 +1,9 @@
-## How to convert a Llama 3 checkpoint for use in torchtitan
+# How to use checkpoints in TorchTitan
```
Suggested change:
```diff
- # How to use checkpoints in TorchTitan
+ # How to use checkpointing in `torchtitan`
```
tianyu-l left a comment:
Have you tested the correctness of save/load during training? It can be tricky because the values would be DTensors instead of full plain tensors.
Also, could you test whether save/load during training is correct when TP / PP is used?
PP is tricky because some modules may not exist on certain PP ranks.
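A hedged sketch of the kind of check this implies, assuming a recent PyTorch where `DTensor` is exposed under `torch.distributed.tensor` and both state dicts are available on every participating rank (under PP, compare only the keys a rank actually owns):

```python
import torch
from torch.distributed.tensor import DTensor


def assert_state_dicts_match(before: dict, after: dict) -> None:
    """Compare a model state dict before save and after load, shard-aware."""
    assert before.keys() == after.keys(), "key sets differ after reload"
    for name, a in before.items():
        b = after[name]
        # DTensor shards must be gathered into full tensors before comparing.
        if isinstance(a, DTensor):
            a = a.full_tensor()
        if isinstance(b, DTensor):
            b = b.full_tensor()
        torch.testing.assert_close(a, b, msg=f"mismatch in {name}")
```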
…onfig and configmanager. Additionally, it changes the state dict adapter from a static class to an instance-type class that consumes the model args in its init, to eliminate guesswork during state dict conversion. It also adds support for building the config.json when converting to HF, since this file is required by HF for important tasks such as inference. It also moves model_args to a separate file from train_spec to solve a circular import with state_dict_adapter.
…reased overhead and complexity. Updates the checkpoint.md.
tianyu-l left a comment:
LGTM! Great work!
This PR adds model state dict conversion between torchtitan (TT) and HuggingFace (HF).
It supports conversion both to and from HuggingFace and, importantly, performs a permutation on the q and k attention matrices to address the differences in RoPE implementation between native Llama and HuggingFace. Thanks to @rlrs and @vwxyzjn for finding and helping to fix this issue: #335, #1291 (comment)
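For reference, a sketch of the kind of permutation involved, following the convention in Hugging Face's Llama conversion script (the helper name and call sites are illustrative, not the exact code in this PR):

```python
import torch


def permute_for_hf(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    """Reorder q/k projection rows so HF's half-split RoPE (rotate_half)
    matches the interleaved RoPE used by the native Llama implementation."""
    return (
        w.view(n_heads, dim1 // n_heads // 2, 2, dim2)
        .transpose(1, 2)
        .reshape(dim1, dim2)
    )


# Hypothetical usage when exporting torchtitan weights to HF format:
#   wq_hf = permute_for_hf(wq, n_heads, n_heads * head_dim, dim)
#   wk_hf = permute_for_hf(wk, n_kv_heads, n_kv_heads * head_dim, dim)
```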
### Testing
I tested the correctness of the model conversion using two methods, greedy decoding and KL divergence, for a thorough comparison.
To test the from_hf script, I downloaded a model from the HF hub, converted it using the script, and ran forward passes in torchtitan. To test the to_hf script, I obtained the original Llama weights and used the verified llama->dcp script, then used the convert_to_hf script to convert those weights to a safetensors checkpoint.
For KL divergence, I tested both to_hf and from_hf against the baseline HF model, and compared these to the to_hf and from_hf weights produced without the permutation.

| permuted wq and wk | kl_div (hf->tt) | kl_div (tt->hf) |
| --- | --- | --- |
| ✅ | -3.8356e-15 | -1.4431e-14 |
| ❌ | 3.0658e-06 | 9.6463e-06 |

The KL divergence is many orders of magnitude higher when the permutation is skipped, showing that those probability distributions do not accurately reproduce the baseline HF model's. However, because only a small fraction of the weights needs to be permuted, the divergence in the incorrect case is still not very large and can be deceiving if used as the only evaluation metric. Therefore we also use greedy decoding with long generated sequences, scoring the exact-match ratio of generated tokens, to sanity check the results.
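A hedged sketch of the KL comparison itself (model loading and tokenization omitted; assumes both models produce logits over the same vocabulary for identical input IDs):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def kl_to_baseline(baseline_logits: torch.Tensor, converted_logits: torch.Tensor) -> float:
    """KL(baseline || converted), averaged over the batch dimension."""
    log_p = F.log_softmax(baseline_logits, dim=-1)   # baseline HF model
    log_q = F.log_softmax(converted_logits, dim=-1)  # converted model
    # F.kl_div expects the input in log space; with log_target=True the target
    # is in log space as well, and it computes KL(target || input).
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean").item()
```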
### Usage
The model conversion can be done in two ways. The first, direct way is to use the new convert_from_hf.py or convert_to_hf.py script, which requires loading the entire model weights into CPU memory. The second way is to use the training config options to load/save in HF format during training.
This should bring us one step closer to #1210.