Commit 63c3fc5

Removes support for Hugging Face config.json due to the unnecessary overhead and complexity it added
Updates docs/checkpoint.md
1 parent c844a20 commit 63c3fc5

9 files changed: 65 additions & 95 deletions

docs/checkpoint.md

Lines changed: 33 additions & 35 deletions
@@ -1,11 +1,11 @@
-# How to use checkpoints in TorchTitan
+# How to use checkpointing in `torchtitan`

-You may want to enable checkpointing in TorchTitan for better fault tolerance during training, or to enable easier importing and exporting of weights between TorchTitan and other libraries. TorchTitan offers varying degrees of support for other checkpoint formats which are listed further below.
+You may want to enable checkpointing in `torchtitan` for better fault tolerance during training, or to enable easier importing and exporting of weights between `torchtitan` and other libraries. `torchtitan` offers varying degrees of support for other checkpoint formats which are listed further below.

 ## A general guide to use checkpoints during training

 1. ENABLE CHECKPOINTING
-In your torchtitan training config, ensure that `enable_checkpoint` is set to True.
+In your `torchtitan` training config, ensure that `enable_checkpoint` is set to True.
 ```
 [checkpoint]
 enable_checkpoint = true
@@ -50,24 +50,38 @@ last_save_model_only = true
 export_dtype = "bfloat16"
 ```

-A more exhaustive and up-to-date list of checkpoint config options can be found in torchtitan/config/job_config.py
+A more exhaustive and up-to-date list of checkpoint config options can be found in `torchtitan/config/job_config.py`
+
+## Creating a seed checkpoint
+Sometimes one needs to create a seed checkpoint to initialize a model from step 0.
+E.g. it is hard, if not impossible, for meta initialization on multiple devices to reproduce the initialization on a single device.
+A seed checkpoint does initialization of the model on a single CPU, and can be loaded from another job on an arbitrary number of GPUs via DCP resharding.
+
+To create a seed checkpoint, use the same model config as you use for training.
+e.g.
+```bash
+NGPU=1 CONFIG=<path_to_model_config> ./run_train.sh --checkpoint.enable_checkpoint --checkpoint.create_seed_checkpoint --parallelism.data_parallel_replicate_degree 1 --parallelism.data_parallel_shard_degree 1 --parallelism.tensor_parallel_degree 1 --parallelism.pipeline_parallel_degree 1 --parallelism.context_parallel_degree 1 --parallelism.expert_parallel_degree 1
+```

 ## Conversion support

-### PyTorch Meta Llama
+### HuggingFace
+`torchtitan` offers two ways to work with Hugging Face models: either by directly saving and loading a Hugging Face checkpoint during training, or by using an example conversion script to directly reformat the model weights on cpu.

-If you want to continue training from an existing model checkpoint, the checkpoint must be in the DCP format expected by the checkpoint manager.
-An example script for converting the original Llama3 checkpoints into the expected DCP format can be found in `scripts/convert_llama_to_dcp.py`.
+1. You can directly save huggingface model weights during training by using the `--checkpoint.last_save_in_safetensors_format` and `--checkpoint.last_save_model_only` options together. To directly load a `torchtitan` training session from a huggingface safetensors file, simply enable `--checkpoint.initial_load_model_only` and set `--checkpoint.initial_load_path` to the directory containing the huggingface checkpoint.
+
+2. To directly reformat the weights without the need to run a training loop, run the corresponding conversion script. The naming scheme is `torchtitan`-centric, e.g. convert_from_hf means convert hf->tt.

-The script expects a path to the original checkpoint files, and a path to an output directory:
 ```bash
-python -m scripts.convert_from_llama <input_dir> <output_dir>
+python ./scripts/checkpoint_conversion/convert_from_hf.py <input_dir> <output_dir> --model_name <model_name> --model_flavor <model_flavor>
+python ./scripts/checkpoint_conversion/convert_to_hf.py <input_dir> <output_dir> --model_name <model_name> --model_flavor <model_flavor>
+# e.g.
+python ./scripts/convert_from_hf.py ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/8cde5ca8380496c9a6cc7ef3a8b46a0372a1d920/ ./outputs/checkpoint/step-0 --model_name llama3 --model_flavor 8B
 ```

+### Torch

-### Torchtune
-
-This guide will walk you through the steps required to convert a checkpoint from torchtitan so that it can be loaded into torchtune.
+This guide will walk you through the steps required to convert a checkpoint from `torchtitan` so that it can be loaded into pt format.

 1. CHECKPOINT CONFIGURATION
 ```
@@ -83,36 +97,20 @@ export_dtype = "bfloat16"
 Once the above have been set, the final checkpoint at the end of the training step will consist of model only with the desired export dtype. However, if the final step has not been reached yet, full checkpoints will still be saved so that training can be resumed.

 3. CONVERT SHARDED CHECKPOINTS TO A SINGLE FILE\
-Finally, once you have obtained the last checkpoint, you can use the following command to convert the sharded checkpoints to a single .pt file that can be loaded into torchtune:
+Finally, once you have obtained the last checkpoint, you can use the following command to convert the sharded checkpoints to a single .pt file.

-```
+```bash
 python -m torch.distributed.checkpoint.format_utils dcp_to_torch torchtitan/outputs/checkpoint/step-1000 checkpoint.pt
 ```


-That's it. You have now successfully converted a sharded torchtitan checkpoint for use in torchtune.
-
-### HuggingFace
-TorchTitan supports two methods now for supporting huggingface, directly saving and loading a hf checkpoint during training, or using an example conversion script to directly reformat the weights.
-
-1. You can directly save huggingface model weights during training by using the `--checkpoint.last_save_in_safetensors_format` and `--checkpoint.last_save_model_only` options together. To directly load a torchtitan training session from a huggingface safetensors file, simply enable `--checkpoint.initial_load_model_only` and set `--checkpoint.initial_load_path` to the directory containing the huggingface checkpoint.
-
-2. To directly reformat the weights without the need to run a training loop, run the corresponding conversion script. The naming scheme is torchtitan-centric, e.g. convert_from_hf means convert hf->tt.
+That's it. You have now successfully converted a sharded `torchtitan` checkpoint for use with pytorch formats.

-```
-python ./scripts/checkpoint_conversion/convert_from_hf.py <input_dir> <output_dir> --model_name <model_name> --model_flavor <model_flavor>
-python ./scripts/checkpoint_conversion/convert_to_hf.py <input_dir> <output_dir> --model_name <model_name> --model_flavor <model_flavor>
-# e.g.
-python ./scripts/convert_from_hf.py ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/8cde5ca8380496c9a6cc7ef3a8b46a0372a1d920/ ./outputs/checkpoint/step-0 --model_name llama3 --model_flavor 8B
-```
+### PyTorch Meta Llama

-### Seed Checkpoint
-Sometimes one needs to create a seed checkpoint to initialize a model from step 0.
-E.g. it is hard, if not impossible, for meta initialization on multiple devices to reproduce the initialization on a single device.
-A seed checkpoint does initialization of the model on a single CPU, and can be loaded from another job on an arbitrary number of GPUs via DCP resharding.
+An example script for converting the original Llama3 checkpoints into DCP format to be used with `torchtitan` can be found in `scripts/convert_from_llama.py`.

-To create a seed checkpoint, use the same model config as you use for training.
-e.g.
+The script expects a path to the original checkpoint files, and a path to an output directory:
 ```bash
-NGPU=1 CONFIG=<path_to_model_config> ./run_train.sh --checkpoint.enable_checkpoint --checkpoint.create_seed_checkpoint --parallelism.data_parallel_replicate_degree 1 --parallelism.data_parallel_shard_degree 1 --parallelism.tensor_parallel_degree 1 --parallelism.pipeline_parallel_degree 1 --parallelism.context_parallel_degree 1 --parallelism.expert_parallel_degree 1
+python -m scripts.convert_from_llama <input_dir> <output_dir>
 ```
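
The updated "Torch" section above stops at producing a single `checkpoint.pt`. The sketch below is a hypothetical illustration, not part of this commit, of reading that file back with plain PyTorch; the `"model"` key and the need for `weights_only=False` are assumptions about the converted file's layout rather than anything the diff specifies.

```python
import torch

# Minimal sketch: the dcp_to_torch utility writes a regular torch.save() payload of
# the full training state dict, so weights_only=False may be needed; whether model
# weights live at the top level or under a "model" key is an assumption.
state = torch.load("checkpoint.pt", map_location="cpu", weights_only=False)
model_state = state.get("model", state) if isinstance(state, dict) else state
print(f"loaded {len(model_state)} entries")
```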

scripts/checkpoint_conversion/convert_from_hf.py

Lines changed: 3 additions & 3 deletions
@@ -31,7 +31,7 @@ def convert_from_hf(input_dir, output_dir, model_name, model_flavor):
     # get state dict in tt format with allocated memory
     state_dict = model._get_state_dict()
     # convert empty state dict to hf format so that hf weights can be loaded into it
-    hf_state_dict, _ = sd_adapter.to_hf(state_dict)
+    hf_state_dict = sd_adapter.to_hf(state_dict)
     dcp.load(
         hf_state_dict,
         storage_reader=HuggingFaceStorageReader(path=input_dir),
@@ -45,9 +45,9 @@ def convert_from_hf(input_dir, output_dir, model_name, model_flavor):


 if __name__ == "__main__":
-    parser = argparse.ArgumentParser(description="Convert Llama weights to DCP format.")
+    parser = argparse.ArgumentParser(description="Convert HF checkpoint to DCP format.")
     parser.add_argument(
-        "input_dir", type=Path, help="Input directory with original Llama weights."
+        "input_dir", type=Path, help="Input directory with HF checkpoint"
     )
     parser.add_argument("output_dir", type=Path, help="Output directory for DCP.")
     parser.add_argument("--model_name", type=str, nargs="?", default="llama3")

scripts/checkpoint_conversion/convert_to_hf.py

Lines changed: 6 additions & 9 deletions
@@ -5,7 +5,6 @@
 # LICENSE file in the root directory of this source tree.

 import argparse
-import json
 from pathlib import Path

 import torch
@@ -38,7 +37,7 @@ def convert_to_hf(input_dir, output_dir, model_name, model_flavor):
     )

     # convert state dict tt->hf
-    hf_state_dict, config_json = sd_adapter.to_hf(state_dict)
+    hf_state_dict = sd_adapter.to_hf(state_dict)

     fqn_to_index_mapping = {}
     num_fqns_per_file = 30
@@ -60,17 +59,15 @@ def convert_to_hf(input_dir, output_dir, model_name, model_flavor):
         storage_writer=storage_writer,
     )

-    config_path = output_dir / "config.json"
-    with config_path.open("w") as f:
-        json.dump(config_json, f, indent=4)
-

 if __name__ == "__main__":
-    parser = argparse.ArgumentParser(description="Convert Llama weights to HF format.")
+    parser = argparse.ArgumentParser(description="Convert DCP weights to HF format.")
+    parser.add_argument(
+        "input_dir", type=Path, help="Input directory with DCP weights."
+    )
     parser.add_argument(
-        "input_dir", type=Path, help="Input directory with original Llama weights."
+        "output_dir", type=Path, help="Output directory for HF checkpoint."
     )
-    parser.add_argument("output_dir", type=Path, help="Output directory for DCP.")
     parser.add_argument("--model_name", type=str, nargs="?", default="llama3")
     parser.add_argument("--model_flavor", type=str, nargs="?", default="8B")
     args = parser.parse_args()

torchtitan/components/checkpoint.py

Lines changed: 2 additions & 9 deletions
@@ -6,15 +6,13 @@

 import enum
 import functools
-import json
 import os
 import queue
 import re
 import shutil
 import threading
 import time
 from concurrent.futures import Future
-from pathlib import Path
 from typing import Any

 import torch
@@ -359,7 +357,7 @@ def dcp_save(
             assert (
                 self.sd_adapter is not None
             ), "trying to save checkpoint in HF safetensors format, but sd_adapter is not provided."
-            state_dict, config_json = self.sd_adapter.to_hf(state_dict)
+            state_dict = self.sd_adapter.to_hf(state_dict)

             fqn_to_index_mapping = {}
             num_fqns_per_file = 30
@@ -404,11 +402,6 @@ def dcp_save(
                 checkpoint_id=checkpoint_save_id,
             )

-        if to_hf:
-            config_path = Path(checkpoint_id) / "config.json"
-            with config_path.open("w") as f:
-                json.dump(config_json, f, indent=4)
-
         if enable_garbage_collection:
             GarbageCollection.collect("GC collection invoked by checkpointer.")

@@ -432,7 +425,7 @@ def dcp_load(
             assert (
                 self.sd_adapter is not None
             ), "trying to load checkpoint in HF safetensors format, but sd_adapter is not provided."
-            hf_state_dict, _ = self.sd_adapter.to_hf(state_dict)
+            hf_state_dict = self.sd_adapter.to_hf(state_dict)

             dcp.load(
                 hf_state_dict,

torchtitan/experiments/forge/engine.py

Lines changed: 1 addition & 1 deletion
@@ -8,9 +8,9 @@
 from typing import Generator

 import torch
+from torch.distributed.elastic.multiprocessing.errors import record

 import torchtitan.protocols.train_spec as train_spec_module
-from torch.distributed.elastic.multiprocessing.errors import record
 from torchtitan.components.checkpoint import CheckpointManager
 from torchtitan.components.loss import rescale_accumulated_loss
 from torchtitan.distributed import ParallelDims, utils as dist_utils

torchtitan/models/llama3/model/state_dict_adapter.py

Lines changed: 3 additions & 25 deletions
@@ -5,7 +5,7 @@
 # LICENSE file in the root directory of this source tree.

 import re
-from typing import Any, Tuple
+from typing import Any

 from torchtitan.protocols.state_dict_adapter import StateDictAdapter

@@ -55,9 +55,7 @@ def _reverse_permute(self, w, n_heads_arg, dim1=None, dim2=None):
             .reshape(dim1, dim2)
         )

-    def to_hf(
-        self, state_dict: dict[str, Any]
-    ) -> Tuple[dict[str, Any], dict[str, Any]]:
+    def to_hf(self, state_dict: dict[str, Any]) -> dict[str, Any]:
         to_hf_map = {v: k for k, v in self.from_hf_map.items()}

         n_heads = self.model_args.n_heads
@@ -91,27 +89,7 @@ def to_hf(

             hf_state_dict[new_key] = value

-        ffn_hidden_dim = int(self.model_args.dim * 4 * 2 / 3)
-        if self.model_args.ffn_dim_multiplier:
-            ffn_hidden_dim = int(ffn_hidden_dim * self.model_args.ffn_dim_multiplier)
-        multiple_of = self.model_args.multiple_of
-        ffn_hidden_dim = multiple_of * (
-            (ffn_hidden_dim + multiple_of - 1) // multiple_of
-        ) # hacky way to get ffn_hidden_dim, follows the calculation in models.TransformerBlock and model.FeedForward
-
-        config_json = {
-            "architectures": ["LlamaForCausalLM"],
-            "hidde": "silu",
-            "hidden_size": self.model_args.dim,
-            "intermediate_size": ffn_hidden_dim,
-            "model_type": "llama",
-            "num_attention_heads": self.model_args.n_heads,
-            "num_hidden_layers": self.model_args.n_layers,
-            "num_key_value_heads": self.model_args.n_kv_heads,
-            "vocab_size": self.model_args.vocab_size,
-        }
-
-        return hf_state_dict, config_json
+        return hf_state_dict

     def from_hf(self, hf_state_dict: dict[str, Any]) -> dict[str, Any]:
         n_heads = self.model_args.n_heads

torchtitan/protocols/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -1,3 +1,9 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
 from .model import BaseModelArgs, ModelProtocol
 from .model_converter import ModelConverter, ModelConvertersContainer
 from .state_dict_adapter import StateDictAdapter

torchtitan/protocols/state_dict_adapter.py

Lines changed: 2 additions & 4 deletions
@@ -5,7 +5,7 @@
 # LICENSE file in the root directory of this source tree.

 from abc import ABC, abstractmethod
-from typing import Any, Tuple
+from typing import Any

 from torchtitan.protocols import BaseModelArgs

@@ -22,9 +22,7 @@ def __init__(self, model_args: BaseModelArgs):
         pass

     @abstractmethod
-    def to_hf(
-        self, state_dict: dict[str, Any]
-    ) -> Tuple[dict[str, Any], dict[str, Any]]:
+    def to_hf(self, state_dict: dict[str, Any]) -> dict[str, Any]:
         """Convert from native model state dict to HuggingFace format.

         Args:
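
For implementers, the slimmed-down protocol now only asks `to_hf` to return the HF-keyed state dict, with no accompanying config. The toy adapter below is a minimal sketch, not from the repo; the single key rename is purely illustrative.

```python
from typing import Any

from torchtitan.protocols.state_dict_adapter import StateDictAdapter

class ToyStateDictAdapter(StateDictAdapter):
    """Illustrative implementation of the post-commit interface."""

    def __init__(self, model_args):
        self.model_args = model_args
        # hypothetical single-entry mapping; real adapters cover the full model
        self.from_hf_map = {"model.embed_tokens.weight": "tok_embeddings.weight"}

    def to_hf(self, state_dict: dict[str, Any]) -> dict[str, Any]:
        to_hf_map = {v: k for k, v in self.from_hf_map.items()}
        return {to_hf_map.get(k, k): v for k, v in state_dict.items()}

    def from_hf(self, hf_state_dict: dict[str, Any]) -> dict[str, Any]:
        return {self.from_hf_map.get(k, k): v for k, v in hf_state_dict.items()}

adapter = ToyStateDictAdapter(model_args=None)
assert "model.embed_tokens.weight" in adapter.to_hf({"tok_embeddings.weight": 0})
```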

torchtitan/train.py

Lines changed: 9 additions & 9 deletions
@@ -142,15 +142,15 @@ def __init__(self, job_config: JobConfig):
         )

         # build model (using meta init)
-        self.model_args = self.train_spec.model_args[job_config.model.flavor]
+        model_args = self.train_spec.model_args[job_config.model.flavor]
         # set the model args from training job configs
-        self.model_args.update_from_config(job_config)
+        model_args.update_from_config(job_config)

         logger.info(
-            f"Building {self.train_spec.name} {job_config.model.flavor} with {self.model_args}"
+            f"Building {self.train_spec.name} {job_config.model.flavor} with {model_args}"
         )
         with torch.device("meta"):
-            model = self.train_spec.model_cls(self.model_args)
+            model = self.train_spec.model_cls(model_args)

         # Build the collection of model converters. No-op if `model.converters` empty
         model_converters = build_model_converters(job_config, parallel_dims)
@@ -163,15 +163,15 @@ def __init__(self, job_config: JobConfig):
             else self.train_spec.build_metrics_processor_fn
         )
         self.metrics_processor = build_metrics_processor_fn(
-            job_config, parallel_dims, self.model_args
+            job_config, parallel_dims, model_args
         )
         color = self.metrics_processor.color

         # calculate model size and flops per token
         (
             model_param_count,
             self.metrics_processor.num_flops_per_token,
-        ) = self.model_args.get_nparams_and_flops(model, job_config.training.seq_len)
+        ) = model_args.get_nparams_and_flops(model, job_config.training.seq_len)

         logger.info(
             f"{color.blue}Model {self.train_spec.name} {job_config.model.flavor} "
@@ -234,7 +234,7 @@ def __init__(self, job_config: JobConfig):
             parallel_dims,
             job_config,
             self.device,
-            self.model_args,
+            model_args,
             self.train_spec.parallelize_fn,
             self.loss_fn,
         )
@@ -303,7 +303,7 @@ def __init__(self, job_config: JobConfig):
             states={"train_state": self},
             checkpoint_config=job_config.checkpoint,
             sd_adapter=(
-                self.train_spec.state_dict_adapter(self.model_args)
+                self.train_spec.state_dict_adapter(model_args)
                 if self.train_spec.state_dict_adapter
                 else None
             ),
@@ -430,7 +430,7 @@ def forward_backward_step(
         with self.train_context(optional_context_parallel_ctx):
             assert len(model_parts) == 1
             with self.maybe_enable_amp:
-                pred = model_parts[0](inputs)
+                pred = model_parts[0](inputs, self.tokenizer.eos_id)
                 loss = self.loss_fn(pred, labels)
             # need to free to before bwd to avoid peaking memory
             del pred
