3 changes: 2 additions & 1 deletion README.md
@@ -20,6 +20,7 @@ To use the latest features of `torchtitan`, we recommend using the most recent P


## Latest News
+- [2025/07] We published [instructions](/torchtitan/models/README.md) on how to add a model to `torchtitan`.
- [2025/07] We released `torchtitan` [v0.1.0](https://github.com/pytorch/torchtitan/releases), and also set up nightly builds.
- [2025/04] Our paper was accepted by [ICLR 2025](https://iclr.cc/virtual/2025/poster/29620).
- [2025/04] [Llama 4](torchtitan/experiments/llama4/) initial support is available as an experiment.
@@ -37,7 +38,7 @@ To use the latest features of `torchtitan`, we recommend using the most recent P

Our mission is to accelerate innovation in the field of generative AI by empowering researchers and developers to explore new modeling architectures and infrastructure techniques.

-The guiding principles when building `torchtitan`
+The Guiding Principles when building `torchtitan`
* Designed to be easy to understand, use and extend for different training purposes.
* Minimal changes to the model code when applying multi-dimensional parallelism.
* Bias towards a clean, minimal codebase while providing basic reusable / swappable components.
4 changes: 2 additions & 2 deletions torchtitan/experiments/README.md
@@ -4,8 +4,8 @@ To accelerate contributions to and innovations around `torchtitan`, we are addin

We provide this `experiments/` folder to host experiments that add significant value to `torchtitan`, with the following principles. We refer to the part of `torchtitan` outside `experiments` as `core`.
1. Each subfolder in `experiments` will be an experiment, with a clear theme, which can be flexible, such as
-  - a new model, or preferably a new model architecture, with its training infrastructure including parallelization functions;
-  - an enhancement or addition to the existing infrastructure of `torchtitan`.
+  - A new model, or preferably a new model architecture, with its training infrastructure including parallelization functions. Please see the [instructions](/torchtitan/models/README.md) on how to contribute a new model.
+  - An enhancement or addition to the existing infrastructure of `torchtitan`.
2. It is the contributors' responsibility to justify the value of an experiment. The `torchtitan` team will review proposals on a case-by-case basis. As part of the contribution, contributors should provide documentation that clearly showcases the motivation and innovation of an experiment, including reports on performance and loss convergence.
3. An experiment should reuse existing `torchtitan` code as much as possible, such as modules in [`components/`](../components/) (via a new [`TrainSpec`](../protocols/train_spec.py); see the sketch below) and [`train.py`](../train.py). For a list of extension points we provide, please refer to [docs/extension.md](../../docs/extension.md).
- The extension points are subject to change. We kindly request that contributors provide feedback if they encounter issues reusing any components, rather than simply using a copy-and-paste approach.
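
As a concrete but hedged illustration of item 3, here is a minimal sketch of an experiment's `__init__.py` registering a new `TrainSpec` that reuses core builders. The experiment name, model classes, builder import paths, and exact `TrainSpec` field names below are assumptions to be verified against [`train_spec.py`](../protocols/train_spec.py) and [`components/`](../components/):

```python
# Illustrative sketch only: TrainSpec fields and builder import paths are
# assumptions -- check torchtitan/protocols/train_spec.py and components/.
from torchtitan.components.loss import build_cross_entropy_loss
from torchtitan.components.lr_scheduler import build_lr_schedulers
from torchtitan.components.optimizer import build_optimizers
from torchtitan.components.tokenizer import build_hf_tokenizer  # path is an assumption
from torchtitan.datasets.hf_datasets import build_hf_dataloader  # path is an assumption
from torchtitan.protocols.train_spec import TrainSpec, register_train_spec

from .infra.parallelize import parallelize_my_model  # hypothetical, defined by the experiment
from .model.args import MyModelArgs                  # hypothetical, defined by the experiment
from .model.model import MyModel                     # hypothetical, defined by the experiment

my_model_configs = {
    "debug_model": MyModelArgs(dim=256, n_layers=4),  # keep a small debug variant
}

register_train_spec(
    TrainSpec(
        name="my_experiment",                        # hypothetical experiment name
        cls=MyModel,
        config=my_model_configs,
        parallelize_fn=parallelize_my_model,         # new infra code for this model
        pipelining_fn=None,                          # PP can be skipped for small models
        build_optimizers_fn=build_optimizers,        # reused from core
        build_lr_schedulers_fn=build_lr_schedulers,  # reused from core
        build_dataloader_fn=build_hf_dataloader,     # reused from core
        build_tokenizer_fn=build_hf_tokenizer,       # reused from core
        build_loss_fn=build_cross_entropy_loss,      # reused from core
    )
)
```

Everything except the three hypothetical experiment-local imports comes from existing `torchtitan` code, which is exactly the kind of reuse this guideline asks for.
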
78 changes: 78 additions & 0 deletions torchtitan/models/README.md
@@ -0,0 +1,78 @@
This note outlines the process of adding a new model to the `torchtitan` repo. In most cases, new models should first be added under the `torchtitan/experiments` folder. For contribution criteria, please see the [Contributing Guidelines](/torchtitan/experiments/README.md). In general, please adhere to the [Guiding Principles](/README.md#overview) of `torchtitan`.

For offline explorations, we recommend the same steps, unless otherwise noted.

## Adding the model

Please refer to the [Llama 3 folder](./llama3) as an example.

The folder should be organized as follows:
- `model` folder: a self-contained folder with the model definition and args (see the sketch after this list)
  - `args.py`
    - Inherit [`BaseModelArgs`](/torchtitan/protocols/model.py) and implement the interfaces.

> **Contributor:** Here the URL is not pointing to the correct place; it should be `train_spec.py`.
>
> **Contributor (PR author):** @wesleytruong is changing it in #1441

    - `get_nparams_and_flops()` is used to report the model size and estimate FLOPs, which are needed to compute training throughput.
    - `update_from_config()` updates the model args from training configs. To extend training configs, see the bullet point below on `job_config.py`.
  - `model.py`
    - NOTE: Please adhere to the guiding principles and write single-device model code.
    - NOTE: We prioritize readability over flexibility. The preferred style is to not share modules among different models, except for the most common and complicated ones.
    - Inherit [`ModelProtocol`](/torchtitan/protocols/model.py) and implement the interfaces.

> **Contributor:** Same here, the URL points to a wrong place.
>
> **Contributor (PR author):** Same answer as above.

      - `__init__()` consumes a `ModelArgs` input to build the model.
      - `init_weights()` is used to properly initialize the parameters and buffers in the model. Please define it in a recursive way so that every submodule has its own `init_weights()`.
    - Add additional files to reduce the complexity of `model.py` if it grows too large or complex, e.g. `moe.py` to host the `MoE`, `Router`, and `GroupedExperts` modules.
  - `state_dict_adapter.py`
    - Inherit [`StateDictAdapter`](/torchtitan/protocols/state_dict_adapter.py) to implement state dict mappings between the `torchtitan` model definition and other model definitions (e.g. from HuggingFace, so that we can save / load model checkpoints in HF formats).
    - There are multiple ways such adapters could be used:
      - Checkpoint conversion scripts in `scripts/checkpoint_conversion/` will use them to adapt state dicts containing non-sharded `torch.Tensor` on CPU.
      - During training, [`CheckpointManager`](/torchtitan/components/checkpoint.py) will use them to adapt state dicts containing (potentially sharded) `DTensor` on GPUs to save / load checkpoints in HF format.
      - In post-training, `to_hf()` helps convert a `torchtitan` model to an HF model, which can be used for inference by other frameworks.
    - This is optional for offline exploration.
- `infra` folder: contains the functions used to parallelize the model using PyTorch-native techniques
  - `parallelize.py`
    - Apply training techniques in the following order:
      - TP (and EP if the model has an MoE architecture)
      - activation checkpointing
      - `torch.compile`
      - FSDP / HSDP
    - NOTE: Currently, CP support for language models is enabled via a context manager in `torchtitan/train.py`. Ideally no extra work is needed to enable CP.
  - `pipeline.py` (optional if the model is small)
    - Apply PP.
  - Include other util files if necessary.
- `__init__.py`
  - A dictionary of the actual model configurations, of type `dict[str, ModelArgs]`.
  - Call `register_train_spec` to specify a [`TrainSpec`](/torchtitan/protocols/train_spec.py), consisting of
    - model name, model class, model args
    - parallelizing function, pipelining function
    - builder functions for optimizer, lr scheduler, data loader, tokenizer, and loss function
      - More often than not, existing components can be reused.
      - Adding new datasets requires the `torchtitan` team’s review and legal approval.
      - Try to have minimal dependency on external libraries, if any.
    - state dict adapter
  - Read [more](/docs/extension.md#trainspec) on `TrainSpec`.
- `README.md`
  - Include [instructions](/README.md#downloading-a-tokenizer) to download tokenizers / encoders.
  - Include instructions to download model checkpoints for continued pretraining or post-training.
  - Update the current status of development, including supported and upcoming features.
  - This is optional for offline exploration.
- `job_config.py` (if necessary)
  - Sometimes a new model needs to access additional configs, to be consumed by various training components. Read the [guidance](/docs/extension.md#train-script) on extending `JobConfig`.
- `train.py` (only if absolutely necessary)
  - Sometimes `torchtitan/train.py` may not be enough to run the model. There is a [tradeoff](/docs/extension.md#train-script) between extending the existing script vs. writing a new one.
  - Even if a new one needs to be added, it should reuse `torchtitan/train.py` as much as possible. See `torchtitan/experiments/flux/train.py` as an example.
- `train_configs` folder
  - There should be one `.toml` file for each model variant (e.g. Llama 3.1 8B / 70B / 405B), as well as a `debug_model.toml`.
  - They should be verified with real training jobs, in terms of optimized throughput and loss convergence.
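
To make the layout above concrete, below is a heavily abridged, hypothetical sketch of the `model` folder pieces (`args.py`, `model.py`, and `state_dict_adapter.py`) collapsed into one listing. All names and shapes are illustrative, and method signatures are abbreviated; the authoritative interfaces are [`BaseModelArgs` / `ModelProtocol`](/torchtitan/protocols/model.py) and [`StateDictAdapter`](/torchtitan/protocols/state_dict_adapter.py):

```python
# An illustrative, abridged sketch -- not a real torchtitan model. Exact
# abstract-method signatures should be read from torchtitan/protocols/.
from dataclasses import dataclass
from typing import Any

import torch
import torch.nn as nn

from torchtitan.protocols.model import BaseModelArgs, ModelProtocol
from torchtitan.protocols.state_dict_adapter import StateDictAdapter


# --- model/args.py ----------------------------------------------------------
@dataclass
class MiniTransformerArgs(BaseModelArgs):
    dim: int = 256
    n_layers: int = 4
    n_heads: int = 4
    vocab_size: int = 32_000
    max_seq_len: int = 2048

    def update_from_config(self, job_config, **kwargs) -> None:
        # Pull whatever the model needs from the training configs.
        self.max_seq_len = job_config.training.seq_len

    def get_nparams_and_flops(self, model: nn.Module, seq_len: int) -> tuple[int, float]:
        nparams = sum(p.numel() for p in model.parameters())
        # 6 * params * tokens is a common dense-transformer FLOPs estimate.
        return nparams, 6 * nparams * seq_len


# --- model/model.py (single-device code only; parallelism lives in infra/) ---
class TransformerBlock(nn.Module):
    def __init__(self, args: MiniTransformerArgs):
        super().__init__()
        self.attn = nn.MultiheadAttention(args.dim, args.n_heads, batch_first=True)
        self.mlp = nn.Linear(args.dim, args.dim)

    def init_weights(self) -> None:
        # Every submodule owns its own init_weights().
        nn.init.xavier_uniform_(self.mlp.weight)
        nn.init.zeros_(self.mlp.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(h, h, h)
        return h + self.mlp(attn_out)


class MiniTransformer(nn.Module, ModelProtocol):
    def __init__(self, args: MiniTransformerArgs):
        super().__init__()
        self.args = args
        self.tok_embeddings = nn.Embedding(args.vocab_size, args.dim)
        self.layers = nn.ModuleList(TransformerBlock(args) for _ in range(args.n_layers))
        self.output = nn.Linear(args.dim, args.vocab_size, bias=False)

    def init_weights(self) -> None:
        # Recursive: delegate to each submodule's own init_weights().
        nn.init.normal_(self.tok_embeddings.weight, std=0.02)
        for layer in self.layers:
            layer.init_weights()
        nn.init.normal_(self.output.weight, std=0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.tok_embeddings(tokens)
        for layer in self.layers:
            h = layer(h)
        return self.output(h)


# --- model/state_dict_adapter.py ---------------------------------------------
class MiniTransformerStateDictAdapter(StateDictAdapter):
    # torchtitan name -> HF name; a real mapping is model-specific and may also
    # need to reshape or re-split tensors, not just rename keys.
    tt_to_hf = {
        "tok_embeddings.weight": "model.embed_tokens.weight",
        "output.weight": "lm_head.weight",
    }

    def __init__(self, model_args: MiniTransformerArgs):
        self.model_args = model_args

    def to_hf(self, state_dict: dict[str, Any]) -> dict[str, Any]:
        return {self.tt_to_hf.get(k, k): v for k, v in state_dict.items()}

    def from_hf(self, hf_state_dict: dict[str, Any]) -> dict[str, Any]:
        hf_to_tt = {v: k for k, v in self.tt_to_hf.items()}
        return {hf_to_tt.get(k, k): v for k, v in hf_state_dict.items()}
```
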

## Testing and Benchmarking
- Numerics testing
  - One way of doing this E2E is to load the same model checkpoint into the `torchtitan` model and the HF model, and compare the model outputs given the same input (see the sketch after this list). This assumes

> **Contributor:** I'd add a brief explanation of the `to_hf` and `from_hf` functions. Their implementations are not super intuitive, and readers could benefit from seeing how these two functions are used to load checkpoints.
>
> **Contributor (PR author):** Added some explanation. This is still evolving; we will revisit once we have a sound solution.

    - the HF implementation is correct.
    - Since the check exercises the `torchtitan` model and the corresponding state dict adapter together, matching outputs indicate the correctness of both.
- Loss convergence
  - If there is a verified baseline, compare the loss curves with the baseline.
  - For comparisons within `torchtitan`, see the [guidelines](/docs/converging.md).
- Performance benchmarking
  - Please refer to the [benchmarks](/benchmarks/) folder.
- CI tests
  - Include unit tests and integration tests; see [examples](/tests/).
  - If the model folder is under the `experiments` folder, put the tests under the model folder. Otherwise, put the tests under the `/tests` folder.
  - Add the necessary GitHub [workflows](/.github/workflows/).
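
As a hedged sketch of the E2E numerics check above, reusing the hypothetical `MiniTransformer` classes from the earlier sketch (the HF repo name is a placeholder, not a real checkpoint):

```python
# Sketch of the E2E numerics check. The MiniTransformer* classes come from the
# hypothetical sketch in "Adding the model"; the HF repo name is a placeholder.
import torch
from transformers import AutoModelForCausalLM

hf_model = AutoModelForCausalLM.from_pretrained("my-org/mini-transformer")
hf_model.eval()

args = MiniTransformerArgs()
tt_model = MiniTransformer(args)
# from_hf() renames (and, for real models, reshapes) HF tensors into the
# torchtitan layout; to_hf() is the inverse, used when saving HF checkpoints.
adapter = MiniTransformerStateDictAdapter(args)
tt_model.load_state_dict(adapter.from_hf(hf_model.state_dict()))
tt_model.eval()

tokens = torch.randint(0, args.vocab_size, (1, 128))
with torch.no_grad():
    hf_logits = hf_model(tokens).logits
    tt_logits = tt_model(tokens)

# The check exercises the model and the adapter together: matching logits
# (up to numerical tolerance) indicate both are correct, assuming the HF
# reference implementation is.
torch.testing.assert_close(tt_logits, hf_logits, rtol=1e-4, atol=1e-4)
```
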
2 changes: 1 addition & 1 deletion torchtitan/protocols/state_dict_adapter.py
@@ -7,7 +7,7 @@
from abc import ABC, abstractmethod
from typing import Any

-from torchtitan.protocols import BaseModelArgs
+from .model import BaseModelArgs


class StateDictAdapter(ABC):
4 changes: 3 additions & 1 deletion torchtitan/protocols/train_spec.py
@@ -19,7 +19,9 @@
from torchtitan.components.tokenizer import BaseTokenizer
from torchtitan.components.validate import BaseValidator
from torchtitan.config import LRScheduler
-from torchtitan.protocols import BaseModelArgs, ModelProtocol, StateDictAdapter
+
+from .model import BaseModelArgs, ModelProtocol
+from .state_dict_adapter import StateDictAdapter


ParallelizeFunction: TypeAlias = Callable[..., nn.Module]