# How to use checkpointing in `torchtitan`
You may want to enable checkpointing in `torchtitan` for better fault tolerance during training, or to make it easier to import and export weights between `torchtitan` and other libraries. `torchtitan` offers varying degrees of support for other checkpoint formats, which are described further below.
## A general guide to using checkpoints during training
1. ENABLE CHECKPOINTING
In your `torchtitan` training config, ensure that `enable_checkpoint` is set to `true`.
```
[checkpoint]
enable_checkpoint = true
# ... other checkpoint options ...
last_save_model_only = true
export_dtype = "bfloat16"
```
A more exhaustive and up-to-date list of checkpoint config options can be found in `torchtitan/config/job_config.py`.
## Creating a seed checkpoint
Sometimes one needs to create a seed checkpoint to initialize a model from step 0.
For example, it is hard, if not impossible, for meta initialization on multiple devices to reproduce the initialization on a single device.
A seed checkpoint initializes the model on a single CPU, and can then be loaded by another job on an arbitrary number of GPUs via DCP resharding.
To create a seed checkpoint, use the same model config as you use for training.
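For example, a seed checkpoint can be created with a single-device run. The sketch below assumes `torchtitan`'s `run_train.sh` launcher and the `--checkpoint.create_seed_checkpoint` option, with every parallelism degree forced to 1 (exact flag names may differ between versions):

```
# Run on one device so the model is initialized without sharding.
NGPU=1 CONFIG_FILE=<path_to_model_config> ./run_train.sh \
  --checkpoint.enable_checkpoint \
  --checkpoint.create_seed_checkpoint \
  --parallelism.data_parallel_replicate_degree 1 \
  --parallelism.data_parallel_shard_degree 1 \
  --parallelism.tensor_parallel_degree 1 \
  --parallelism.pipeline_parallel_degree 1
```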
### HuggingFace

`torchtitan` offers two ways to work with Hugging Face models: either by directly saving and loading a Hugging Face checkpoint during training, or by using an example conversion script to reformat the model weights directly on CPU.
1. You can directly save Hugging Face model weights during training by using the `--checkpoint.last_save_in_safetensors_format` and `--checkpoint.last_save_model_only` options together. To directly load a `torchtitan` training session from a Hugging Face safetensors file, enable `--checkpoint.initial_load_model_only` and set `--checkpoint.initial_load_path` to the directory containing the Hugging Face checkpoint (see the config sketch after this list).
2. To directly reformat the weights without the need to run a training loop, run the corresponding conversion script. The naming scheme is `torchtitan`-centric, e.g. `convert_from_hf` means converting Hugging Face weights into the `torchtitan` format.
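As a concrete sketch of the first option, the command-line flags above map to entries in the `[checkpoint]` section of the training config (the `initial_load_path` value here is a placeholder):

```
[checkpoint]
enable_checkpoint = true
# Save the final checkpoint as model weights only, in safetensors format.
last_save_model_only = true
last_save_in_safetensors_format = true
# Initialize the run from a directory containing a Hugging Face checkpoint.
initial_load_model_only = true
initial_load_path = "./hf_checkpoint/"
```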
The following steps walk you through converting a checkpoint from `torchtitan` so that it can be loaded in PyTorch's standard `.pt` format.
1. CHECKPOINT CONFIGURATION
```
[checkpoint]
enable_checkpoint = true
# ... other checkpoint options ...
last_save_model_only = true
export_dtype = "bfloat16"
```
Once the above have been set, the final checkpoint at the end of the last training step will contain the model weights only, stored in the desired export dtype. However, if the final step has not yet been reached, full checkpoints will still be saved so that training can be resumed.
2. CONVERT SHARDED CHECKPOINTS TO A SINGLE FILE\
Finally, once you have obtained the last checkpoint, you can use the following command to convert the sharded checkpoints into a single `.pt` file:
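PyTorch's distributed checkpoint (DCP) utilities can perform this conversion directly. The sketch below assumes the last checkpoint landed in `./outputs/checkpoint/step-1000`; substitute your own checkpoint folder and output file name:

```
# Convert a sharded DCP checkpoint directory into a single torch.save-style file.
python -m torch.distributed.checkpoint.format_utils dcp_to_torch ./outputs/checkpoint/step-1000 checkpoint.pt
```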
That's it. You have now successfully converted a sharded `torchtitan` checkpoint for use with standard PyTorch formats.
An example script for converting the original Llama3 checkpoints into DCP format to be used with `torchtitan` can be found in `scripts/convert_from_llama.py`.
The script expects a path to the original checkpoint files, and a path to an output directory:
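For example, a hypothetical invocation with placeholder paths, passing the two arguments in the order described above:

```
# Convert the original Llama 3 weights into DCP format for use with torchtitan.
python scripts/convert_from_llama.py <path_to_original_llama_checkpoint> <path_to_output_dir>
```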