This commit updates the converter to work with recent updates to jobconfig and configmanager.
Additionally, it changes the state dict adapter from a static class to an instance class that consumes the model args in its init, eliminating guesswork during state dict conversion.
It also adds support for building config.json when converting to hf, since this file is required by hf for important tasks such as inference.
Finally, it moves model_args into a separate file from train_spec to resolve a circular import with state_dict_adapter.
You may want to enable checkpointing in TorchTitan for better fault tolerance during training, or to make it easier to import and export weights between TorchTitan and other libraries. TorchTitan offers varying degrees of support for other checkpoint formats, which are listed below.
## A general guide to using checkpoints during training
1. ENABLE CHECKPOINTING
In your torchtitan training config, ensure that `enable_checkpoint` is set to True.
```
[checkpoint]
enable_checkpoint = true
folder = "checkpoint"
interval = 500
```
2. SAVE MODEL ONLY
By setting `last_save_model_only` to `True`, the checkpoint will only contain the model and exclude the optimizer state and extra train states, resulting in a smaller checkpoint size. You can also set `export_dtype` to control the precision the final model weights are exported in.
```
[checkpoint]
enable_checkpoint = true
last_save_model_only = true
export_dtype = "bfloat16"
```
3. EXCLUDING SPECIFIC KEYS FROM CHECKPOINT LOADING
In some cases, you may want to partially load from a previously trained checkpoint and modify certain settings, such as the number of GPUs or the current step. To achieve this, you can use the `exclude_from_loading` parameter to specify which keys should be excluded from loading.
This parameter takes a list of strings that should be excluded from loading.
When used on the command line, the parameter should be a comma-separated list of strings. For example: `--checkpoint.exclude_from_loading data_loader,lr_scheduler`.
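If you prefer to set this in the TOML config rather than on the command line, a sketch of the equivalent entry is shown below; the list form is an assumption based on the parameter's description.
```
[checkpoint]
enable_checkpoint = true
exclude_from_loading = ["data_loader", "lr_scheduler"]
```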
4. EXAMPLE CHECKPOINT CONFIGURATION
```
[checkpoint]
enable_checkpoint = true
folder = "checkpoint"
interval = 500
last_save_model_only = true
export_dtype = "bfloat16"
```
A more exhaustive and up-to-date list of checkpoint config options can be found in `torchtitan/config/job_config.py`.
### Torchtune
This guide will walk you through the steps required to convert a checkpoint from torchtitan so that it can be loaded into torchtune.
1. SAVE THE FINAL CHECKPOINT\
Once the above have been set, the final checkpoint at the end of training will consist of the model only, in the desired export dtype. However, if the final step has not been reached yet, full checkpoints will still be saved so that training can be resumed.
2. CONVERT SHARDED CHECKPOINTS TO A SINGLE FILE\
Finally, once you have obtained the last checkpoint, you can use the following command to convert the sharded checkpoints to a single .pt file that can be loaded into torchtune:
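A sketch of that conversion using PyTorch's built-in DCP `format_utils` helper is shown below; the step folder is a placeholder for your last saved checkpoint.
```
python -m torch.distributed.checkpoint.format_utils dcp_to_torch <path_to_last_step_folder> checkpoint.pt
```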
That's it. You have now successfully converted a sharded torchtitan checkpoint for use in torchtune.
### HuggingFace
TorchTitan currently supports two ways of working with huggingface checkpoints: directly saving and loading a hf checkpoint during training, or using an example conversion script to directly reformat the weights.
1. You can directly save huggingface model weights during training by using the `--checkpoint.last_save_in_safetensors_format` and `--checkpoint.last_save_model_only` options together; only the last checkpoint is written in HF format, while intermediate ones remain in DCP format. To directly load a torchtitan training session from a huggingface safetensors file, enable `--checkpoint.initial_load_model_only` and either place the checkpoint in the `step-0` folder under `--checkpoint.folder` or set `--checkpoint.initial_load_path` to the directory containing the huggingface checkpoint.
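As an illustration, a TOML sketch combining both directions is shown below; the key names simply mirror the CLI flags above, and the load path is a placeholder.
```
[checkpoint]
enable_checkpoint = true
# save the final checkpoint as HF safetensors
last_save_model_only = true
last_save_in_safetensors_format = true
# load the initial weights from an HF checkpoint directory
initial_load_model_only = true
initial_load_path = "<path_to_hf_checkpoint_dir>"
```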
2. To directly reformat the weights without needing to run a training loop, run the corresponding conversion script. The naming scheme is torchtitan-centric, e.g. convert_from_hf means convert hf -> tt. A sketch of an invocation is shown below.
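The script location and arguments below are assumptions, so consult the conversion script's `--help` for the real interface.
```
# hypothetical paths and arguments -- check the script's --help
python scripts/convert_from_hf.py <path_to_hf_checkpoint_dir> <output_dcp_dir>
```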
## How to create a seed checkpoint

Sometimes one needs to create a seed checkpoint to initialize a model from step 0.
E.g. it is hard, if not impossible, for meta initialization on multiple devices to reproduce the initialization on a single device.
A seed checkpoint initializes the model on a single CPU, and can be loaded by another job on an arbitrary number of GPUs via DCP resharding.
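A minimal sketch of creating one with torchtitan's launcher is shown below, assuming the `--checkpoint.create_seed_checkpoint` option; verify the exact flag names against `torchtitan/config/job_config.py`.
```
# Run on a single device; since the seed checkpoint is produced on one CPU,
# parallelism degrees should be left at 1.
NGPU=1 CONFIG_FILE=<path_to_config> ./run_train.sh \
  --checkpoint.enable_checkpoint --checkpoint.create_seed_checkpoint
```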