Used per-parameter FSDP #165

awgu · 2024-03-26T19:46:22Z

Numeric Parity
1D FSDP

Eager: 1k steps of minipile on 8 H100 GPUs, local batch size 8, sequence length 2048, AC/SAC, bf16 mixed precision, fp32 reduce-scatter
- FSDP1 (AC): 24.81% peak active, 33.82% peak reserved, 6100-6200 WPS
- FSDP1 (SAC): 52.98% peak active, 67.23% peak reserved, 6500-6700 WPS
- FSDP2 (AC): 23.92% peak active, 32.64% peak reserved, 6100-6300 WPS
- FSDP2 (SAC): 52.13% peak active, 62.51% peak reserved, 6600-6800 WPS
- Loss curves match between FSDP1 and FSDP2
- Memory numbers reported as percentage since that is how they are logged; can convert against 95.0396 GiB GPU memory
Compile: same setup as eager
- FSDP2 (AC), buffer reuse disabled: 28.72 GiB (30.22%) peak reserved, 7200-7500 WPS, 33% MFU
- FSDP2 (AC), buffer reuse enabled: 28.90 GiB (30.40%) peak reserved, 7200-7500 WPS, 33% MFU
- FSDP2 (SAC), buffer reuse enabled: 53.83 GiB (56.64%) peak reserved, 8100-8400 WPS, 36% MFU
- Loss curves slightly better than eager
- For fun -- how much can we push MFU?
  - If we use FSDP2 (SAC) with 16 local batch size (doubled), we get 88.23 GiB (92.84%) peak reserved, 8600 WPS, 38% MFU.
  - If we use FSDP2 (no AC) with 8 local batch size, we get 90.28 GiB (94.99%) peak reserved, 9100-9300 WPS, 40% MFU.
Why is FSDP2 faster? (1) fp32 reduce-scatter only uses one div kernel instead of two and (2), reshard_after_forward=False for the last transformer block

2D FSDP

Eager (2-way SP, 4-way FSDP): 1k steps of minipile on 8 H100 GPUs, local batch size 16 (to preserve global batch size), sequence length 2048, bf16 mixed precision, fp32 reduce-scatter
- FSDP2 (AC): 50.12% peak active, 60.97% peak reserved, 5800-5900 WPS
- FSDP2 (SAC): 76.49% peak active, 90.14% peak reserved, 6100-6300 WPS
Loss curves match 8-way FSDP
FSDP1 + SP has incorrect numerics due to the FSDP.clip_grad_norm_ not all-reducing over TP mesh dimension

Loss curves

Meta-Device Initialization

The PyTorch Core guideline is for module.reset_parameters() to only initialize parameters/buffers immediately owned by module (i.e. module.parameters(recurse=False) and module.buffers(recurse=False)).
This makes it challenging to specify custom initializations for core modules like nn.Linear and nn.Embedding. For example, in @lessw2020's depth-wise truncated normal initialization, the trunc_normal_ standard deviation depends on the layer ID, which is a property of the TransformerBlock but affects the child nn.Linears.
To disambiguate, I suggest avoiding the name reset_parameters() in the case that we violate the PyTorch Core guideline and instead use a different name (e.g. init_weights).

DCP & Save/Load

Tested 1D and 2D by specifying checkpoint_folder = "/tmp/checkpoint_andgu in the .toml, training until saving a checkpoint, terminating the run, and restarting the training to load the checkpoint -- the loss after loading looks reasonable

awgu · 2024-03-26T21:41:46Z

torchtrain/parallelisms/parallelize_llama.py

+                transformer_block = checkpoint_wrapper(
+                    transformer_block, job_config.activation_checkpoint
+                )
+            # As an optimization, do not reshard after forward for the last


I am open to not including this 'trick' since it might be confusing. The idea is that we basically can reshard_after_forward=False for the last transformer block for free.

tianyu-l

This is wonderful work!
Left some comments, some of which are my questions.

torchtrain/models/llama/model.py

tianyu-l · 2024-03-26T23:29:12Z

torchtrain/models/llama/model.py

@@ -333,13 +313,13 @@ def __init__(self, model_args: ModelArgs):
        super().__init__()
        self.model_args = model_args
        self.tok_embeddings = nn.Embedding(model_args.vocab_size, model_args.dim)
+        self.init_weights()


It seems self.init_weights() or self.reset_parameters() are called in all but Attention and FeedForward modules (probably because init_std is not available during __init__?).

This creates a bit inconsistency in terms of how many times a parameter/buffer is initialized. Does it make sense to unify the behavior, e.g. all init_weights() or reset_parameters() are called from parent other than the Transformer itself.

Following offline discussion, I changed it so that self.init_weights() is only called in Transformer.__init__() and not in any other __init__(). This meant one change to the RotaryEmbedding.__init__() to register the freqs_cis buffer. The rest remains the same.

tianyu-l · 2024-03-26T23:32:30Z

torchtrain/models/llama/model.py

@@ -359,6 +339,16 @@ def forward(self, tokens: torch.Tensor):
        freqs_cis = self.freqs_cis[0:seqlen]
        return h, freqs_cis

+    def init_weights(self):
+        if hasattr(self, "freqs_cis"):


Am I understanding correctly that currently, each branch of this if-else will be called once during meta init; and the first branch will be called again when model.init_weights() is called.

wanchaol

Looks great first pass! mainly have some confusions about meta init part

wanchaol · 2024-03-27T07:08:47Z

torchtrain/models/llama/model.py

@@ -207,19 +205,10 @@ def __init__(self, model_args: ModelArgs):
            model_args.n_heads * self.head_dim, model_args.dim, bias=False
        )

-    def reset_parameters(self, init_std):


Actually I have some confusions about the reset_parameters guideline, so reset_parameters is an optional method in nn.Module, and it does not "recursively" call into the submodule's reset_parameters call when calling the parent module's reset_parameters().

This means that if the guideline is that each module should ONLY be responsible to its own parameter, user have to loop all the submodules in the module tree and call them individually?

And if that's the case, if user decide to not recursively loop submodules, one can simply define reset_parameters to re-init its own parameters + its leaf module parameters just like we did previously (i.e. nn.Attention we can also re-init the q/k/v linears), so that user can simply call reset_parameters() on their defined root module's reset_parameters() function and not worrying about the attention layer wq/wk/wv be overriden by the builtin nn.Linear's reset_parameter call, since it would never call that. This might be sth user already doing as they might want to control how the submodule init works themselves?

Not sure if you get my question haha, am I missing sth there?

This means that if the guideline is that each module should ONLY be responsible to its own parameter, user have to loop all the submodules in the module tree and call them individually?

This is my understanding.

And if that's the case, if user decide to not recursively loop submodules, one can simply define reset_parameters to re-init its own parameters + its leaf module parameters just like we did previously (i.e. nn.Attention we can also re-init the q/k/v linears), so that user can simply call reset_parameters() on their defined root module's reset_parameters() function and not worrying about the attention layer wq/wk/wv be overriden by the builtin nn.Linear's reset_parameter call, since it would never call that. This might be sth user already doing as they might want to control how the submodule init works themselves?

I agree with the approach you are mentioning

if we ignore FSDP

if we are using FSDP1 and every weight init does not depend on the original tensor shape

It happens to be that the weight init used for the Llama model in torchtrain does not depend on the original tensor shape (namely, the weight init is elementwise). However, this may not be the case for other models (e.g. those that compute fan-in/fan-out), in which case this approach would silently sample from the incorrect distribution.

FSDP1 calls reset_parameters() before sharding.

The current approach is aligned with the core guideline, so for FullyShardedDataParallel(module), FSDP1 calls submodule.reset_parameters() for each managed submodule in module.modules() (managed is defined by excluding any nested FullyShardedDataParallel modules or their children). This is the only way to ensure that each parameter is initialized exactly once.

If a parent Attention module re-initialized its Q/K/V linear modules, then FSDP1 would initialize the Q/K/V linears twice (once from Linear.reset_parameters() and once from Attention.reset_parameters()). This can still give a valid probability distribution, but it could give different values for a fixed seed compared to if the Linear.reset_parameters() were skipped (e.g. if not using FSDP and just calling model.reset_parameters() on the root model). This is not a major problem since it does not mean incorrect randomness but is still worth mentioning.

If we further call model.reset_parameters() after sharding with FSDP1, then we have 1D flattened sharded tensors, which no longer preserve the original tensor shape. Therefore, calling model.reset_parameters() at this point will give incorrect randomness in cases depending on the shape.

In summary, following the core guideline is the only way to guarantee that each parameter is initialized once and before sharding. The constraint to initialize once is not required for correct randomness but may help reproducibility.

I see, ok this make sense, so it is critical to only initialize it once for reproducibility when starting a fixed seed.

At the same time though, the DTensor RNG will be different than local, so I am not sure if this reproducibility argument makes sense. We would not be able to ensure same results for FSDP2 compared to a single-GPU non-DTensor setup.

torchtrain/parallelisms/parallelize_llama.py

wanchaol

Nice work! lgtm :)

wanchaol · 2024-03-27T16:48:28Z

train.py

-                torch.nn.utils.clip_grad_norm_(
-                    model.parameters(), job_config.training.max_norm
-                )
+            torch.nn.utils.clip_grad_norm_(


I like the fact that it composes with existing impl instead of using a separate impl!

awgu · 2024-03-27T19:47:33Z

After pytorch/pytorch#122801 lands, the save/load with torch.compile should work. (I tested locally.)

tianyu-l

Looks great to me!

awgu · 2024-03-27T20:27:11Z

train.py

@@ -199,7 +197,6 @@ def main(job_config: JobConfig):

    # torch.compile model for improved performance
    if job_config.training.compile:
-        torch._inductor.config.allow_buffer_reuse = False


Since pytorch/pytorch#122444 landed, we can re-enable buffer reuse.

awgu · 2024-03-27T20:28:05Z

train.py

@@ -186,6 +179,11 @@ def main(job_config: JobConfig):
    model = models_parallelize_fns[model_name](
        model, world_mesh, parallel_dims, job_config
    )
+    # set this as required by DTensor to work with `to_empty`
+    # TODO: remove in the future when enabled by default for wrapper subclasses
+    torch.__future__.set_swap_module_params_on_conversion(True)


After pytorch/pytorch#122755, we can remove this call.

awgu · 2024-03-28T18:54:09Z

If anything breaks because of this PR, please ping me :)

awgu · 2024-03-28T22:13:33Z

Local batch size 6, torch.compile, bf16 mixed precision, no AC, reshard_after_forward=False for all transformer blocks, 8x H100s:
9250-9400 WPS, 40.9-41.5% MFU

@lessw2020

**Numeric Parity** 1D FSDP - Eager: 1k steps of minipile on 8 H100 GPUs, local batch size 8, sequence length 2048, AC/SAC, bf16 mixed precision, fp32 reduce-scatter - FSDP1 (AC): 24.81% peak active, 33.82% peak reserved, 6100-6200 WPS - FSDP1 (SAC): 52.98% peak active, 67.23% peak reserved, 6500-6700 WPS - FSDP2 (AC): 23.92% peak active, 32.64% peak reserved, 6100-6300 WPS - FSDP2 (SAC): 52.13% peak active, 62.51% peak reserved, 6600-6800 WPS - Loss curves match between FSDP1 and FSDP2 - Memory numbers reported as percentage since that is how they are logged; can convert against 95.0396 GiB GPU memory - Compile: same setup as eager - FSDP2 (AC), buffer reuse disabled: 28.72 GiB (30.22%) peak reserved, 7200-7500 WPS, 33% MFU - FSDP2 (AC), buffer reuse enabled: 28.90 GiB (30.40%) peak reserved, 7200-7500 WPS, 33% MFU - FSDP2 (SAC), buffer reuse enabled: 53.83 GiB (56.64%) peak reserved, 8100-8400 WPS, 36% MFU - Loss curves slightly better than eager - For fun -- how much can we push MFU? - If we use FSDP2 (SAC) with 16 local batch size (doubled), we get 88.23 GiB (92.84%) peak reserved, 8600 WPS, 38% MFU. - If we use FSDP2 (no AC) with 8 local batch size, we get 90.28 GiB (94.99%) peak reserved, 9100-9300 WPS, 40% MFU. - Why is FSDP2 faster? (1) fp32 reduce-scatter only uses one div kernel instead of two and (2), `reshard_after_forward=False` for the last transformer block 2D FSDP - Eager (2-way SP, 4-way FSDP): 1k steps of minipile on 8 H100 GPUs, local batch size 16 (to preserve global batch size), sequence length 2048, bf16 mixed precision, fp32 reduce-scatter - FSDP2 (AC): 50.12% peak active, 60.97% peak reserved, 5800-5900 WPS - FSDP2 (SAC): 76.49% peak active, 90.14% peak reserved, 6100-6300 WPS - Loss curves match 8-way FSDP - FSDP1 + SP has incorrect numerics due to the `FSDP.clip_grad_norm_` not all-reducing over TP mesh dimension <details> <summary> Loss curves </summary> <img width="732" alt="Screenshot 2024-03-26 at 3 31 19 PM" src="https://github.com/pytorch/torchtrain/assets/31054793/59ec71cc-ad0a-4dd1-b5c6-a8cbf9ab5e85"> </details> **Meta-Device Initialization** - The PyTorch Core guideline is for `module.reset_parameters()` to only initialize parameters/buffers immediately owned by `module` (i.e. `module.parameters(recurse=False)` and `module.buffers(recurse=False)`). - This makes it challenging to specify custom initializations for core modules like `nn.Linear` and `nn.Embedding`. For example, in @lessw2020's depth-wise truncated normal initialization, the `trunc_normal_` standard deviation depends on the layer ID, which is a property of the `TransformerBlock` but affects the child `nn.Linear`s. - To disambiguate, I suggest avoiding the name `reset_parameters()` in the case that we violate the PyTorch Core guideline and instead use a different name (e.g. `init_weights`). **DCP & Save/Load** - Tested 1D and 2D by specifying `checkpoint_folder = "/tmp/checkpoint_andgu` in the `.toml`, training until saving a checkpoint, terminating the run, and restarting the training to load the checkpoint -- the loss after loading looks reasonable

@lessw2020

**Numeric Parity** 1D FSDP - Eager: 1k steps of minipile on 8 H100 GPUs, local batch size 8, sequence length 2048, AC/SAC, bf16 mixed precision, fp32 reduce-scatter - FSDP1 (AC): 24.81% peak active, 33.82% peak reserved, 6100-6200 WPS - FSDP1 (SAC): 52.98% peak active, 67.23% peak reserved, 6500-6700 WPS - FSDP2 (AC): 23.92% peak active, 32.64% peak reserved, 6100-6300 WPS - FSDP2 (SAC): 52.13% peak active, 62.51% peak reserved, 6600-6800 WPS - Loss curves match between FSDP1 and FSDP2 - Memory numbers reported as percentage since that is how they are logged; can convert against 95.0396 GiB GPU memory - Compile: same setup as eager - FSDP2 (AC), buffer reuse disabled: 28.72 GiB (30.22%) peak reserved, 7200-7500 WPS, 33% MFU - FSDP2 (AC), buffer reuse enabled: 28.90 GiB (30.40%) peak reserved, 7200-7500 WPS, 33% MFU - FSDP2 (SAC), buffer reuse enabled: 53.83 GiB (56.64%) peak reserved, 8100-8400 WPS, 36% MFU - Loss curves slightly better than eager - For fun -- how much can we push MFU? - If we use FSDP2 (SAC) with 16 local batch size (doubled), we get 88.23 GiB (92.84%) peak reserved, 8600 WPS, 38% MFU. - If we use FSDP2 (no AC) with 8 local batch size, we get 90.28 GiB (94.99%) peak reserved, 9100-9300 WPS, 40% MFU. - Why is FSDP2 faster? (1) fp32 reduce-scatter only uses one div kernel instead of two and (2), `reshard_after_forward=False` for the last transformer block 2D FSDP - Eager (2-way SP, 4-way FSDP): 1k steps of minipile on 8 H100 GPUs, local batch size 16 (to preserve global batch size), sequence length 2048, bf16 mixed precision, fp32 reduce-scatter - FSDP2 (AC): 50.12% peak active, 60.97% peak reserved, 5800-5900 WPS - FSDP2 (SAC): 76.49% peak active, 90.14% peak reserved, 6100-6300 WPS - Loss curves match 8-way FSDP - FSDP1 + SP has incorrect numerics due to the `FSDP.clip_grad_norm_` not all-reducing over TP mesh dimension <details> <summary> Loss curves </summary> <img width="732" alt="Screenshot 2024-03-26 at 3 31 19 PM" src="https://github.com/pytorch/torchtrain/assets/31054793/59ec71cc-ad0a-4dd1-b5c6-a8cbf9ab5e85"> </details> **Meta-Device Initialization** - The PyTorch Core guideline is for `module.reset_parameters()` to only initialize parameters/buffers immediately owned by `module` (i.e. `module.parameters(recurse=False)` and `module.buffers(recurse=False)`). - This makes it challenging to specify custom initializations for core modules like `nn.Linear` and `nn.Embedding`. For example, in @lessw2020's depth-wise truncated normal initialization, the `trunc_normal_` standard deviation depends on the layer ID, which is a property of the `TransformerBlock` but affects the child `nn.Linear`s. - To disambiguate, I suggest avoiding the name `reset_parameters()` in the case that we violate the PyTorch Core guideline and instead use a different name (e.g. `init_weights`). **DCP & Save/Load** - Tested 1D and 2D by specifying `checkpoint_folder = "/tmp/checkpoint_andgu` in the `.toml`, training until saving a checkpoint, terminating the run, and restarting the training to load the checkpoint -- the loss after loading looks reasonable

* Load missing keys default from argparse (#111) ``` + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0,1 + CONFIG_FILE=./train_configs/debug_model.toml + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] ***************************************** W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] ***************************************** [rank0]:2024-03-04 17:01:28,834 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4] [rank1]:2024-03-04 17:01:28,857 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4] [rank0]:2024-03-04 17:01:29,712 - root - INFO - Starting job: debug training [rank0]:2024-03-04 17:01:29,712 - root - INFO - Building llama [rank0]:2024-03-04 17:01:29,719 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-03-04 17:01:29,719 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2 [rank1]:2024-03-04 17:01:31,187 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model [rank1]:2024-03-04 17:01:31,188 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2 [rank0]:2024-03-04 17:01:31,346 - root - INFO - Model fully initialized via reset_params [rank0]:2024-03-04 17:01:31,346 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank0]:2024-03-04 17:01:31,347 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-03-04 17:01:31,347 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use [rank0]:2024-03-04 17:01:32,502 - root - INFO - Applied FSDP to the model... [rank0]:2024-03-04 17:01:32,503 - root - INFO - Gradient scaling not enabled. [rank0]:2024-03-04 17:01:32,504 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240304-1701. [rank0]:2024-03-04 17:01:32,901 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank0]:2024-03-04 17:01:34,806 - root - INFO - �[36mstep: 1 �[32mloss: 10.8424 �[39miter: �[34m 1.8688�[39m data: �[34m0.0316 �[39mlr: �[33m0.00026667�[39m [rank0]:2024-03-04 17:01:34,891 - root - INFO - �[36mstep: 2 �[32mloss: 10.7581 �[39miter: �[34m 0.0476�[39m data: �[34m0.0357 �[39mlr: �[33m0.00053333�[39m [rank0]:2024-03-04 17:01:34,970 - root - INFO - �[36mstep: 3 �[32mloss: 10.6239 �[39miter: �[34m 0.045�[39m data: �[34m0.0333 �[39mlr: �[33m0.0008�[39m [rank0]:2024-03-04 17:01:35,048 - root - INFO - �[36mstep: 4 �[32mloss: 10.4163 �[39miter: �[34m 0.0455�[39m data: �[34m0.0323 �[39mlr: �[33m0.0007�[39m [rank0]:2024-03-04 17:01:35,127 - root - INFO - �[36mstep: 5 �[32mloss: 10.1529 �[39miter: �[34m 0.0459�[39m data: �[34m0.032 �[39mlr: �[33m0.0006�[39m [rank0]:2024-03-04 17:01:35,206 - root - INFO - �[36mstep: 6 �[32mloss: 9.8899 �[39miter: �[34m 0.0468�[39m data: �[34m0.0311 �[39mlr: �[33m0.0005�[39m [rank0]:2024-03-04 17:01:35,284 - root - INFO - �[36mstep: 7 �[32mloss: 9.7204 �[39miter: �[34m 0.0461�[39m data: �[34m0.0312 �[39mlr: �[33m0.0004�[39m [rank0]:2024-03-04 17:01:35,425 - root - INFO - �[36mstep: 8 �[32mloss: 9.3757 �[39miter: �[34m 0.0457�[39m data: �[34m0.0319 �[39mlr: �[33m0.0003�[39m [rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:314] Completed Stage: Warm Up [rank0]:2024-03-04 17:01:35,537 - root - INFO - �[36mstep: 9 �[32mloss: 9.1883 �[39miter: �[34m 0.0762�[39m data: �[34m0.0318 �[39mlr: �[33m0.0002�[39m [rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:314] Completed Stage: Warm Up [rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:320] Completed Stage: Collection [rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:324] Completed Stage: Post Processing [rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:320] Completed Stage: Collection [rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:324] Completed Stage: Post Processing [rank0]:2024-03-04 17:01:35,958 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10 [rank0]:2024-03-04 17:01:35,971 - root - INFO - �[36mstep: 10 �[32mloss: 9.1212 �[39miter: �[34m 0.0808�[39m data: �[34m0.0319 �[39mlr: �[33m0.0001�[39m [rank0]:2024-03-04 17:01:35,972 - root - INFO - Average iter time: 0.0553 seconds [rank0]:2024-03-04 17:01:35,972 - root - INFO - Average data load time: 0.0317 seconds [rank0]:2024-03-04 17:01:35,972 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2% [rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44% [rank0]:num retries: 0, num ooms: 0 [rank0]:NCCL version 2.19.3+cuda12.0 ``` Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com> * Add meta_init, enable it as default init process (#84) This PR enables meta_init functionality to avoid OOM'ing on cpu for larger models. The core functionality is in meta_init.py, and a few changes in parallelization and train.py. Key items: 1 - this is largely the same as the earlier PR I had for meta_init, but I did a new one b/c faster than reworking it with all the interim changes. 2 - to address feedback in previous PR: a - why do we need meta_init.py, can't we just do: ~~~ with torch.device("meta"): model = Model.from_args(...) ~~~ Unfortunately this does not work b/c the rope embeddings are treated differently (buffer) and thus the simple lambda call from param_init_fn in FSDP (lambda module: module.to_device('cuda') ) will not invoke or move the rope embeddings and the model will fail on first forward. This issue relates to the nn.embeddings not being moved, and that the device is referenced in the forward pass for the current rope class. Have opened https://github.com/pytorch/torchtrain/issues/110 to track this and investigate while not holding up meta init that is working from landing. b - per earlier feedback - meta init is now 'not optional' but simply the default. This should ensure all models leverage it and ensure we aren't missing things for future meta_init aspects. 3 - misc change - I switched the model_params to just do the normal all params count instead of 'unique params' b/c it does not mesh with what people perceive model size as. Testing: tested both debugmodel and 26B model with and without meta init to confirm same loss curves. Note for future reference - if you get a bad init (meta init failure) you will simply not train (loss is same every iter). If you fail to call reset params after FSDP, then you will train (b/c we default to torch.randn_like) but your starting loss will be 5x+ higher (telling you that you have not properly init'ed the model). * Fix feedback from PR 111 (#113) Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com> * fix SP minor issues ghstack-source-id: 5133a8d97ad209b569e0fc528e58daafdd31d80d Pull Request resolved: https://github.com/pytorch/torchtrain/pull/114 * enable loss parallel in SP ghstack-source-id: a0c8b4454f75ad1cd9824ac89a1df0182f6a7d8c Pull Request resolved: https://github.com/pytorch/torchtrain/pull/112 * Float8_experimental option for training (#102) * add miniPile dataset for pretraining, 1M entries (solves the 'out of data' at 40 iters issue) (#88) This PR add's minipile (1M, 6GB) dataset as an option for pretraining with torchtrain. It resolves the issue where we run out of data after 40 iterations with the default alpaca dataset. Per @tianyu-l's excellent suggestion, have refactored to have a single hf_datasets.py file that supports both minipile and alpaca since it turned out no need for any different tokenizer, etc. Also cleaned up the datasets package so that create_tokenizer is exposed directly, and thus all public apis can be used directly from torchtrain.datasets. Lastly - added warning if/when a dataset is being re-looped so users don't get burned by overfitting: <img width="1294" alt="Screenshot 2024-03-06 at 5 11 09 AM" src="https://github.com/pytorch/torchtrain/assets/46302957/82480b6f-c677-4794-80c5-5c10b037732a"> Adds a color highlight to showcase what dataloader was built: <img width="1360" alt="Screenshot 2024-03-05 at 9 19 10 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/4717ec6a-14bb-4283-a3ae-fa40c27deee0"> and <img width="1360" alt="Screenshot 2024-03-05 at 9 22 01 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/dbf32d51-2dd4-4526-8855-9b33b627559e"> Usage: just add "minipile" or "alpaca" as the dataset in the training config toml file. <img width="439" alt="Screenshot 2024-02-25 at 12 35 26 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/1afbaed1-07f8-4e37-b8cc-80190db7fb27"> Testing: verified training loss is improving and ran for 100 iters to verify no issue with out of data any longer with minipile. reran with alpaca and saw the expected out of data at 40 iters without infinite loop option, runs to 100 with infinite. Notes: I did not make this a default dataset since for debugmodel, mostly running 10 iters is fine and there's 6GB to pull down. <img width="869" alt="Screenshot 2024-02-25 at 12 30 29 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/1070a80a-ad20-4f0f-a860-e13caa3120a0"> * add data loading option to load from local file system ghstack-source-id: 3c930054d3b04faf3866048740a2ef887d066dd6 Pull Request resolved: https://github.com/pytorch/torchtrain/pull/117 * add llama 13B configs ghstack-source-id: 733bf85716cda3a5b9af780eba79c9b5dd66abad Pull Request resolved: https://github.com/pytorch/torchtrain/pull/121 * add llama 70B toml ghstack-source-id: d7cd26d84aa2442ac45223992e1766397e52c8d8 Pull Request resolved: https://github.com/pytorch/torchtrain/pull/122 * set betas and weight decay for optimizers according to suggestions in https://github.com/pytorch/torchtrain/issues/118#issuecomment-1986470746 ghstack-source-id: 357f0872cd1c9bad2c4c256d47adbd3f716a7651 Pull Request resolved: https://github.com/pytorch/torchtrain/pull/123 * Add c4 dataset (177M, streaming), update multi-node support for latest job configs (#124) This PR: 1 - adds the english language portion of c4 dataset, which has 177M entries. (https://huggingface.co/datasets/allenai/c4) Due to the size, streaming is enabled as the default. This is the allen-ai/c4, as apparently the original c4 is being deprecated and HF advises to use allen-ai now. For comparison per @tianyu-l request - 40 iterations avg time: alpaca cached loading: Average data load time: 0.0279 seconds c4 streaming loading: Average data load time: 0.0290 seconds There is a longer initial delay during the 'preparing c4' vs alpaca (i.e. 45 seconds vs 10 seconds), but after that speed is similar. Dataset sample (not displayed in training, just an excerpt I pulled to double check the data flow): <img width="1233" alt="Screenshot 2024-03-08 at 5 31 06 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/94915f80-da70-48d1-8c43-43f874fef121"> 2 - I also updated the multi-node slurm file to account for the new job config. Test: verified no looping with 100 iterations, sampled data streamed to verify. * Add openwebtext dataset for larger scale training without shuffling (#130) This PR adds the openwebtext 1M dataset. This is a homogenous dataset, so we are able to train successfully while not having any shuffle in our dataset loader. 1 - adds the dateset to hf_datasets 2 - makes the default dataset for 13b and 70b as openwebtext since that is the preferred choice for larger scale training. Testing - ran 5K iters (9 nodes) to verify no spiking issues: <img width="787" alt="Screenshot 2024-03-12 at 9 50 57 AM" src="https://github.com/pytorch/torchtrain/assets/46302957/420fa1fc-50f8-47bc-9b07-02c8fa132e7c"> * [TorchTrain][Checkpoint] Fix TrainState state_dict to unblock loading (#131) This fix would temporarily unblock loading. So we won't run into the issue of: ``` [rank0]:[rank0]: train_state.losses.append(train_state.current_loss) [rank0]:[rank0]: AttributeError: 'float' object has no attribute 'append' ``` However, current_loss and losses are still not correct, since by current setup, losses and current_losses would be different across different ranks. Also, we don't know the size of losses because this is based on the # of steps. So loading still work but the value of current_loss and losses are not being loaded correctly. I will follow up with further fixes. * improve logging ghstack-source-id: de61ec093b43a2ccbf1156c76ba81ecd698a6a8a Pull Request resolved: https://github.com/pytorch/torchtrain/pull/132 * use SequenceParallel style in tp/sp (#133) simplify things given we already have SequenceParallel style landed in main * support TP-only parallelism ghstack-source-id: c13ebb8de8e8e9203624b5dd710a046d17311b0f Pull Request resolved: https://github.com/pytorch/torchtrain/pull/137 * disable verbose print from profiling ghstack-source-id: ca6eb8f42bf3c2a59d8e6389e7fe94ed85103099 Pull Request resolved: https://github.com/pytorch/torchtrain/pull/136 * add Selective layer activation checkpointing, single control for turning AC on or off. (#125) This PR: 1 - adds selective layer checkpointing - this lets the user select every x layer to checkpoint: i.e. 2 = every other layer is checkpointed. spec for config was updated by Wanchao - so we now have this layout for AC which is hopefully self-explanatory (covers None, full, Selective Op or Selective Layer and layer filtering policy: <img width="941" alt="Screenshot 2024-03-13 at 6 09 52 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/4b992286-1fbd-4a14-957a-4325f81a9ab4"> Thus, it lets user toggle between traditional 'all layers' to more and more fine grained checkpointing. Note that I implemented this for IBM last summer and in their llama testing, every 2nd layer was the best bang/buck so I have made that the default. 2 - the config file has been updated to make a new [activation_checkpointing] section and make it easier to modify vs being dumped into the training section. Testing and results: I tested all the AC options to ensure all options are working, and that we fail if both types are set to true in config: <img width="608" alt="Screenshot 2024-03-09 at 3 43 52 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/e3c20fbf-73e2-492d-9fb9-f32e772e239e"> * remove per iter syncronize ghstack-source-id: 581c9115e89d3de57e558175b527c12c06a6808c Pull Request resolved: https://github.com/pytorch/torchtrain/pull/134 * Shorten nccl comm timeout and enable flight recorder dumping (#103) Timeout ------- It's convenient whether during iterative debugging or long running training to find out asap about a failure. The default timeout is way too long and leads to wasted cluster time or developer frustration. Timeout can be adjusted via cmdline or in .toml if it needs to be larger for a particular model. Another useful pattern can be to set a large timeout for initialization and then tighten it after iteration 1. We can add this later if desired. Ideally we could pass the timeout to the device mesh ctor, but it's not ready yet. Also, we can change timeouts of the existing PGs after creating them, but that's more LOC and not necessary unless we want to change the timeouts at runtime. Dumps ----- Dumping on timeout should be a safe default for everyone. It has the side-effect of requiring a dump path which defaults to ~/pgnccl_dump but can be overridden via DUMP_PATH env. The raw content of the dump is a pickle that is intended to be consumed through scripts/tools which are under development, so it may not be easy to know how to use these for now. As the tooling matures, we should provide reference docs and probably print out pointers in the logs when we perform the dump. Test plan: tested locally by adding a rank0 sleep for 10sec inside the training loop, validating all 8 ranks dumped a trace. * fix up gpu memory monitoring and logging ghstack-source-id: 2f79d081c7724dbc34f357913671e8aefdf437b1 Pull Request resolved: https://github.com/pytorch/torchtrain/pull/147 * Separate timeout during init and training (#149) Allow a tighter timeout during training than during init. Init includes the first train step, as well as any loading and setup. It can be slower and less predictable due to various factors including lazy initialization or jit compilation. After the first train step, we expect more predictable runtime and benefit from a tighter timeout to give quick feedback on a hang. Tested by pasting this code in 2 places ``` if dp_mesh.get_local_rank() == 0 and train_state.step == 1: import time time.sleep(10) ``` (a) before calling set_pg_timeout, which did not cause a timeout (b) after calling set_pg_timeout, which timed out * Update activation check with updates to config manager (#152) * Refactor to clean up parallelisms/__init__.py (second attempt, didn't land correctly) ghstack-source-id: 3dfec3ed134105cc5a951f8db160c8c2a9b3349b Pull Request resolved: https://github.com/pytorch/torchtrain/pull/154 * enable gc control scheduling to help avoid stragglers (#148) This PR adds control over Python garbage collection to help avoid stragglers during large scale training. updates - this feature is now exposed as a controllable option gc_schedule, with a default of 50. 0 = not enabled. int = schedules gc at every int iters during training loop. <img width="1078" alt="Screenshot 2024-03-15 at 12 39 26 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/1ee387c5-f0a6-4366-936c-a1e281dad88f"> Effectively we disable the gc, run one collection to ensure a good starting point, and then at the start of each gc_schedule iter, we call gc to free up things. By enforcing a fixed schedule for collection, it helps all ranks stay more in synch. Point of reference - on 512 GPU FSDP, adding this (gc_schedule=1) gave a perf boost of ~1.5% per iter just by virtue of better synch. (this was originally developed during dist compiler to resolve stragglers, I believe @fegin came up with this solution). * Add float8 specific parallel strategies (#153) * add MFU to metrics ghstack-source-id: 995efd6f460f3fe83ecf8d72c2178554f325485b Pull Request resolved: https://github.com/pytorch/torchtrain/pull/151 * disable buffer reuse for compile for now (#156) disable buffer reuse for compile to have close numerics to eager mode, as suggested by @Chillee This is probably only a temp change until buff reuse fix in inductor * refactor config manager and support cmd overrides (#157) This PR supports explicit cmd overrides, to allow infra layers to override certain options (the most important one is dump_folder) * Add support for generating debug traces on failure * rename sequence_parallel to tensor_parallel (#162) This PR renames sequence_parallel to tensor_parallel, as sequence parallel is only applied to rmsnorm layers, a broader name should be tensor_parallel, maybe with sequence_parallel enabled ghstack broken :( so using direct branch push instead * add basic AC configs for 13B and 70B (#169) as titled, currently 13B use selective op, and 70B use selective layer, we can do some more experiments and adjust the configs later * [TorchTrain][Checkpoint] Update train state to include global_avg_losses and global_max_losses (#167) Based on discussion with @tianyu-l, we decided to only checkpoint `global_avg_losses` and `global_max_losses` per log frequency iteration to avoid all_reduce and device sync every iteration. `TrainState.current_loss` and `TrainState.losses` are removed from TrainState `state_dict()` and `load_state_dict()` call. Tested with saving/loading with 30 steps with log_frequency = 10 and loading with 40 steps to resume training. The numerics in global_avg_losses and global_max_losses in the list aligns with expected. ``` Step 30 save: [rank0]:before save: self.states['train_state']=TrainState(step=30, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555], log_steps=[1, 11, 21]) Step 30 load: [rank0]:after load: self.states['train_state']=TrainState(step=30, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555], log_steps=[1, 11, 21]) Step 40 load and resume training: [rank0]:before save: self.states['train_state']=TrainState(step=40, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945, 5.596909999847412], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555, 5.6796345710754395], log_steps=[1, 11, 21, 31]) ``` * Basic integration test infra (#170) Summary: PR adds an option `use_for_integration_test`. when set to `True`, this adds the config to the integration test suite. A test runner picks all the configs marked for integration test and run them. Test Plan: ``` =====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh===== + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0 + CONFIG_FILE=./train_configs/debug_model.toml + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] ***************************************** W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] ***************************************** [rank0]:2024-03-27 09:46:32,214 - root - INFO - Starting job: LLaMA debug training [rank0]:2024-03-27 09:46:32,372 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:2024-03-27 09:46:32,375 - root - INFO - Building 1-D device mesh with ['dp'], [4] [rank0]:2024-03-27 09:46:32,377 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-03-27 09:46:32,384 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank0]:2024-03-27 09:46:32,384 - root - INFO - Preparing alpaca dataset from HuggingFace [rank0]:2024-03-27 09:46:34,015 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank0]:2024-03-27 09:46:34,024 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-03-27 09:46:34,025 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory [rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied selective activation checkpointing to the model [rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied FSDP to the model [rank0]:2024-03-27 09:46:34,171 - root - INFO - Model fully initialized via reset_parameters [rank0]:2024-03-27 09:46:34,171 - root - INFO - Gradient scaling not enabled [rank0]:2024-03-27 09:46:34,171 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-0946 [rank0]:2024-03-27 09:46:34,809 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.) [rank0]: warnings.warn( [rank0]:2024-03-27 09:46:35,627 - root - INFO - �[36mstep: 1 �[32mloss: 10.9486 �[33mmemory: 9.42GiB(9.91%) �[34mwps: 20,066 �[35mmfu: 0.25%�[39m [rank0]:2024-03-27 09:46:35,627 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05 [rank0]:2024-03-27 09:46:35,705 - root - INFO - �[36mstep: 2 �[32mloss: 10.8786 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 212,046 �[35mmfu: 2.60%�[39m [rank0]:2024-03-27 09:46:35,786 - root - INFO - �[36mstep: 3 �[32mloss: 10.7362 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 204,441 �[35mmfu: 2.50%�[39m [rank0]:2024-03-27 09:46:35,863 - root - INFO - �[36mstep: 4 �[32mloss: 10.5094 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 216,800 �[35mmfu: 2.66%�[39m [rank0]:2024-03-27 09:46:35,939 - root - INFO - �[36mstep: 5 �[32mloss: 10.2755 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 216,527 �[35mmfu: 2.65%�[39m [rank0]:2024-03-27 09:46:36,016 - root - INFO - �[36mstep: 6 �[32mloss: 10.0318 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 214,117 �[35mmfu: 2.62%�[39m [rank0]:2024-03-27 09:46:36,093 - root - INFO - �[36mstep: 7 �[32mloss: 9.7929 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 216,509 �[35mmfu: 2.65%�[39m [rank0]:2024-03-27 09:46:36,192 - root - INFO - �[36mstep: 8 �[32mloss: 9.5539 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 166,639 �[35mmfu: 2.04%�[39m [rank0]:2024-03-27 09:46:36,329 - root - INFO - �[36mstep: 9 �[32mloss: 9.3909 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 120,381 �[35mmfu: 1.47%�[39m [rank0]:[rank0]:[W327 09:46:36.744143018 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:2024-03-27 09:46:36,409 - root - INFO - �[36mstep: 10 �[32mloss: 9.2749 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 207,613 �[35mmfu: 2.54%�[39m [rank0]:NCCL version 2.20.5+cuda12.0 ``` Reviewers: Subscribers: Tasks: Tags: --------- Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com> * Add 2D integration test (FSDP + TP) (#171) Summary: Add a 2D test to integration test suite Test Plan: ``` =====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh===== + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0 + CONFIG_FILE=./train_configs/debug_model.toml + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] ***************************************** W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] ***************************************** [rank0]:2024-03-27 14:29:49,466 - root - INFO - Starting job: LLaMA debug training [rank0]:2024-03-27 14:29:49,615 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:2024-03-27 14:29:49,621 - root - INFO - Building 1-D device mesh with ['dp'], [4] [rank0]:2024-03-27 14:29:49,623 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-03-27 14:29:49,630 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank0]:2024-03-27 14:29:49,630 - root - INFO - Preparing alpaca dataset from HuggingFace [rank0]:2024-03-27 14:29:51,114 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank0]:2024-03-27 14:29:51,124 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-03-27 14:29:51,124 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory [rank0]:2024-03-27 14:29:51,259 - root - INFO - Applied selective activation checkpointing to the model [rank0]:2024-03-27 14:29:51,259 - root - INFO - Applied FSDP to the model [rank0]:2024-03-27 14:29:51,284 - root - INFO - Model fully initialized via reset_parameters [rank0]:2024-03-27 14:29:51,284 - root - INFO - Gradient scaling not enabled [rank0]:2024-03-27 14:29:51,285 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-1429 [rank0]:2024-03-27 14:29:52,056 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.) [rank0]: warnings.warn( [rank0]:2024-03-27 14:29:52,825 - root - INFO - �[36mstep: 1 �[32mloss: 10.7425 �[33mmemory: 9.42GiB(9.91%) �[34mwps: 21,337 �[35mmfu: 0.26%�[39m [rank0]:2024-03-27 14:29:52,825 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05 [rank0]:2024-03-27 14:29:52,905 - root - INFO - �[36mstep: 2 �[32mloss: 10.6722 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 208,060 �[35mmfu: 2.55%�[39m [rank0]:2024-03-27 14:29:52,982 - root - INFO - �[36mstep: 3 �[32mloss: 10.5435 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 213,622 �[35mmfu: 2.62%�[39m [rank0]:2024-03-27 14:29:53,060 - root - INFO - �[36mstep: 4 �[32mloss: 10.3359 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 212,856 �[35mmfu: 2.61%�[39m [rank0]:2024-03-27 14:29:53,139 - root - INFO - �[36mstep: 5 �[32mloss: 10.0965 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 209,326 �[35mmfu: 2.56%�[39m [rank0]:2024-03-27 14:29:53,215 - root - INFO - �[36mstep: 6 �[32mloss: 9.8806 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 216,808 �[35mmfu: 2.66%�[39m [rank0]:2024-03-27 14:29:53,292 - root - INFO - �[36mstep: 7 �[32mloss: 9.6442 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 214,874 �[35mmfu: 2.63%�[39m [rank0]:2024-03-27 14:29:53,367 - root - INFO - �[36mstep: 8 �[32mloss: 9.4349 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 220,877 �[35mmfu: 2.70%�[39m [rank0]:2024-03-27 14:29:53,500 - root - INFO - �[36mstep: 9 �[32mloss: 9.2674 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 123,924 �[35mmfu: 1.52%�[39m [rank0]:[rank0]:[W327 14:29:53.248291822 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:2024-03-27 14:29:53,577 - root - INFO - �[36mstep: 10 �[32mloss: 9.1404 �[33mmemory: 11.38GiB(11.97%) �[34mwps: 214,910 �[35mmfu: 2.63%�[39m [rank0]:NCCL version 2.20.5+cuda12.0 =====Integration test: CONFIG_FILE=./train_configs/debug_model_2d.toml NGPU=4 ./run_llama_train.sh===== + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0 + CONFIG_FILE=./train_configs/debug_model_2d.toml + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_2d.toml W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] ***************************************** W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] ***************************************** [rank0]:2024-03-27 14:30:00,872 - root - INFO - Starting job: LLaMA debug training [rank0]:2024-03-27 14:30:01,177 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:2024-03-27 14:30:01,182 - root - INFO - Building 2-D device mesh with ['dp', 'tp'], [2, 2] [rank0]:2024-03-27 14:30:01,185 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-03-27 14:30:01,194 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank0]:2024-03-27 14:30:01,195 - root - INFO - Preparing alpaca dataset from HuggingFace [rank0]:2024-03-27 14:30:02,807 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank0]:2024-03-27 14:30:02,818 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-03-27 14:30:02,819 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory [rank0]:2024-03-27 14:30:02,830 - root - INFO - Applied Sequence Parallelism to the model [rank0]:2024-03-27 14:30:02,975 - root - INFO - Applied selective activation checkpointing to the model [rank0]:2024-03-27 14:30:02,975 - root - INFO - Applied FSDP to the model [rank0]:2024-03-27 14:30:03,004 - root - INFO - Model fully initialized via reset_parameters [rank0]:2024-03-27 14:30:03,004 - root - INFO - Gradient scaling not enabled [rank0]:2024-03-27 14:30:03,005 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-1430 [rank0]:2024-03-27 14:30:03,642 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.) [rank0]: warnings.warn( [rank0]:2024-03-27 14:30:04,528 - root - INFO - �[36mstep: 1 �[32mloss: 10.8502 �[33mmemory: 5.71GiB(6.01%) �[34mwps: 9,259 �[35mmfu: 0.11%�[39m [rank0]:2024-03-27 14:30:04,528 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05 [rank0]:2024-03-27 14:30:04,679 - root - INFO - �[36mstep: 2 �[32mloss: 10.7671 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 54,430 �[35mmfu: 0.67%�[39m [rank0]:2024-03-27 14:30:04,773 - root - INFO - �[36mstep: 3 �[32mloss: 10.6390 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 88,457 �[35mmfu: 1.08%�[39m [rank0]:2024-03-27 14:30:04,864 - root - INFO - �[36mstep: 4 �[32mloss: 10.4210 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 90,384 �[35mmfu: 1.11%�[39m [rank0]:2024-03-27 14:30:04,954 - root - INFO - �[36mstep: 5 �[32mloss: 10.1648 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 93,058 �[35mmfu: 1.14%�[39m [rank0]:2024-03-27 14:30:05,067 - root - INFO - �[36mstep: 6 �[32mloss: 9.9451 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 72,642 �[35mmfu: 0.89%�[39m [rank0]:2024-03-27 14:30:05,165 - root - INFO - �[36mstep: 7 �[32mloss: 9.7004 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 85,096 �[35mmfu: 1.04%�[39m [rank0]:2024-03-27 14:30:05,251 - root - INFO - �[36mstep: 8 �[32mloss: 9.4422 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 95,860 �[35mmfu: 1.17%�[39m [rank0]:2024-03-27 14:30:05,399 - root - INFO - �[36mstep: 9 �[32mloss: 9.2144 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 55,837 �[35mmfu: 0.68%�[39m [rank0]:[rank0]:[W327 14:30:05.148473462 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:2024-03-27 14:30:05,496 - root - INFO - �[36mstep: 10 �[32mloss: 9.1710 �[33mmemory: 6.69GiB(7.04%) �[34mwps: 86,136 �[35mmfu: 1.05%�[39m [rank0]:NCCL version 2.20.5+cuda12.0 ``` Reviewers: Subscribers: Tasks: Tags: Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com> * Used per-parameter FSDP (#165) **Numeric Parity** 1D FSDP - Eager: 1k steps of minipile on 8 H100 GPUs, local batch size 8, sequence length 2048, AC/SAC, bf16 mixed precision, fp32 reduce-scatter - FSDP1 (AC): 24.81% peak active, 33.82% peak reserved, 6100-6200 WPS - FSDP1 (SAC): 52.98% peak active, 67.23% peak reserved, 6500-6700 WPS - FSDP2 (AC): 23.92% peak active, 32.64% peak reserved, 6100-6300 WPS - FSDP2 (SAC): 52.13% peak active, 62.51% peak reserved, 6600-6800 WPS - Loss curves match between FSDP1 and FSDP2 - Memory numbers reported as percentage since that is how they are logged; can convert against 95.0396 GiB GPU memory - Compile: same setup as eager - FSDP2 (AC), buffer reuse disabled: 28.72 GiB (30.22%) peak reserved, 7200-7500 WPS, 33% MFU - FSDP2 (AC), buffer reuse enabled: 28.90 GiB (30.40%) peak reserved, 7200-7500 WPS, 33% MFU - FSDP2 (SAC), buffer reuse enabled: 53.83 GiB (56.64%) peak reserved, 8100-8400 WPS, 36% MFU - Loss curves slightly better than eager - For fun -- how much can we push MFU? - If we use FSDP2 (SAC) with 16 local batch size (doubled), we get 88.23 GiB (92.84%) peak reserved, 8600 WPS, 38% MFU. - If we use FSDP2 (no AC) with 8 local batch size, we get 90.28 GiB (94.99%) peak reserved, 9100-9300 WPS, 40% MFU. - Why is FSDP2 faster? (1) fp32 reduce-scatter only uses one div kernel instead of two and (2), `reshard_after_forward=False` for the last transformer block 2D FSDP - Eager (2-way SP, 4-way FSDP): 1k steps of minipile on 8 H100 GPUs, local batch size 16 (to preserve global batch size), sequence length 2048, bf16 mixed precision, fp32 reduce-scatter - FSDP2 (AC): 50.12% peak active, 60.97% peak reserved, 5800-5900 WPS - FSDP2 (SAC): 76.49% peak active, 90.14% peak reserved, 6100-6300 WPS - Loss curves match 8-way FSDP - FSDP1 + SP has incorrect numerics due to the `FSDP.clip_grad_norm_` not all-reducing over TP mesh dimension <details> <summary> Loss curves </summary> <img width="732" alt="Screenshot 2024-03-26 at 3 31 19 PM" src="https://github.com/pytorch/torchtrain/assets/31054793/59ec71cc-ad0a-4dd1-b5c6-a8cbf9ab5e85"> </details> **Meta-Device Initialization** - The PyTorch Core guideline is for `module.reset_parameters()` to only initialize parameters/buffers immediately owned by `module` (i.e. `module.parameters(recurse=False)` and `module.buffers(recurse=False)`). - This makes it challenging to specify custom initializations for core modules like `nn.Linear` and `nn.Embedding`. For example, in @lessw2020's depth-wise truncated normal initialization, the `trunc_normal_` standard deviation depends on the layer ID, which is a property of the `TransformerBlock` but affects the child `nn.Linear`s. - To disambiguate, I suggest avoiding the name `reset_parameters()` in the case that we violate the PyTorch Core guideline and instead use a different name (e.g. `init_weights`). **DCP & Save/Load** - Tested 1D and 2D by specifying `checkpoint_folder = "/tmp/checkpoint_andgu` in the `.toml`, training until saving a checkpoint, terminating the run, and restarting the training to load the checkpoint -- the loss after loading looks reasonable * plot losses in loaded TrainState to TensorBoard ghstack-source-id: f13612ce1f739219c31aa2b9222259f9f586126b Pull Request resolved: https://github.com/pytorch/torchtrain/pull/173 * Removed setting global flag for `swap_tensors` since not needed anymore ghstack-source-id: 484237b30ba8bf8bb9e7a9cf2c97180d9fb21295 Pull Request resolved: https://github.com/pytorch/torchtrain/pull/178 * Add integration test with compile enabled (#183) Summary: same as title Test Plan: ``` + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0,1 + CONFIG_FILE=./train_configs/debug_model_compile.toml + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_compile.toml W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] ***************************************** W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] ***************************************** [rank0]:2024-04-01 17:54:35,779 - root - INFO - Starting job: LLaMA debug training [rank1]:2024-04-01 17:54:35,797 - root - INFO - Starting job: LLaMA debug training [rank0]:2024-04-01 17:54:36,063 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:2024-04-01 17:54:36,069 - root - INFO - Building 1-D device mesh with ['dp'], [4] [rank0]:2024-04-01 17:54:36,071 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-04-01 17:54:36,078 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank0]:2024-04-01 17:54:36,078 - root - INFO - Preparing alpaca dataset from HuggingFace [rank1]:2024-04-01 17:54:36,449 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank1]:2024-04-01 17:54:36,454 - root - INFO - Building 1-D device mesh with ['dp'], [4] [rank1]:2024-04-01 17:54:36,456 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model [rank1]:2024-04-01 17:54:36,463 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank1]:2024-04-01 17:54:36,463 - root - INFO - Preparing alpaca dataset from HuggingFace [rank0]:2024-04-01 17:54:37,631 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank0]:2024-04-01 17:54:37,643 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-04-01 17:54:37,644 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory [rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied selective activation checkpointing to the model [rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied FSDP to the model [rank1]:2024-04-01 17:54:38,310 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank1]:2024-04-01 17:54:38,324 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank1]:2024-04-01 17:54:38,325 - root - INFO - GPU capacity: NVIDIA H100 (1) with 95.04GiB memory [rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied selective activation checkpointing to the model [rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied FSDP to the model [rank1]:2024-04-01 17:54:38,699 - root - INFO - Gradient scaling not enabled [rank1]:2024-04-01 17:54:38,699 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754 [rank1]:2024-04-01 17:54:38,701 - root - INFO - Compiling model with torch.compile [rank0]:2024-04-01 17:54:38,692 - root - INFO - Gradient scaling not enabled [rank0]:2024-04-01 17:54:38,693 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754 [rank0]:2024-04-01 17:54:38,694 - root - INFO - Compiling model with torch.compile [rank0]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank1]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank1]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager. [rank1]: warnings.warn( [rank0]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager. [rank0]: warnings.warn( [rank1]:2024-04-01 17:54:40,498 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:40,493 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:41,992 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:41,985 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:42,180 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:42,187 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:43,947 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:43,963 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:43,971 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:43,920 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:43,951 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:43,974 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:44,029 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:44,033 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:45,907 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:45,933 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:47,561 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:47,667 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:47,649 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:47,706 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:49,084 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:49,108 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:49,110 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:49,086 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:49,114 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:49,131 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:50,546 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:50,638 - root - INFO - running build_ext [rank0]:2024-04-01 17:54:51,901 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:52,025 - root - INFO - running build_ext [rank1]:2024-04-01 17:54:52,734 - root - INFO - �[36mstep: 1 �[32mloss: 10.9746 �[33mmemory: 9.53GiB(10.03%) �[34mwps: 1,228 �[35mmfu: 0.02%�[39m [rank1]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05 [rank1]:2024-04-01 17:54:52,813 - root - INFO - �[36mstep: 2 �[32mloss: 10.9091 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 208,739 �[35mmfu: 2.56%�[39m [rank0]:2024-04-01 17:54:52,734 - root - INFO - �[36mstep: 1 �[32mloss: 10.9746 �[33mmemory: 9.53GiB(10.03%) �[34mwps: 1,228 �[35mmfu: 0.02%�[39m [rank0]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05 [rank0]:2024-04-01 17:54:52,813 - root - INFO - �[36mstep: 2 �[32mloss: 10.9091 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 208,501 �[35mmfu: 2.55%�[39m [rank1]:2024-04-01 17:54:52,889 - root - INFO - �[36mstep: 3 �[32mloss: 10.7722 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 219,416 �[35mmfu: 2.69%�[39m [rank0]:2024-04-01 17:54:52,889 - root - INFO - �[36mstep: 3 �[32mloss: 10.7722 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 219,182 �[35mmfu: 2.68%�[39m [rank1]:2024-04-01 17:54:52,965 - root - INFO - �[36mstep: 4 �[32mloss: 10.5428 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 218,226 �[35mmfu: 2.67%�[39m [rank0]:2024-04-01 17:54:52,965 - root - INFO - �[36mstep: 4 �[32mloss: 10.5428 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 218,015 �[35mmfu: 2.67%�[39m [rank1]:2024-04-01 17:54:53,045 - root - INFO - �[36mstep: 5 �[32mloss: 10.3063 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 207,094 �[35mmfu: 2.54%�[39m [rank0]:2024-04-01 17:54:53,045 - root - INFO - �[36mstep: 5 �[32mloss: 10.3063 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 207,220 �[35mmfu: 2.54%�[39m [rank1]:2024-04-01 17:54:53,123 - root - INFO - �[36mstep: 6 �[32mloss: 10.0707 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 210,814 �[35mmfu: 2.58%�[39m [rank1]:2024-04-01 17:54:53,202 - root - INFO - �[36mstep: 7 �[32mloss: 9.8302 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 209,649 �[35mmfu: 2.57%�[39m [rank0]:2024-04-01 17:54:53,123 - root - INFO - �[36mstep: 6 �[32mloss: 10.0707 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 210,849 �[35mmfu: 2.58%�[39m [rank0]:2024-04-01 17:54:53,202 - root - INFO - �[36mstep: 7 �[32mloss: 9.8302 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 209,542 �[35mmfu: 2.57%�[39m [rank0]:2024-04-01 17:54:53,281 - root - INFO - �[36mstep: 8 �[32mloss: 9.5918 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 211,690 �[35mmfu: 2.59%�[39m [rank1]:2024-04-01 17:54:53,281 - root - INFO - �[36mstep: 8 �[32mloss: 9.5918 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 211,786 �[35mmfu: 2.59%�[39m [rank1]:2024-04-01 17:54:53,412 - root - INFO - �[36mstep: 9 �[32mloss: 9.4299 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 125,833 �[35mmfu: 1.54%�[39m [rank1]:[rank1]:[W401 17:54:53.242673953 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:2024-04-01 17:54:53,412 - root - INFO - �[36mstep: 9 �[32mloss: 9.4299 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 125,765 �[35mmfu: 1.54%�[39m [rank0]:[rank0]:[W401 17:54:53.240925776 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank1]:2024-04-01 17:54:53,492 - root - INFO - �[36mstep: 10 �[32mloss: 9.2955 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 207,661 �[35mmfu: 2.54%�[39m [rank0]:2024-04-01 17:54:53,492 - root - INFO - �[36mstep: 10 �[32mloss: 9.2955 �[33mmemory: 9.54GiB(10.03%) �[34mwps: 207,426 �[35mmfu: 2.54%�[39m [rank0]:NCCL version 2.20.5+cuda12.0 ``` Reviewers: Subscribers: Tasks: Tags: --------- Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com> * remove folding and unfolding of sequence dim in model.py ghstack-source-id: 5d299adcd766baad6a36e63be4acc01fb2fd36db Pull Request resolved: https://github.com/pytorch/torchtrain/pull/190 * bump comm.train_timeout_seconds (#189) this PR bumps this default config to a larger value, as profiling is pretty heavy step, a default 5 seconds would likely trigger watchdog unintentionally * fix checkpoint parser ghstack-source-id: 47ee7b5e2228705e5215195ac9ff13e1b168f93e Pull Request resolved: https://github.com/pytorch/torchtrain/pull/197 * support sequence of tests and add checkpoint test address comments ghstack-source-id: 7d6c51a5ef68dea06ba7d64741a554165c79f1d3 Pull Request resolved: https://github.com/pytorch/torchtrain/pull/198 * Make freqs_cis a persistent buffer for pp init currently, planning to use a 'seed checkpoint' to initialize the pipeline parallel model chunks after moving them from meta device to cuda/empty. non-persistent buffers are incompatible with this approach, as they are missing from the checkpoint and thus require manual init. an alternative is to manually run the initializer for just the non-persistent buffers after loading a seed-checkpoint, but this approach is nearly equivalent with less code changes. ghstack-source-id: b48228488d4c3924fffef4237f4106383c14a934 Pull Request resolved: https://github.com/pytorch/torchtrain/pull/201 * Delete grad scaler, which is unsupported/unused grad scaler currently doesn't work with FSDP2, and isn't enabled anyway becuase bf16 training is the norm and doens't require it. remove it for simplicity. it will be easier to enable pipeline parallelism with a simplier loss function setup, but if desired, its still possible to support pipeline parallelism with the scaler added back in. ghstack-source-id: 82b0e4324eac88ee62723a6d832182d4e6c76e0f Pull Request resolved: https://github.com/pytorch/torchtrain/pull/202 * Factor out loss_fn to share code with pipeline par PP requires feeding a loss_fn into the schedule's step so that loss can be computed per microbatch as part of the forward/backward scheduling. As such, it is nice to define loss once and use it both in the non-pp code that manually calls f/loss/b and also use it in the pp step(). ghstack-source-id: 9bedd5103e23627d5e268c287d49f0759442ba12 Pull Request resolved: https://github.com/pytorch/torchtrain/pull/203 * [TorchTrain] Minor fix for #197 (#204) The changes made in github editor didn't go in when doing ghstack land. * Add FusedRMSNorm (Triton kernel, +15% eager), Add NPLayerNorm, Enable config selectable Norm Type (#181) This PR has multiple aspects: 1 - Adds a new Triton based Fused RMSNorm I wrote. I've verified it's numerical accuracy on both forward and backward with a unit test. It improves MFU by +15% with FSDP2 7B, and compiled slightly by +1.2%: <img width="545" alt="Screenshot 2024-03-29 at 5 18 14 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/8f16fae9-947b-4720-a370-b954779c33a7"> 2 - Adds norms.py to house all 4 norm types, and standardizes to [layernorm / np_layernorm / rmsnorm / fused_rmsnorm]. Norms.py has a create_norms function that then creates the appropriate norm. 3 - Adds np_layernorm, which is layernorm with no affine transformation. 4 - Updates model.py to now support plug and play of any supported norm. Thus instead of this type of if/then logic in the model class: <img width="928" alt="Screenshot 2024-03-30 at 1 52 07 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/ba7cb976-580f-4471-a79b-a584f7d20693"> We simply have this: <img width="1129" alt="Screenshot 2024-03-30 at 1 52 23 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/aba48b4d-1620-4059-840d-e620468f00f2"> This then allows for easy plug and play of any norm type with no fiddling around in the model code. 5 - updates run_llama_train.sh to randomly select a port vs previous fixed port number. (thanks @yifuwang for this tip!) 6 - Now users can quickly select the norm of their choice via the config file: <img width="774" alt="Screenshot 2024-03-30 at 3 01 43 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/3238b375-dc21-4ee2-a5fa-f6571da79edb"> 7 - adds a NotImpl error if users try to run TP + fused_rnsmorm to avoid any confusion (per @tianyu-l feedback): ~~~ NotImplementedError: fused_rmsnorm not yet compatible with TP. Please use rmsnorm. ~~~ * remove .item() per iter ghstack-source-id: ab29c214604fd76cefdfe70149ecf07a2e03103e Pull Request resolved: https://github.com/pytorch/torchtrain/pull/206 * Removed cache_k and cache_v comments ghstack-source-id: 8bc66c683a801189b152b0ef4301579ec1ec17e7 Pull Request resolved: https://github.com/pytorch/torchtrain/pull/213 * Some more cleanups ghstack-source-id: a53cbbecc35eac2a62d8ebc241462ac418666336 Pull Request resolved: https://github.com/pytorch/torchtrain/pull/212 * avoid record streams and make color printing a config ghstack-source-id: 1c7cb2710330ec3fb2384793b5ad77c65b107cbc Pull Request resolved: https://github.com/pytorch/torchtrain/pull/195 * fix SAC to use the correct reduce_scatter op (#215) as titled, we migrated to the native functional collective so the SAC should capture this instead of the old one * Test runner raises exception on failures (#216) Summary: Test runner should raise exception on failures. Test Plan: ``` =====Integration test, flavor : , command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh ===== + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0 + CONFIG_FILE=./train_configs/debug_model.toml + overrides= + '[' 0 -ne 0 ']' =====Integration test, flavor : 1D compile, command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh --training.compile===== + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=--training.compile + NGPU=4 + LOG_RANK=0 + CONFIG_FILE=./train_configs/debug_model.toml + overrides= + '[' 1 -ne 0 ']' + overrides=--training.compile + torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.compile W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] ***************************************** W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] ***************************************** [rank0]:2024-04-10 13:32:45,243 - root - INFO - Starting job: LLaMA debug training [rank0]:2024-04-10 13:32:45,676 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:2024-04-10 13:32:46,028 - root - INFO - Building 1-D device mesh with ['dp'], [4] [rank0]:2024-04-10 13:32:46,030 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-04-10 13:32:46,038 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2 [rank0]:2024-04-10 13:32:46,038 - root - INFO - Preparing alpaca dataset from HuggingFace [rank0]:2024-04-10 13:32:47,813 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='fused_rmsnorm') [rank0]:2024-04-10 13:32:47,826 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-04-10 13:32:47,826 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory [rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied selective activation checkpointing to the model [rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied FSDP to the model [rank0]:2024-04-10 13:32:48,582 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%) [rank0]:2024-04-10 13:32:48,582 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1332 [rank0]:2024-04-10 13:32:48,584 - root - INFO - Compiling model with torch.compile [rank0]:2024-04-10 13:32:49,384 - root - INFO - Training starts at step 1 [rank0]:2024-04-10 13:32:49,385 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank0]:[rank0]:W0410 13:32:49.487000 139672077292544 torch/_logging/_internal.py:1016] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored [rank0]:[rank0]: Traceback (most recent call last): [rank0]:[rank0]: File "/data/users/gnadathur/a/torchtitan/train.py", line 394, in <module> [rank0]:[rank0]: main(config) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper [rank0]:[rank0]: return f(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/torchtitan/train.py", line 287, in main [rank0]:[rank0]: pred = model(input_ids) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [rank0]:[rank0]: return self._call_impl(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1541, in _call_impl [rank0]:[rank0]: return forward_call(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/eval_frame.py", line 410, in _fn [rank0]:[rank0]: return fn(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [rank0]:[rank0]: return self._call_impl(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1582, in _call_impl [rank0]:[rank0]: result = forward_call(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 966, in catch_errors [rank0]:[rank0]: return callback(frame, cache_entry, hooks, frame_state, skip=1) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 809, in _convert_frame [rank0]:[rank0]: result = inner_convert( [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 404, in _convert_frame_assert [rank0]:[rank0]: return _compile( [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_utils_internal.py", line 70, in wrapper_function [rank0]:[rank0]: return function(*args, **kwargs) [rank0]:[rank0]: File "/home/gnadathur/local/a/pytorch-env/lib/python3.10/contextlib.py", line 79, in inner [rank0]:[rank0]: return func(*args, **kwds) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 691, in _compile [rank0]:[rank0]: guarded_code = compile_inner(code, one_graph, hooks, transform) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper [rank0]:[rank0]: r = func(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 546, in compile_inner [rank0]:[rank0]: out_code = transform_code_object(code, transform) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1103, in transform_code_object [rank0]:[rank0]: transformations(instructions, code_options) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 168, in _fn [rank0]:[rank0]: return fn(*args, **kwargs) [rank0]:[rank0]: File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 508, in transform [rank0]:[rank0]: tracer.run() [rank0]:[rank0]: File "/data/u…

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 26, 2024

awgu force-pushed the per_param_land branch 2 times, most recently from e9a9c11 to 52e7e01 Compare March 26, 2024 19:48

awgu marked this pull request as ready for review March 26, 2024 21:18

awgu commented Mar 26, 2024

View reviewed changes

tianyu-l reviewed Mar 26, 2024

View reviewed changes

tianyu-l mentioned this pull request Mar 26, 2024

simple meta init #164

Closed

wanchaol reviewed Mar 27, 2024

View reviewed changes

awgu force-pushed the per_param_land branch 3 times, most recently from ee5087b to dbb793a Compare March 27, 2024 19:09

awgu requested review from tianyu-l and wanchaol March 27, 2024 19:19

wanchaol approved these changes Mar 27, 2024

View reviewed changes

tianyu-l approved these changes Mar 27, 2024

View reviewed changes

Used per-parameter FSDP

568cc2f

awgu force-pushed the per_param_land branch from dbb793a to 568cc2f Compare March 27, 2024 20:08

awgu commented Mar 27, 2024

View reviewed changes

This was referenced Mar 27, 2024

Validate 1D FSDP2 parity with FSDP1 #106

Closed

Validate FSDP2 + SP parity with FSDP1 + SP #107

Closed

awgu merged commit 6d3d906 into pytorch:main Mar 28, 2024
4 checks passed

awgu deleted the per_param_land branch March 28, 2024 18:54

gnadathur mentioned this pull request Apr 9, 2024

Added initial FSDP readme #209

Merged

152334H mentioned this pull request Jun 12, 2024

Speed benchmarks vs FSDP2 yandex/YaFSDP#3

Open

awgu mentioned this pull request Jun 21, 2024

Incorrect behavior of dtensor full_tensor for TP+FSDP2 pytorch/pytorch#129229

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Used per-parameter FSDP #165

Used per-parameter FSDP #165

awgu commented Mar 26, 2024 •

edited

Loading

awgu Mar 26, 2024

tianyu-l left a comment

tianyu-l Mar 26, 2024

awgu Mar 27, 2024

tianyu-l Mar 26, 2024

awgu Mar 27, 2024

wanchaol left a comment

wanchaol Mar 27, 2024

awgu Mar 27, 2024

wanchaol Mar 27, 2024

awgu Mar 27, 2024

wanchaol left a comment

wanchaol Mar 27, 2024

awgu commented Mar 27, 2024

tianyu-l left a comment

awgu Mar 27, 2024

awgu Mar 27, 2024

awgu commented Mar 28, 2024

awgu commented Mar 28, 2024

Used per-parameter FSDP #165

Used per-parameter FSDP #165

Conversation

awgu commented Mar 26, 2024 • edited Loading

Choose a reason for hiding this comment

tianyu-l left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wanchaol left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wanchaol left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

awgu commented Mar 27, 2024

tianyu-l left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

awgu commented Mar 28, 2024

awgu commented Mar 28, 2024

awgu commented Mar 26, 2024 •

edited

Loading