Used per-parameter FSDP #165

Merged (1 commit into pytorch:main) on Mar 28, 2024
Conversation

@awgu (Contributor) commented Mar 26, 2024

Numeric Parity
1D FSDP

  • Eager: 1k steps of minipile on 8 H100 GPUs, local batch size 8, sequence length 2048, AC/SAC, bf16 mixed precision, fp32 reduce-scatter
    • FSDP1 (AC): 24.81% peak active, 33.82% peak reserved, 6100-6200 WPS
    • FSDP1 (SAC): 52.98% peak active, 67.23% peak reserved, 6500-6700 WPS
    • FSDP2 (AC): 23.92% peak active, 32.64% peak reserved, 6100-6300 WPS
    • FSDP2 (SAC): 52.13% peak active, 62.51% peak reserved, 6600-6800 WPS
    • Loss curves match between FSDP1 and FSDP2
    • Memory numbers reported as percentage since that is how they are logged; can convert against 95.0396 GiB GPU memory
  • Compile: same setup as eager
    • FSDP2 (AC), buffer reuse disabled: 28.72 GiB (30.22%) peak reserved, 7200-7500 WPS, 33% MFU
    • FSDP2 (AC), buffer reuse enabled: 28.90 GiB (30.40%) peak reserved, 7200-7500 WPS, 33% MFU
    • FSDP2 (SAC), buffer reuse enabled: 53.83 GiB (56.64%) peak reserved, 8100-8400 WPS, 36% MFU
    • Loss curves slightly better than eager
    • For fun -- how much can we push MFU?
      • If we use FSDP2 (SAC) with 16 local batch size (doubled), we get 88.23 GiB (92.84%) peak reserved, 8600 WPS, 38% MFU.
      • If we use FSDP2 (no AC) with 8 local batch size, we get 90.28 GiB (94.99%) peak reserved, 9100-9300 WPS, 40% MFU.
  • Why is FSDP2 faster? (1) The fp32 reduce-scatter uses only one div kernel instead of two, and (2) reshard_after_forward=False for the last transformer block
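
As a rough illustration of the setup benchmarked above (not the actual parallelize_llama.py code), per-parameter FSDP can be applied per transformer block with bf16 compute, fp32 reduce-scatter, and the last-block reshard trick. The dp_mesh argument and model.layers container are assumptions, and the fully_shard import path reflects its location around the time of this PR:

```
import torch
from torch.distributed._composable.fsdp import MixedPrecisionPolicy, fully_shard

def apply_fsdp2(model, dp_mesh):
    # bf16 all-gather/compute, fp32 reduce-scatter (single div kernel)
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16, reduce_dtype=torch.float32
    )
    num_blocks = len(model.layers)
    for layer_id, block in enumerate(model.layers):
        fully_shard(
            block,
            mesh=dp_mesh,
            mp_policy=mp_policy,
            # The last block's parameters are needed right away in backward,
            # so skipping its reshard after forward is essentially free.
            reshard_after_forward=(layer_id < num_blocks - 1),
        )
    fully_shard(model, mesh=dp_mesh, mp_policy=mp_policy)
    return model
```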

2D FSDP

  • Eager (2-way SP, 4-way FSDP): 1k steps of minipile on 8 H100 GPUs, local batch size 16 (to preserve global batch size), sequence length 2048, bf16 mixed precision, fp32 reduce-scatter
    • FSDP2 (AC): 50.12% peak active, 60.97% peak reserved, 5800-5900 WPS
    • FSDP2 (SAC): 76.49% peak active, 90.14% peak reserved, 6100-6300 WPS
  • Loss curves match 8-way FSDP
  • FSDP1 + SP has incorrect numerics because FSDP.clip_grad_norm_ does not all-reduce the gradient norm over the TP mesh dimension (FSDP2 avoids this; see the sketch below)
Loss curves: see attached screenshot (2024-03-26).
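
Because per-parameter FSDP keeps parameters and gradients as DTensors, the stock clipping utility (the same call that appears in the train.py hunk reviewed below) computes the total gradient norm over every mesh dimension, dp and tp alike. A minimal sketch, assuming `model` and `job_config` as in train.py:

```
import torch

# DTensor gradients: the norm reduction spans the full 2D mesh, so no
# FSDP-specific clip_grad_norm_ wrapper is needed.
total_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(), job_config.training.max_norm
)
```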

Meta-Device Initialization

  • The PyTorch Core guideline is for module.reset_parameters() to only initialize parameters/buffers immediately owned by module (i.e. module.parameters(recurse=False) and module.buffers(recurse=False)).
  • This makes it challenging to specify custom initializations for core modules like nn.Linear and nn.Embedding. For example, in @lessw2020's depth-wise truncated normal initialization, the trunc_normal_ standard deviation depends on the layer ID, which is a property of the TransformerBlock but affects the child nn.Linears.
  • To disambiguate, I suggest avoiding the name reset_parameters() in cases where we violate the PyTorch Core guideline, and instead using a different name (e.g. init_weights).
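
For concreteness, a minimal sketch of the proposed naming convention: a hypothetical TransformerBlock whose init_weights() re-initializes its child nn.Linear weights with a depth-dependent truncated-normal std (the std formula here is made up for illustration and is not the actual scheme):

```
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, dim: int):
        super().__init__()
        self.layer_id = layer_id
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)

    def init_weights(self):
        # Deliberately not named reset_parameters(): it touches child modules'
        # parameters, which the Core guideline reserves for each module itself.
        init_std = 0.02 / (2 * (self.layer_id + 1)) ** 0.5  # hypothetical depth-dependent std
        for linear in (self.wq, self.wo):
            nn.init.trunc_normal_(linear.weight, mean=0.0, std=init_std)
```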

DCP & Save/Load

  • Tested 1D and 2D by specifying checkpoint_folder = "/tmp/checkpoint_andgu" in the .toml, training until a checkpoint was saved, terminating the run, and restarting training to load the checkpoint -- the loss after loading looks reasonable
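
Roughly, the save/load path being exercised looks like the following. This is a sketch against torch.distributed.checkpoint (DCP) directly rather than the project's checkpoint wrapper, assuming a `model` built and parallelized as in train.py; the folder mirrors the .toml setting above:

```
import torch.distributed.checkpoint as dcp

CHECKPOINT_FOLDER = "/tmp/checkpoint_andgu"

# Save: DTensor parameters from FSDP2/TP are written as sharded tensors
# under CHECKPOINT_FOLDER.
state = {"model": model.state_dict()}
dcp.save(state, checkpoint_id=CHECKPOINT_FOLDER)

# Load (after rebuilding and re-parallelizing the model the same way):
state = {"model": model.state_dict()}
dcp.load(state, checkpoint_id=CHECKPOINT_FOLDER)
model.load_state_dict(state["model"])
```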

@facebook-github-bot added the CLA Signed label on Mar 26, 2024
@awgu force-pushed the per_param_land branch 2 times, most recently from e9a9c11 to 52e7e01 on March 26, 2024 19:48
@awgu marked this pull request as ready for review on March 26, 2024 21:18
transformer_block = checkpoint_wrapper(
transformer_block, job_config.activation_checkpoint
)
# As an optimization, do not reshard after forward for the last

awgu (Contributor, Author) commented on this hunk:

I am open to not including this 'trick' since it might be confusing. The idea is that we can basically set reshard_after_forward=False for the last transformer block for free.

@tianyu-l (Contributor) left a comment:

This is wonderful work!
Left some comments, some of which are my questions.

torchtrain/models/llama/model.py (review thread, outdated and resolved)
@@ -333,13 +313,13 @@ def __init__(self, model_args: ModelArgs):
super().__init__()
self.model_args = model_args
self.tok_embeddings = nn.Embedding(model_args.vocab_size, model_args.dim)
self.init_weights()

A reviewer (Contributor) commented on this hunk:

It seems self.init_weights() or self.reset_parameters() is called in all but the Attention and FeedForward modules (probably because init_std is not available during __init__?).

This creates a bit of inconsistency in how many times a parameter/buffer gets initialized. Does it make sense to unify the behavior, e.g. have all init_weights()/reset_parameters() calls made from the parent module rather than from the Transformer itself?

awgu (Contributor, Author) replied:

Following offline discussion, I changed it so that self.init_weights() is only called in Transformer.__init__() and not in any other __init__(). This meant one change to the RotaryEmbedding.__init__() to register the freqs_cis buffer. The rest remains the same.

@@ -359,6 +339,16 @@ def forward(self, tokens: torch.Tensor):
        freqs_cis = self.freqs_cis[0:seqlen]
        return h, freqs_cis

    def init_weights(self):
        if hasattr(self, "freqs_cis"):

A reviewer (Contributor) commented on this hunk:

Am I understanding correctly that, currently, each branch of this if-else will be called once during meta init, and that the first branch will be called again when model.init_weights() is called?

awgu (Contributor, Author) replied:

Yep!

@tianyu-l mentioned this pull request Mar 26, 2024

@wanchaol (Contributor) left a comment:

Looks great on first pass! I mainly have some confusion about the meta-init part.

@@ -207,19 +205,10 @@ def __init__(self, model_args: ModelArgs):
model_args.n_heads * self.head_dim, model_args.dim, bias=False
)

def reset_parameters(self, init_std):

A reviewer (Contributor) commented on this hunk:

Actually I have some confusion about the reset_parameters guideline: reset_parameters is an optional method on nn.Module, and calling the parent module's reset_parameters() does not recursively call into the submodules' reset_parameters().

This means that if the guideline is that each module should ONLY be responsible for its own parameters, the user has to loop over all submodules in the module tree and call them individually?

And if that's the case, if the user decides not to recursively loop over submodules, they can simply define reset_parameters to re-init their own parameters plus their leaf modules' parameters, just like we did previously (i.e. in Attention we can also re-init the q/k/v linears). Then the user can simply call reset_parameters() on their defined root module and not worry about the attention layer's wq/wk/wv being overridden by the built-in nn.Linear.reset_parameters() call, since that would never be called. This might be something users already do, since they may want to control how the submodule init works themselves?

Not sure if you get my question haha, am I missing something there?

awgu (Contributor, Author) replied:

> This means that if the guideline is that each module should ONLY be responsible for its own parameters, the user has to loop over all submodules in the module tree and call them individually?

This is my understanding.

> And if that's the case, if the user decides not to recursively loop over submodules, they can simply define reset_parameters to re-init their own parameters plus their leaf modules' parameters, just like we did previously (i.e. in Attention we can also re-init the q/k/v linears). Then the user can simply call reset_parameters() on their defined root module and not worry about the attention layer's wq/wk/wv being overridden by the built-in nn.Linear.reset_parameters() call, since that would never be called. This might be something users already do, since they may want to control how the submodule init works themselves?

I agree with the approach you are mentioning:

  • if we ignore FSDP
  • if we are using FSDP1 and every weight init does not depend on the original tensor shape

It happens to be that the weight init used for the Llama model in torchtrain does not depend on the original tensor shape (namely, the weight init is elementwise). However, this may not be the case for other models (e.g. those that compute fan-in/fan-out), in which case this approach would silently sample from the incorrect distribution.

FSDP1 calls reset_parameters() before sharding.

  • The current approach is aligned with the core guideline, so for FullyShardedDataParallel(module), FSDP1 calls submodule.reset_parameters() for each managed submodule in module.modules() (managed is defined by excluding any nested FullyShardedDataParallel modules or their children). This is the only way to ensure that each parameter is initialized exactly once.
  • If a parent Attention module re-initialized its Q/K/V linear modules, then FSDP1 would initialize the Q/K/V linears twice (once from Linear.reset_parameters() and once from Attention.reset_parameters()). This can still give a valid probability distribution, but it could give different values for a fixed seed than if Linear.reset_parameters() were skipped (e.g. if not using FSDP and just calling model.reset_parameters() on the root model). This is not a major problem since it does not mean incorrect randomness, but it is still worth mentioning.
  • If we further call model.reset_parameters() after sharding with FSDP1, then we have 1D flattened sharded tensors, which no longer preserve the original tensor shape. Therefore, calling model.reset_parameters() at this point will give incorrect randomness in cases depending on the shape.

In summary, following the core guideline is the only way to guarantee that each parameter is initialized once and before sharding. The constraint to initialize once is not required for correct randomness but may help reproducibility.
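
For illustration, the core guideline boils down to something like this hypothetical loop (not the actual FSDP1 code), run before sharding, while tensors still have their original shapes:

```
import torch.nn as nn

def init_all_params_once(model: nn.Module) -> None:
    # Each module initializes only its own parameters/buffers (recurse=False
    # semantics), and every module is visited exactly once.
    for module in model.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()
```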

The reviewer (Contributor) replied:

I see, OK, this makes sense: it is critical to initialize only once for reproducibility when starting from a fixed seed.

awgu (Contributor, Author) replied:

At the same time, though, the DTensor RNG will be different from the local RNG, so I am not sure this reproducibility argument holds: we would not be able to ensure the same results for FSDP2 compared to a single-GPU non-DTensor setup.

torchtrain/parallelisms/parallelize_llama.py (two review threads, resolved)
@awgu force-pushed the per_param_land branch 3 times, most recently from ee5087b to dbb793a on March 27, 2024 19:09
@awgu requested review from tianyu-l and wanchaol on March 27, 2024 19:19

@wanchaol (Contributor) left a comment:

Nice work! LGTM :)

torch.nn.utils.clip_grad_norm_(
    model.parameters(), job_config.training.max_norm
)

A reviewer (Contributor) commented on this hunk:

I like the fact that it composes with the existing implementation instead of using a separate one!

@awgu (Contributor, Author) commented Mar 27, 2024

After pytorch/pytorch#122801 lands, the save/load with torch.compile should work. (I tested locally.)

@tianyu-l (Contributor) left a comment:

Looks great to me!

@@ -199,7 +197,6 @@ def main(job_config: JobConfig):

# torch.compile model for improved performance
if job_config.training.compile:
torch._inductor.config.allow_buffer_reuse = False

awgu (Contributor, Author) commented on this hunk:

Since pytorch/pytorch#122444 landed, we can re-enable buffer reuse.
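
A minimal sketch of what this simplifies to (stand-in model; the project's train.py gates this on job_config.training.compile as shown in the hunk above):

```
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in model for the sketch

# Workaround removed in this diff, kept for reference: buffer reuse had been
# disabled under compile to match eager numerics, and can now stay at its
# default after the inductor fix.
# torch._inductor.config.allow_buffer_reuse = False

model = torch.compile(model)
```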

@@ -186,6 +179,11 @@ def main(job_config: JobConfig):
model = models_parallelize_fns[model_name](
model, world_mesh, parallel_dims, job_config
)
# set this as required by DTensor to work with `to_empty`
# TODO: remove in the future when enabled by default for wrapper subclasses
torch.__future__.set_swap_module_params_on_conversion(True)

awgu (Contributor, Author) commented on this hunk:

After pytorch/pytorch#122755, we can remove this call.
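
Roughly, the meta-device flow around this hunk looks like the sketch below. The names follow the diff above; build_model is a hypothetical stand-in for the actual model constructor:

```
import torch

with torch.device("meta"):
    model = build_model(model_args)  # parameters allocated on the meta device

model = models_parallelize_fns[model_name](
    model, world_mesh, parallel_dims, job_config
)

# Needed for to_empty() to swap in DTensor parameters until
# pytorch/pytorch#122755 lands; can be removed afterwards.
torch.__future__.set_swap_module_params_on_conversion(True)

model.to_empty(device="cuda")  # allocate real (sharded) storage
model.init_weights()           # initialize values exactly once
```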

@awgu (Contributor, Author) commented Mar 28, 2024

If anything breaks because of this PR, please ping me :)

@awgu merged commit 6d3d906 into pytorch:main on Mar 28, 2024
4 checks passed
@awgu deleted the per_param_land branch on March 28, 2024 18:54

@awgu (Contributor, Author) commented Mar 28, 2024

Local batch size 6, torch.compile, bf16 mixed precision, no AC, reshard_after_forward=False for all transformer blocks, 8x H100s:
9250-9400 WPS, 40.9-41.5% MFU

lessw2020 pushed a commit that referenced this pull request Apr 18, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Sep 8, 2024
8100-8400 WPS, 36% MFU
    - Loss curves slightly better than eager
    - For fun -- how much can we push MFU?
- If we use FSDP2 (SAC) with 16 local batch size (doubled), we get 88.23
GiB (92.84%) peak reserved, 8600 WPS, 38% MFU.
- If we use FSDP2 (no AC) with 8 local batch size, we get 90.28 GiB
(94.99%) peak reserved, 9100-9300 WPS, 40% MFU.
- Why is FSDP2 faster? (1) fp32 reduce-scatter only uses one div kernel
instead of two and (2), `reshard_after_forward=False` for the last
transformer block

2D FSDP
- Eager (2-way SP, 4-way FSDP): 1k steps of minipile on 8 H100 GPUs,
local batch size 16 (to preserve global batch size), sequence length
2048, bf16 mixed precision, fp32 reduce-scatter
- FSDP2 (AC): 50.12% peak active, 60.97% peak reserved, 5800-5900 WPS
- FSDP2 (SAC): 76.49% peak active, 90.14% peak reserved, 6100-6300 WPS
- Loss curves match 8-way FSDP
- FSDP1 + SP has incorrect numerics due to the `FSDP.clip_grad_norm_`
not all-reducing over TP mesh dimension

<details>
<summary> Loss curves </summary>

<img width="732" alt="Screenshot 2024-03-26 at 3 31 19 PM"
src="https://github.com/pytorch/torchtrain/assets/31054793/59ec71cc-ad0a-4dd1-b5c6-a8cbf9ab5e85">

</details>


**Meta-Device Initialization**
- The PyTorch Core guideline is for `module.reset_parameters()` to only
initialize parameters/buffers immediately owned by `module` (i.e.
`module.parameters(recurse=False)` and `module.buffers(recurse=False)`).
- This makes it challenging to specify custom initializations for core
modules like `nn.Linear` and `nn.Embedding`. For example, in
@lessw2020's depth-wise truncated normal initialization, the
`trunc_normal_` standard deviation depends on the layer ID, which is a
property of the `TransformerBlock` but affects the child `nn.Linear`s.
- To disambiguate, I suggest avoiding the name `reset_parameters()` in
the case that we violate the PyTorch Core guideline and instead use a
different name (e.g. `init_weights`).

**DCP & Save/Load**
- Tested 1D and 2D by specifying `checkpoint_folder =
"/tmp/checkpoint_andgu` in the `.toml`, training until saving a
checkpoint, terminating the run, and restarting the training to load the
checkpoint -- the loss after loading looks reasonable

* plot losses in loaded TrainState to TensorBoard

ghstack-source-id: f13612ce1f739219c31aa2b9222259f9f586126b
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/173

* Removed setting global flag for `swap_tensors` since not needed anymore

ghstack-source-id: 484237b30ba8bf8bb9e7a9cf2c97180d9fb21295
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/178
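
For context, my understanding (treat this as an assumption, not a statement about the repo) is that the global flag in question is the one sketched below, which earlier meta-device initialization required to be set globally:

```python
import torch

# Previously set once at startup so that loading a state dict into a
# meta-device module swapped the module parameters in place; per the change
# above, setting this global flag is no longer needed. (Assumed to be the
# flag the commit refers to.)
torch.__future__.set_swap_module_params_on_conversion(True)
```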

* Add integration test with compile enabled (#183)

Summary:
same as title

Test Plan:
```

+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model_compile.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_compile.toml
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757]
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] *****************************************
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] *****************************************
[rank0]:2024-04-01 17:54:35,779 - root - INFO - Starting job: LLaMA debug training
[rank1]:2024-04-01 17:54:35,797 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-04-01 17:54:36,063 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-04-01 17:54:36,069 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-04-01 17:54:36,071 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-04-01 17:54:36,078 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-04-01 17:54:36,078 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank1]:2024-04-01 17:54:36,449 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank1]:2024-04-01 17:54:36,454 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank1]:2024-04-01 17:54:36,456 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-04-01 17:54:36,463 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank1]:2024-04-01 17:54:36,463 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-04-01 17:54:37,631 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-04-01 17:54:37,643 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m
[rank0]:2024-04-01 17:54:37,644 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied FSDP to the model
[rank1]:2024-04-01 17:54:38,310 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank1]:2024-04-01 17:54:38,324 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m
[rank1]:2024-04-01 17:54:38,325 - root - INFO - GPU capacity: NVIDIA H100 (1) with 95.04GiB memory
[rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied selective activation checkpointing to the model
[rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied FSDP to the model
[rank1]:2024-04-01 17:54:38,699 - root - INFO - Gradient scaling not enabled
[rank1]:2024-04-01 17:54:38,699 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754
[rank1]:2024-04-01 17:54:38,701 - root - INFO - Compiling model with torch.compile
[rank0]:2024-04-01 17:54:38,692 - root - INFO - Gradient scaling not enabled
[rank0]:2024-04-01 17:54:38,693 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754
[rank0]:2024-04-01 17:54:38,694 - root - INFO - Compiling model with torch.compile
[rank0]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank1]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank1]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank1]:  warnings.warn(
[rank0]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank0]:  warnings.warn(
[rank1]:2024-04-01 17:54:40,498 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:40,493 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:41,992 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:41,985 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:42,180 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:42,187 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:43,947 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:43,963 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:43,971 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:43,920 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:43,951 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:43,974 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:44,029 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:44,033 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:45,907 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:45,933 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:47,561 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:47,667 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:47,649 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:47,706 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:49,084 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:49,108 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:49,110 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:49,086 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:49,114 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:49,131 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:50,546 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:50,638 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:51,901 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:52,025 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:52,734 - root - INFO - �[36mstep:  1  �[32mloss: 10.9746  �[33mmemory:  9.53GiB(10.03%)  �[34mwps: 1,228  �[35mmfu: 0.02%�[39m
[rank1]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank1]:2024-04-01 17:54:52,813 - root - INFO - �[36mstep:  2  �[32mloss: 10.9091  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 208,739  �[35mmfu: 2.56%�[39m
[rank0]:2024-04-01 17:54:52,734 - root - INFO - �[36mstep:  1  �[32mloss: 10.9746  �[33mmemory:  9.53GiB(10.03%)  �[34mwps: 1,228  �[35mmfu: 0.02%�[39m
[rank0]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-04-01 17:54:52,813 - root - INFO - �[36mstep:  2  �[32mloss: 10.9091  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 208,501  �[35mmfu: 2.55%�[39m
[rank1]:2024-04-01 17:54:52,889 - root - INFO - �[36mstep:  3  �[32mloss: 10.7722  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 219,416  �[35mmfu: 2.69%�[39m
[rank0]:2024-04-01 17:54:52,889 - root - INFO - �[36mstep:  3  �[32mloss: 10.7722  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 219,182  �[35mmfu: 2.68%�[39m
[rank1]:2024-04-01 17:54:52,965 - root - INFO - �[36mstep:  4  �[32mloss: 10.5428  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 218,226  �[35mmfu: 2.67%�[39m
[rank0]:2024-04-01 17:54:52,965 - root - INFO - �[36mstep:  4  �[32mloss: 10.5428  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 218,015  �[35mmfu: 2.67%�[39m
[rank1]:2024-04-01 17:54:53,045 - root - INFO - �[36mstep:  5  �[32mloss: 10.3063  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 207,094  �[35mmfu: 2.54%�[39m
[rank0]:2024-04-01 17:54:53,045 - root - INFO - �[36mstep:  5  �[32mloss: 10.3063  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 207,220  �[35mmfu: 2.54%�[39m
[rank1]:2024-04-01 17:54:53,123 - root - INFO - �[36mstep:  6  �[32mloss: 10.0707  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 210,814  �[35mmfu: 2.58%�[39m
[rank1]:2024-04-01 17:54:53,202 - root - INFO - �[36mstep:  7  �[32mloss:  9.8302  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 209,649  �[35mmfu: 2.57%�[39m
[rank0]:2024-04-01 17:54:53,123 - root - INFO - �[36mstep:  6  �[32mloss: 10.0707  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 210,849  �[35mmfu: 2.58%�[39m
[rank0]:2024-04-01 17:54:53,202 - root - INFO - �[36mstep:  7  �[32mloss:  9.8302  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 209,542  �[35mmfu: 2.57%�[39m
[rank0]:2024-04-01 17:54:53,281 - root - INFO - �[36mstep:  8  �[32mloss:  9.5918  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 211,690  �[35mmfu: 2.59%�[39m
[rank1]:2024-04-01 17:54:53,281 - root - INFO - �[36mstep:  8  �[32mloss:  9.5918  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 211,786  �[35mmfu: 2.59%�[39m
[rank1]:2024-04-01 17:54:53,412 - root - INFO - �[36mstep:  9  �[32mloss:  9.4299  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 125,833  �[35mmfu: 1.54%�[39m
[rank1]:[rank1]:[W401 17:54:53.242673953 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-04-01 17:54:53,412 - root - INFO - �[36mstep:  9  �[32mloss:  9.4299  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 125,765  �[35mmfu: 1.54%�[39m
[rank0]:[rank0]:[W401 17:54:53.240925776 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:2024-04-01 17:54:53,492 - root - INFO - �[36mstep: 10  �[32mloss:  9.2955  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 207,661  �[35mmfu: 2.54%�[39m
[rank0]:2024-04-01 17:54:53,492 - root - INFO - �[36mstep: 10  �[32mloss:  9.2955  �[33mmemory:  9.54GiB(10.03%)  �[34mwps: 207,426  �[35mmfu: 2.54%�[39m
[rank0]:NCCL version 2.20.5+cuda12.0
```

Reviewers:

Subscribers:

Tasks:

Tags:

---------

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

* remove folding and unfolding of sequence dim in model.py

ghstack-source-id: 5d299adcd766baad6a36e63be4acc01fb2fd36db
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/190
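
For readers unfamiliar with the phrasing, here is a tiny illustration (hypothetical shapes, not the actual model.py code) of the sequence-dim folding/unfolding that this change removes:

```python
import torch

bs, seq, dim = 2, 8, 16
x = torch.randn(bs, seq, dim)

folded = x.view(bs * seq, dim)        # "fold": merge the sequence dim into the batch dim
unfolded = folded.view(bs, seq, dim)  # "unfold": restore the original (bs, seq, dim) shape
assert torch.equal(x, unfolded)
```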

* bump comm.train_timeout_seconds (#189)

This PR bumps this default config to a larger value: profiling is a pretty heavy step, so a default of 5 seconds would likely trigger the watchdog unintentionally.
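
For reference, a hedged sketch of the kind of process-group timeout this config ultimately controls; the constant and call site below are illustrative, not the actual torchtrain wiring:

```python
from datetime import timedelta

import torch.distributed as dist

# A small timeout lets the NCCL watchdog fire during heavy steps (such as
# steps that also dump profiler traces), so the default is bumped.
TRAIN_TIMEOUT_SECONDS = 100  # illustrative value

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(seconds=TRAIN_TIMEOUT_SECONDS),
)
```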

* fix checkpoint parser

ghstack-source-id: 47ee7b5e2228705e5215195ac9ff13e1b168f93e
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/197

* support sequence of tests and add checkpoint test

address comments

ghstack-source-id: 7d6c51a5ef68dea06ba7d64741a554165c79f1d3
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/198

* Make freqs_cis a persistent buffer for pp init

Currently, the plan is to use a 'seed checkpoint' to initialize the
pipeline-parallel model chunks after moving them from meta device to
cuda/empty.

Non-persistent buffers are incompatible with this approach, as they are
missing from the checkpoint and thus require manual init.

An alternative is to manually run the initializer for just the
non-persistent buffers after loading a seed checkpoint, but making the
buffer persistent is nearly equivalent and requires fewer code changes.

ghstack-source-id: b48228488d4c3924fffef4237f4106383c14a934
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/201
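
A minimal sketch of the difference this amounts to; the class and helper below are illustrative stand-ins, not the actual model code:

```python
import torch
import torch.nn as nn


def precompute_freqs_cis_sketch(dim: int, end: int, theta: float = 10000.0) -> torch.Tensor:
    # Standard RoPE frequency table, included only to make the sketch runnable.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(end).float()
    return torch.polar(torch.ones(end, dim // 2), torch.outer(t, freqs))


class TransformerSketch(nn.Module):
    def __init__(self, dim: int, max_seq_len: int):
        super().__init__()
        freqs_cis = precompute_freqs_cis_sketch(dim, max_seq_len)
        # persistent=True puts the buffer in the state_dict, so a seed
        # checkpoint restores it; persistent=False would leave it out and
        # force a manual re-init after loading onto cuda/empty.
        self.register_buffer("freqs_cis", freqs_cis, persistent=True)
```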

* Delete grad scaler, which is unsupported/unused

The grad scaler currently doesn't work with FSDP2, and it isn't enabled
anyway because bf16 training is the norm and doesn't require it.

Remove it for simplicity. It will be easier to enable pipeline
parallelism with a simpler loss-function setup, but if desired, it's
still possible to support pipeline parallelism with the scaler added
back in.

ghstack-source-id: 82b0e4324eac88ee62723a6d832182d4e6c76e0f
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/202
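
A minimal sketch (hypothetical step functions, not the repo's training loop) of why the scaler only matters for fp16: bf16 keeps fp32's exponent range, so the loss can be backpropagated directly, while fp16 needs gradient scaling to avoid underflow:

```python
import torch

def fp16_step(model, inputs, labels, loss_fn, optimizer, scaler: torch.cuda.amp.GradScaler):
    # fp16 has a narrow dynamic range, so gradients are scaled before backward.
    with torch.autocast("cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

def bf16_step(model, inputs, labels, loss_fn, optimizer):
    # bf16 training needs no scaler: backpropagate the loss directly.
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```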

* Factor out loss_fn to share code with pipeline par

PP requires feeding a loss_fn into the schedule's step() so that the loss can
be computed per microbatch as part of the forward/backward scheduling.

As such, it is nice to define the loss once and use it both in the non-PP
code that manually calls forward/loss/backward and in the PP step().

ghstack-source-id: 9bedd5103e23627d5e268c287d49f0759442ba12
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/203
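
A sketch of the shared loss function described above; the names are illustrative, and the PP schedule is only referenced in a comment since its exact API is not shown here:

```python
import torch
import torch.nn.functional as F

def loss_fn(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # pred: (batch, seq, vocab), labels: (batch, seq)
    return F.cross_entropy(pred.flatten(0, 1), labels.flatten(0, 1))

def non_pp_train_step(model, input_ids, labels, optimizer) -> torch.Tensor:
    # Non-PP path: manually run forward / loss / backward with the same loss_fn.
    loss = loss_fn(model(input_ids), labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss

# PP path (conceptually): the very same loss_fn is handed to the pipeline
# schedule, which calls it per microbatch inside its step().
```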

* [TorchTrain] Minor fix for #197 (#204)

The changes made in the GitHub editor didn't go in when doing the ghstack land.

* Add FusedRMSNorm (Triton kernel, +15% eager), Add NPLayerNorm, Enable config selectable Norm Type (#181)

This PR has multiple aspects:
1 - Adds a new Triton-based fused RMSNorm I wrote. I've verified its
numerical accuracy on both forward and backward with a unit test.
It improves MFU by +15% with FSDP2 on the 7B model in eager, and slightly (+1.2%) when compiled:
<img width="545" alt="Screenshot 2024-03-29 at 5 18 14 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/8f16fae9-947b-4720-a370-b954779c33a7">

2 - Adds norms.py to house all 4 norm types, and standardizes the names to
[layernorm / np_layernorm / rmsnorm / fused_rmsnorm]. norms.py has a
create_norms function that creates the appropriate norm.

3 - Adds np_layernorm, which is layernorm with no affine transformation.

4 - Updates model.py to now support plug and play of any supported norm.

Thus instead of this type of if/then logic in the model class:
<img width="928" alt="Screenshot 2024-03-30 at 1 52 07 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/ba7cb976-580f-4471-a79b-a584f7d20693">

We simply have this:
<img width="1129" alt="Screenshot 2024-03-30 at 1 52 23 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/aba48b4d-1620-4059-840d-e620468f00f2">

This then allows for easy plug and play of any norm type with no
fiddling around in the model code.

5 - Updates run_llama_train.sh to randomly select a port instead of the
previous fixed port number. (Thanks @yifuwang for this tip!)


6 - Now users can quickly select the norm of their choice via the config
file:
<img width="774" alt="Screenshot 2024-03-30 at 3 01 43 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/3238b375-dc21-4ee2-a5fa-f6571da79edb">

7 - Adds a NotImplementedError if users try to run TP + fused_rmsnorm, to avoid
any confusion (per @tianyu-l feedback):
~~~
NotImplementedError: fused_rmsnorm not yet compatible with TP. Please
use rmsnorm.
~~~
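
To make the dispatch concrete, here is a minimal sketch of a factory like the create_norms function described above; the eager RMSNorm below is a stand-in (the real fused variant wraps the Triton kernel), and the exact names in norms.py may differ:

```python
import torch
import torch.nn as nn

class RMSNormSketch(nn.Module):
    # Eager RMSNorm used for both "rmsnorm" and, as a placeholder, "fused_rmsnorm".
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

def create_norm(norm_type: str, dim: int, eps: float = 1e-6) -> nn.Module:
    norm_type = norm_type.lower()
    if norm_type == "layernorm":
        return nn.LayerNorm(dim, eps=eps)
    if norm_type == "np_layernorm":
        # "np" = no parameters: LayerNorm without the affine transformation.
        return nn.LayerNorm(dim, eps=eps, elementwise_affine=False)
    if norm_type in ("rmsnorm", "fused_rmsnorm"):
        return RMSNormSketch(dim, eps=eps)
    raise NotImplementedError(f"Unknown norm_type: {norm_type}")
```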

* remove .item() per iter

ghstack-source-id: ab29c214604fd76cefdfe70149ecf07a2e03103e
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/206
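
A sketch (hypothetical loop and names) of the pattern this enables: keep per-step losses on device and pay the `.item()` device sync only at logging time:

```python
import torch

def train_loop(model, data_loader, loss_fn, optimizer, log_freq: int = 10):
    pending_losses = []
    for step, (input_ids, labels) in enumerate(data_loader, start=1):
        loss = loss_fn(model(input_ids), labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        pending_losses.append(loss.detach())  # stays on device: no per-iter sync
        if step % log_freq == 0:
            avg_loss = torch.stack(pending_losses).mean().item()  # one sync per log interval
            pending_losses.clear()
            print(f"step {step}  avg loss {avg_loss:.4f}")
```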

* Removed cache_k and cache_v comments

ghstack-source-id: 8bc66c683a801189b152b0ef4301579ec1ec17e7
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/213

* Some more cleanups

ghstack-source-id: a53cbbecc35eac2a62d8ebc241462ac418666336
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/212

* avoid record streams and make color printing a config

ghstack-source-id: 1c7cb2710330ec3fb2384793b5ad77c65b107cbc
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/195

* fix SAC to use the correct reduce_scatter op (#215)

As titled: we migrated to the native functional collective, so SAC
should capture that op instead of the old one.

* Test runner raises exception on failures (#216)

Summary: Test runner should raise an exception on failures.
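
A minimal sketch of the behavior described in the summary (illustrative runner function, not the actual test-runner code):

```python
import subprocess

def run_integration_test(flavor: str, cmd: str) -> None:
    print(f"=====Integration test, flavor : {flavor}, command : {cmd}=====")
    result = subprocess.run(cmd, shell=True)
    if result.returncode != 0:
        # Raise instead of silently continuing so CI marks the run as failed.
        raise RuntimeError(
            f"Integration test '{flavor}' failed with exit code {result.returncode}"
        )
```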

Test Plan: 

```
=====Integration test, flavor : , command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh  =====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ overrides=
+ '[' 0 -ne 0 ']'

=====Integration test, flavor : 1D compile, command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh --training.compile=====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=--training.compile
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ overrides=
+ '[' 1 -ne 0 ']'
+ overrides=--training.compile
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.compile
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757]
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] *****************************************
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] *****************************************
[rank0]:2024-04-10 13:32:45,243 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-04-10 13:32:45,676 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-04-10 13:32:46,028 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-04-10 13:32:46,030 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-04-10 13:32:46,038 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-04-10 13:32:46,038 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-04-10 13:32:47,813 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='fused_rmsnorm')
[rank0]:2024-04-10 13:32:47,826 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m
[rank0]:2024-04-10 13:32:47,826 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied FSDP to the model
[rank0]:2024-04-10 13:32:48,582 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%)
[rank0]:2024-04-10 13:32:48,582 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1332
[rank0]:2024-04-10 13:32:48,584 - root - INFO - Compiling model with torch.compile
[rank0]:2024-04-10 13:32:49,384 - root - INFO - Training starts at step 1
[rank0]:2024-04-10 13:32:49,385 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:[rank0]:W0410 13:32:49.487000 139672077292544 torch/_logging/_internal.py:1016] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:[rank0]: Traceback (most recent call last):
[rank0]:[rank0]:   File "/data/users/gnadathur/a/torchtitan/train.py", line 394, in <module>
[rank0]:[rank0]:     main(config)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
[rank0]:[rank0]:     return f(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/torchtitan/train.py", line 287, in main
[rank0]:[rank0]:     pred = model(input_ids)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:[rank0]:     return forward_call(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/eval_frame.py", line 410, in _fn
[rank0]:[rank0]:     return fn(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1582, in _call_impl
[rank0]:[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 966, in catch_errors
[rank0]:[rank0]:     return callback(frame, cache_entry, hooks, frame_state, skip=1)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 809, in _convert_frame
[rank0]:[rank0]:     result = inner_convert(
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 404, in _convert_frame_assert
[rank0]:[rank0]:     return _compile(
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_utils_internal.py", line 70, in wrapper_function
[rank0]:[rank0]:     return function(*args, **kwargs)
[rank0]:[rank0]:   File "/home/gnadathur/local/a/pytorch-env/lib/python3.10/contextlib.py", line 79, in inner
[rank0]:[rank0]:     return func(*args, **kwds)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 691, in _compile
[rank0]:[rank0]:     guarded_code = compile_inner(code, one_graph, hooks, transform)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
[rank0]:[rank0]:     r = func(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 546, in compile_inner
[rank0]:[rank0]:     out_code = transform_code_object(code, transform)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1103, in transform_code_object
[rank0]:[rank0]:     transformations(instructions, code_options)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 168, in _fn
[rank0]:[rank0]:     return fn(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 508, in transform
[rank0]:[rank0]:     tracer.run()
[rank0]:[rank0]:   File "/data/u…
```