
merge upstream updates and adapt to the MPT fix in torch_spmd #5

Merged
17 commits merged into main from mpt on Jul 11, 2024

Conversation

tianyu-l (Owner)

No description provided.

awgu and others added 17 commits June 24, 2024 20:30
Set `record_shapes=True` for profiler
ghstack-source-id: 6f1ed49d15ce311f1bf118820965cdb5309a8030
Pull Request resolved: pytorch#419

Improved `repeat_kv` eager perf
ghstack-source-id: 39e484954814e61cdfb2ba661f0a98c83bc0ce60
Pull Request resolved: pytorch#418

Adding FSDP Memory Tracking and Estimation
ghstack-source-id: c8ed20fc585957bd164dd963307616a53991615d
Pull Request resolved: pytorch#425

Adding integration test for FSDP Memory Tracking and Estimation
ghstack-source-id: cc224db8951ec7a133fd769845a4765cbedc6454
Pull Request resolved: pytorch#426

by default disable heavy memory profiling
ghstack-source-id: cad7b3c41fd60ec19c0e6e7d058e8aa00602a187
Pull Request resolved: pytorch#430

Add the option to turn on async-TP
ghstack-source-id: 0a03379eeb3a63b2d1ad4dff84d0e61ca82b1bbf
Pull Request resolved: pytorch#429

Modifying memory estimation options and minor changes
ghstack-source-id: 5f09824cddaed6585cc094095e1e95dd070d76f4
Pull Request resolved: pytorch#435

add comment pointing to Sequence Parallel optimization example
ghstack-source-id: 6fa0dcd4bca876e10a6a8349283fb940a59ad234
Pull Request resolved: pytorch#438

switch float8 logic from Float8DynamicLinear to Float8Linear (pytorch#436)

Summary:

After pytorch-labs/float8_experimental#300,
`Float8Linear` with default settings is equivalent to
`Float8DynamicLinear`. This PR changes `torchtitan` to use
`Float8Linear`.

To better support the new UX of `float8_experimental`, I also switched
the `fp8_linear` configuration to a boolean that controls whether to swap the
linears. In the future we can add new options for configuring each linear
(scaling type, scaling granularity, etc.); that is left for a future PR.

Test Plan:

```
// run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs,
// verify performance and loss values do not change meaningfully between
// baseline and this PR

// baseline (before this PR)
// 1. compile, bf16
// 2. compile, float8
// 3. compile, float8, fsdp_fp8_allgather=True
// 4. compile, float8, fsdp_fp8_allgather=True, tp=2
// logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce

// experiment (this PR): repeat all of the above, but with Float8Linear
// logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631
```
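
To make the new boolean UX concrete, below is a minimal, self-contained sketch of gating a model-wide linear swap behind a single flag. `Float8LinearStub`, `swap_linears`, and `maybe_swap_to_float8` are hypothetical names used only for illustration; the actual change swaps in `Float8Linear` from `float8_experimental` rather than a stub.

```
# Illustrative sketch only: gate a module-wide nn.Linear swap behind a boolean flag.
# Float8LinearStub stands in for float8_experimental's Float8Linear so the example
# stays self-contained and runnable.
import torch.nn as nn


class Float8LinearStub(nn.Linear):
    """Placeholder for float8_experimental's Float8Linear."""


def swap_linears(module: nn.Module, linear_cls: type = Float8LinearStub) -> nn.Module:
    """Recursively replace every nn.Linear submodule with linear_cls, reusing its parameters."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            new_linear = linear_cls(
                child.in_features, child.out_features, bias=child.bias is not None
            )
            new_linear.weight = child.weight
            if child.bias is not None:
                new_linear.bias = child.bias
            setattr(module, name, new_linear)
        else:
            swap_linears(child, linear_cls)
    return module


def maybe_swap_to_float8(model: nn.Module, fp8_linear: bool) -> nn.Module:
    # fp8_linear is now a plain boolean: either swap every linear or leave the model alone.
    return swap_linears(model) if fp8_linear else model


model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
model = maybe_swap_to_float8(model, fp8_linear=True)
print(model)  # both Linear layers now show up as Float8LinearStub
```

Per-linear knobs such as scaling type and granularity could later be threaded through `linear_cls` or its constructor arguments, matching the note above about deferring that configuration to a future PR.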

Removed `_experimental_support_context_fn_in_torch_utils_checkpoint`
ghstack-source-id: 50b2d0c2b4c22e2f045cafd8630c16f3a8c6d35f
Pull Request resolved: pytorch#444

Reordered TP parallel plan to follow execution order
ghstack-source-id: b4924952adeb5f16d08b60faa54690762841c422
Pull Request resolved: pytorch#445

Made some stylistic changes to `apply_dp`
ghstack-source-id: fb78e9eb8aa406ba87d6ad6cf2229c1027dae42f
Pull Request resolved: pytorch#446

Refactored activation checkpointing
ghstack-source-id: 785c7e47651cda97ea22d0147d14b8d061ce042d
Pull Request resolved: pytorch#447

compiled RMSNorm
ghstack-source-id: c4efb81ec6acc5442955908cc376df3e6d889af3
Pull Request resolved: pytorch#442

Renamed parallel styles for transformer block weights
ghstack-source-id: 5fb0bf3d08cacf27242ec0f85d5dd3cdc03b739e
Pull Request resolved: pytorch#448

Added type annotations and more stylistic changes
ghstack-source-id: 1bd5b9d5abc8644785132f8eb2baaf8b1cfc5fb5
Pull Request resolved: pytorch#449
tianyu-l merged commit d8859ab into main on Jul 11, 2024
tianyu-l deleted the mpt branch on July 11, 2024 at 23:27
tianyu-l added a commit that referenced this pull request on Aug 16, 2024
* Set `record_shapes=True` for profiler

ghstack-source-id: 6f1ed49d15ce311f1bf118820965cdb5309a8030
Pull Request resolved: pytorch#419

* Improved `repeat_kv` eager perf

ghstack-source-id: 39e484954814e61cdfb2ba661f0a98c83bc0ce60
Pull Request resolved: pytorch#418

* Adding FSDP Memory Tracking and Estimation

ghstack-source-id: c8ed20fc585957bd164dd963307616a53991615d
Pull Request resolved: pytorch#425

* Adding integration test for FSDP Memory Tracking and Estimation

ghstack-source-id: cc224db8951ec7a133fd769845a4765cbedc6454
Pull Request resolved: pytorch#426

* by default disable heavy memory profiling

ghstack-source-id: cad7b3c41fd60ec19c0e6e7d058e8aa00602a187
Pull Request resolved: pytorch#430

* Add the option to turn on async-TP

ghstack-source-id: 0a03379eeb3a63b2d1ad4dff84d0e61ca82b1bbf
Pull Request resolved: pytorch#429

* Modifying memory estimation options and minor changes

ghstack-source-id: 5f09824cddaed6585cc094095e1e95dd070d76f4
Pull Request resolved: pytorch#435

* add comment pointing to Sequence Parallel optimization example

ghstack-source-id: 6fa0dcd4bca876e10a6a8349283fb940a59ad234
Pull Request resolved: pytorch#438

* switch float8 logic from Float8DynamicLinear to Float8Linear (pytorch#436)

Summary:

After pytorch-labs/float8_experimental#300,
`Float8Linear` with default settings is equivalent to
`Float8DynamicLinear`. This PR changes `torchtitan` to use
`Float8Linear`.

To better support the new UX of `float8_experimental`, I also switched
the `fp8_linear` configuration to a boolean that controls whether to swap the
linears. In the future we can add new options for configuring each linear
(scaling type, scaling granularity, etc.); that is left for a future PR.

Test Plan:

```
// run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs,
// verify performance and loss values do not change meaningfully between
// baseline and this PR

// baseline (before this PR)
// 1. compile, bf16
// 2. compile, float8
// 3. compile, float8, fsdp_fp8_allgather=True
// 4. compile, float8, fsdp_fp8_allgather=True, tp=2
// logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce

// experiment (this PR): repeat all of the above, but with Float8Linear
// logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631
```


* Removed `_experimental_support_context_fn_in_torch_utils_checkpoint`

ghstack-source-id: 50b2d0c2b4c22e2f045cafd8630c16f3a8c6d35f
Pull Request resolved: pytorch#444

* Reordered TP parallel plan to follow execution order

ghstack-source-id: b4924952adeb5f16d08b60faa54690762841c422
Pull Request resolved: pytorch#445

* Made some stylistic changes to `apply_dp`

ghstack-source-id: fb78e9eb8aa406ba87d6ad6cf2229c1027dae42f
Pull Request resolved: pytorch#446

* Refactored activation checkpointing

ghstack-source-id: 785c7e47651cda97ea22d0147d14b8d061ce042d
Pull Request resolved: pytorch#447

* compiled RMSNorm

ghstack-source-id: c4efb81ec6acc5442955908cc376df3e6d889af3
Pull Request resolved: pytorch#442

* Renamed parallel styles for transformer block weights

ghstack-source-id: 5fb0bf3d08cacf27242ec0f85d5dd3cdc03b739e
Pull Request resolved: pytorch#448

* Added type annotations and more stylistic changes

ghstack-source-id: 1bd5b9d5abc8644785132f8eb2baaf8b1cfc5fb5
Pull Request resolved: pytorch#449

---------

Co-authored-by: Andrew Gu <andgu@fb.com>
Co-authored-by: Sanket Jayant Purandare <sanketpurandare@meta.com>
Co-authored-by: Yifu Wang <yifu@fb.com>
Co-authored-by: Vasiliy Kuznetsov <vkuzo@users.noreply.github.com>
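
As a side note on the first commit above, `record_shapes=True` (pytorch#419) is the standard `torch.profiler` option that records input tensor shapes for each operator, which is what makes shape-grouped summaries possible. A minimal usage sketch, independent of torchtitan's actual profiler wiring:

```
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)

# record_shapes=True attaches input shapes to every recorded op
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

# grouping by input shape is only meaningful when shapes were recorded
print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total", row_limit=5))
```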
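Similarly, for the "compiled RMSNorm" commit (pytorch#442), here is a hedged sketch of compiling an RMSNorm module with `torch.compile`. The module below is a generic RMSNorm written for illustration; it is not torchtitan's implementation.

```
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Generic RMSNorm: x * rsqrt(mean(x^2) + eps) * weight."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return normed * self.weight


# torch.compile can fuse the norm's elementwise ops into fewer kernels
norm = torch.compile(RMSNorm(dim=4096))
out = norm(torch.randn(2, 16, 4096))
```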