
merge upstream updates and adapt to the MPT fix in torch_spmd #5

Merged
17 commits merged into main from mpt on Jul 11, 2024

Conversation

tianyu-l (Owner)

No description provided.

awgu and others added 17 commits June 24, 2024 20:30
Set `record_shapes=True` for profiler
ghstack-source-id: 6f1ed49d15ce311f1bf118820965cdb5309a8030
Pull Request resolved: pytorch#419

Improved `repeat_kv` eager perf
ghstack-source-id: 39e484954814e61cdfb2ba661f0a98c83bc0ce60
Pull Request resolved: pytorch#418

Adding FSDP Memory Tracking and Estimation
ghstack-source-id: c8ed20fc585957bd164dd963307616a53991615d
Pull Request resolved: pytorch#425

Adding integration test for FSDP Memory Tracking and Estimation
ghstack-source-id: cc224db8951ec7a133fd769845a4765cbedc6454
Pull Request resolved: pytorch#426

by default disable heavy memory profiling
ghstack-source-id: cad7b3c41fd60ec19c0e6e7d058e8aa00602a187
Pull Request resolved: pytorch#430

Add the option to turn on async-TP
ghstack-source-id: 0a03379eeb3a63b2d1ad4dff84d0e61ca82b1bbf
Pull Request resolved: pytorch#429

Modifying memory estimation options and minor changes
ghstack-source-id: 5f09824cddaed6585cc094095e1e95dd070d76f4
Pull Request resolved: pytorch#435

add comment pointing to Sequence Parallel optimization example
ghstack-source-id: 6fa0dcd4bca876e10a6a8349283fb940a59ad234
Pull Request resolved: pytorch#438

switch float8 logic from Float8DynamicLinear to Float8Linear (pytorch#436)

Summary:

After pytorch-labs/float8_experimental#300,
`Float8Linear` with default settings is equivalent to
`Float8DynamicLinear`. This PR changes `torchtitan` to use
`Float8Linear`.

To better support the new UX of `float8_experimental`, I also switched
the `fp8_linear` configuration to a boolean that controls whether to swap the
linears. In the future we can add new options for configuring each linear
(scaling type, scaling granularity, etc.); that is left for a future PR.

Test Plan:

```
// run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs,
// verify performance and loss values do not change meaningfully between
// baseline and this PR

// baseline (before this PR)
// 1. compile, bf16
// 2. compile, float8
// 3. compile, float8, fsdp_fp8_allgather=True
// 4. compile, float8, fsdp_fp8_allgather=True, tp=2
// logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce

// experiment (this PR): repeat all of the above, but with Float8Linear
// logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631
```
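
To make the new boolean UX concrete, below is a minimal, self-contained sketch of gating a model-wide linear swap behind a single flag. `Float8LinearStub`, `swap_linears`, and `maybe_swap_to_float8` are hypothetical names used only for illustration; the actual change swaps in `Float8Linear` from `float8_experimental` rather than a stub.

```
# Illustrative sketch only: gate a module-wide nn.Linear swap behind a boolean flag.
# Float8LinearStub stands in for float8_experimental's Float8Linear so the example
# stays self-contained and runnable.
import torch.nn as nn


class Float8LinearStub(nn.Linear):
    """Placeholder for float8_experimental's Float8Linear."""


def swap_linears(module: nn.Module, linear_cls: type = Float8LinearStub) -> nn.Module:
    """Recursively replace every nn.Linear submodule with linear_cls, reusing its parameters."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            new_linear = linear_cls(
                child.in_features, child.out_features, bias=child.bias is not None
            )
            new_linear.weight = child.weight
            if child.bias is not None:
                new_linear.bias = child.bias
            setattr(module, name, new_linear)
        else:
            swap_linears(child, linear_cls)
    return module


def maybe_swap_to_float8(model: nn.Module, fp8_linear: bool) -> nn.Module:
    # fp8_linear is now a plain boolean: either swap every linear or leave the model alone.
    return swap_linears(model) if fp8_linear else model


model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
model = maybe_swap_to_float8(model, fp8_linear=True)
print(model)  # both Linear layers now show up as Float8LinearStub
```

Per-linear knobs such as scaling type and granularity could later be threaded through `linear_cls` or its constructor arguments, matching the note above about deferring that configuration to a future PR.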

Removed `_experimental_support_context_fn_in_torch_utils_checkpoint`
ghstack-source-id: 50b2d0c2b4c22e2f045cafd8630c16f3a8c6d35f
Pull Request resolved: pytorch#444

Reordered TP parallel plan to follow execution order
ghstack-source-id: b4924952adeb5f16d08b60faa54690762841c422
Pull Request resolved: pytorch#445

Made some stylistic changes to `apply_dp`
ghstack-source-id: fb78e9eb8aa406ba87d6ad6cf2229c1027dae42f
Pull Request resolved: pytorch#446

Refactored activation checkpointing
ghstack-source-id: 785c7e47651cda97ea22d0147d14b8d061ce042d
Pull Request resolved: pytorch#447

compiled RMSNorm
ghstack-source-id: c4efb81ec6acc5442955908cc376df3e6d889af3
Pull Request resolved: pytorch#442

Renamed parallel styles for transformer block weights
ghstack-source-id: 5fb0bf3d08cacf27242ec0f85d5dd3cdc03b739e
Pull Request resolved: pytorch#448

Added type annotations and more stylistic changes
ghstack-source-id: 1bd5b9d5abc8644785132f8eb2baaf8b1cfc5fb5
Pull Request resolved: pytorch#449
tianyu-l merged commit d8859ab into main on Jul 11, 2024
tianyu-l deleted the mpt branch on July 11, 2024 at 23:27
tianyu-l added a commit that referenced this pull request on Aug 16, 2024
* Set `record_shapes=True` for profiler

ghstack-source-id: 6f1ed49d15ce311f1bf118820965cdb5309a8030
Pull Request resolved: pytorch#419

* Improved `repeat_kv` eager perf

ghstack-source-id: 39e484954814e61cdfb2ba661f0a98c83bc0ce60
Pull Request resolved: pytorch#418

* Adding FSDP Memory Tracking and Estimation

ghstack-source-id: c8ed20fc585957bd164dd963307616a53991615d
Pull Request resolved: pytorch#425

* Adding integration test for FSDP Memory Tracking and Estimation

ghstack-source-id: cc224db8951ec7a133fd769845a4765cbedc6454
Pull Request resolved: pytorch#426

* by default disable heavy memory profiling

ghstack-source-id: cad7b3c41fd60ec19c0e6e7d058e8aa00602a187
Pull Request resolved: pytorch#430

* Add the option to turn on async-TP

ghstack-source-id: 0a03379eeb3a63b2d1ad4dff84d0e61ca82b1bbf
Pull Request resolved: pytorch#429

* Modifying memory estimation options and minor changes

ghstack-source-id: 5f09824cddaed6585cc094095e1e95dd070d76f4
Pull Request resolved: pytorch#435

* add comment pointing to Sequence Parallel optimization example

ghstack-source-id: 6fa0dcd4bca876e10a6a8349283fb940a59ad234
Pull Request resolved: pytorch#438

* switch float8 logic from Float8DynamicLinear to Float8Linear (pytorch#436)

Summary:

After pytorch-labs/float8_experimental#300,
`Float8Linear` with default settings is equivalent to
`Float8DynamicLinear`. This PR changes `torchtitan` to use
`Float8Linear`.

To better support the new UX of `float8_experimental`, I also switched
the `fp8_linear` configuration to a boolean that controls whether to swap the
linears. In the future we can add new options for configuring each linear
(scaling type, scaling granularity, etc.); that is left for a future PR.

Test Plan:

```
// run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs,
// verify performance and loss values do not change meaningfully between
// baseline and this PR

// baseline (before this PR)
// 1. compile, bf16
// 2. compile, float8
// 3. compile, float8, fsdp_fp8_allgather=True
// 4. compile, float8, fsdp_fp8_allgather=True, tp=2
// logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce

// experiment (this PR): repeat all of the above, but with Float8Linear
// logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631
```


* Removed `_experimental_support_context_fn_in_torch_utils_checkpoint`

ghstack-source-id: 50b2d0c2b4c22e2f045cafd8630c16f3a8c6d35f
Pull Request resolved: pytorch#444

* Reordered TP parallel plan to follow execution order

ghstack-source-id: b4924952adeb5f16d08b60faa54690762841c422
Pull Request resolved: pytorch#445

* Made some stylistic changes to `apply_dp`

ghstack-source-id: fb78e9eb8aa406ba87d6ad6cf2229c1027dae42f
Pull Request resolved: pytorch#446

* Refactored activation checkpointing

ghstack-source-id: 785c7e47651cda97ea22d0147d14b8d061ce042d
Pull Request resolved: pytorch#447

* compiled RMSNorm

ghstack-source-id: c4efb81ec6acc5442955908cc376df3e6d889af3
Pull Request resolved: pytorch#442

* Renamed parallel styles for transformer block weights

ghstack-source-id: 5fb0bf3d08cacf27242ec0f85d5dd3cdc03b739e
Pull Request resolved: pytorch#448

* Added type annotations and more stylistic changes

ghstack-source-id: 1bd5b9d5abc8644785132f8eb2baaf8b1cfc5fb5
Pull Request resolved: pytorch#449

---------

Co-authored-by: Andrew Gu <andgu@fb.com>
Co-authored-by: Sanket Jayant Purandare <sanketpurandare@meta.com>
Co-authored-by: Yifu Wang <yifu@fb.com>
Co-authored-by: Vasiliy Kuznetsov <vkuzo@users.noreply.github.com>
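
As a side note on the first commit above, `record_shapes=True` (pytorch#419) is the standard `torch.profiler` option that records input tensor shapes for each operator, which is what makes shape-grouped summaries possible. A minimal usage sketch, independent of torchtitan's actual profiler wiring:

```
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)

# record_shapes=True attaches input shapes to every recorded op
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

# grouping by input shape is only meaningful when shapes were recorded
print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total", row_limit=5))
```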
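Similarly, for the "compiled RMSNorm" commit (pytorch#442), here is a hedged sketch of compiling an RMSNorm module with `torch.compile`. The module below is a generic RMSNorm written for illustration; it is not torchtitan's implementation.

```
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Generic RMSNorm: x * rsqrt(mean(x^2) + eps) * weight."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return normed * self.weight


# torch.compile can fuse the norm's elementwise ops into fewer kernels
norm = torch.compile(RMSNorm(dim=4096))
out = norm(torch.randn(2, 16, 4096))
```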