
Conversation


@XilunWu XilunWu commented Oct 30, 2024

Stack from ghstack (oldest at bottom):

**Summary**
pytorch/pytorch#138945 fixes DeviceMesh access on a flattened mesh that is constructed from more than two meshes. Refer to the fix PR for details if interested.

In #592 we avoided this issue by calling `_flatten` instead of directly accessing the flattened mesh. Now that the fix has been merged in PyTorch, we switch back to direct mesh access, which is more straightforward.

XilunWu added a commit that referenced this pull request Oct 30, 2024
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 30, 2024
-    if parallel_dims.cp_enabled
-    else world_mesh[dp_mesh_dim_names]
-)
+dp_mesh = world_mesh["dp_cp"] if parallel_dims.cp_enabled else world_mesh["dp"]
Collaborator

Is this new DeviceMesh functionality that reacts specifically to `<name1>_<name2>`?

Contributor Author

This is not new: DeviceMesh has supported `world_mesh["<name1>_<name2>"]` ever since the `_flatten` behavior was implemented. However, it had a bug: if the flattened mesh is constructed from 3+ mesh dimensions (e.g. `dp_cp` flattened from `dp_shard`, `dp_replicate`, and `cp`), accessing `world_mesh["dp_cp"]` throws an error, which breaks 3D/4D/5D composability.
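To illustrate just the size semantics of flattening, here is a pure-Python sketch (no torch): the dict stand-in for DeviceMesh, the helper name, and the example sizes are hypothetical, but the rule is the one discussed above: the flattened dimension's size is the product of its source dimensions' sizes, no matter how many there are.

```python
import math

def flatten_dims(mesh_shape: dict, dims: list, new_name: str) -> dict:
    # Replace the given dims with one flattened dim whose size is the
    # product of their sizes -- e.g. flattening dp_shard, dp_replicate,
    # and cp (3 source dims) into "dp_cp", the 3+-dim case that used to
    # break direct access before pytorch/pytorch#138945.
    out = {k: v for k, v in mesh_shape.items() if k not in dims}
    out[new_name] = math.prod(mesh_shape[d] for d in dims)
    return out

shape = {"dp_replicate": 2, "dp_shard": 2, "cp": 2, "tp": 4}
flat = flatten_dims(shape, ["dp_replicate", "dp_shard", "cp"], "dp_cp")
# flat now maps "dp_cp" to 8 alongside the untouched "tp" dim; indexing
# flat["dp_cp"] mirrors the access pattern this PR restores
```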

Contributor

Can we catch the error and ask users to update to some version?

Contributor

For my understanding: for dp, if HSDP is enabled, "dp" is the flattened mesh of "dp_replicate" and "dp_shard", right? Otherwise, "dp" is just "dp_shard".

Contributor Author

@wz337, that's right. To summarize:

  1. FSDP: the only dp dimension in the mesh is "dp"
  2. DDP: the only dp dimension in the mesh is "dp"
  3. HSDP: the basic dp dimensions in the mesh are "dp_shard" and "dp_replicate", which are later flattened into "dp"
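The three cases above can be sketched as a tiny helper. The function name is hypothetical; the mesh-dim names follow the list above.

```python
def dp_dim_names(dp_replicate_enabled: bool) -> list:
    """Basic data-parallel mesh dims before any flattening (sketch)."""
    # HSDP: two basic dp dims, later flattened into a single "dp" dim
    if dp_replicate_enabled:
        return ["dp_replicate", "dp_shard"]
    # FSDP or DDP: "dp" is the only data-parallel dim from the start
    return ["dp"]
```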

@XilunWu XilunWu requested review from fegin, tianyu-l and wz337 October 30, 2024 22:10

@tianyu-l tianyu-l left a comment


lgtm

-    if parallel_dims.cp_enabled
-    else world_mesh[dp_mesh_dim_names]
-)
+dp_mesh = world_mesh["dp_cp"] if parallel_dims.cp_enabled else world_mesh["dp"]

Contributor

@fegin fegin left a comment


It's better to have a try-except to indicate that users are not using the latest PyTorch.

-    if parallel_dims.cp_enabled
-    else world_mesh[dp_mesh_dim_names]
-)
+dp_mesh = world_mesh["dp_cp"] if parallel_dims.cp_enabled else world_mesh["dp"]


XilunWu commented Oct 30, 2024

> It's better to have a try-except to indicate users are not using the latest PyTorch.

Oh yeah that's right...
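A sketch of the suggested guard: the function name, the caught exception types, and the error message are assumptions, and `world_mesh` here is any mapping-style mesh (a plain dict stands in for DeviceMesh in the usage below).

```python
def get_dp_mesh(world_mesh, cp_enabled: bool):
    """Sketch of the suggested version guard (names are hypothetical).

    PyTorch builds without pytorch/pytorch#138945 raise when indexing a
    mesh flattened from 3+ dimensions, so catch the failure and point the
    user to an upgrade instead of surfacing an opaque error.
    """
    name = "dp_cp" if cp_enabled else "dp"
    try:
        return world_mesh[name]
    except (KeyError, RuntimeError) as exc:
        raise RuntimeError(
            f"accessing world_mesh['{name}'] failed; flattened-mesh access "
            "requires a PyTorch build that includes "
            "pytorch/pytorch#138945 -- please update PyTorch"
        ) from exc
```

With a dict standing in for the mesh, a present `"dp_cp"` entry is returned as-is, and a missing one raises the upgrade hint instead of a bare `KeyError`.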

XilunWu added a commit that referenced this pull request Oct 31, 2024
@XilunWu XilunWu merged commit 3653bf2 into gh/XilunWu/9/base Oct 31, 2024
5 checks passed
XilunWu added a commit that referenced this pull request Oct 31, 2024
XilunWu added a commit that referenced this pull request Oct 31, 2024
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #667

Note: This PR is a reland of #666, which was mistakenly merged into the wrong branch.

mori360 pushed a commit to mori360/torchtitan that referenced this pull request Nov 26, 2024