Enable CP #433
Conversation
This PR adds experimental flags and functions to enable context parallelism. We currently support only FSDP + CP and CP only. CP + TP is being tested.
ghstack-source-id: d57fcdae2fdc2481722471d8d4efbb4f416fe396
Pull Request resolved: #433
reshard_after_forward = (
    int(layer_id) < len(model.layers) - 1 and not parallel_dims.pp_enabled
)
fully_shard(
Is this PR still a WIP? It seems like this just does the exact same thing as applying FSDP.
This PR is already working. CP requires FSDP (or DDP) to do parameter reduction for all parameters. The other CP-related calls are enable_context_parallel() and context_parallel_buffers.
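Below is a minimal single-process sketch of why that parameter reduction is needed; it illustrates the math only and does not use the PR's experimental API. With CP the weights stay replicated while activations are sharded along the sequence dimension, so each CP rank only produces the gradient contribution of its sequence shard, and those contributions must be summed, which is exactly the reduction FSDP/DDP already performs.

```python
import torch

torch.manual_seed(0)
w = torch.randn(8, 8, requires_grad=True)   # replicated "parameter"
x = torch.randn(4, 16, 8)                   # (batch, seq, dim) activations

# Gradient from the full sequence.
(x @ w).sum().backward()
full_grad = w.grad.clone()
w.grad = None

# Gradient contribution of each "CP rank", computed from its sequence shard.
shard_grads = []
for shard in x.chunk(2, dim=1):             # shard along the sequence dim
    (shard @ w).sum().backward()
    shard_grads.append(w.grad.clone())
    w.grad = None

# Summing (reducing) the per-shard gradients recovers the full gradient.
assert torch.allclose(full_grad, sum(shard_grads))
```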
Ohhh yes, that makes sense.
Could the application of fully_shard to the model be factored out, so that if we change the "wrapping" we do not need to make changes in two places?
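For illustration, one way that factoring could look. apply_fsdp and its arguments are hypothetical, not torchtitan's code, and the fully_shard import path may differ across PyTorch versions.

```python
from torch.distributed._composable.fsdp import fully_shard  # path may differ by version


def apply_fsdp(model, fsdp_config: dict, pp_enabled: bool):
    """Single place that owns the FSDP "wrapping" of the model."""
    for layer_id, transformer_block in model.layers.items():
        # Match the inline code above: the last layer keeps its parameters
        # unsharded after forward, and PP disables resharding entirely.
        reshard_after_forward = (
            int(layer_id) < len(model.layers) - 1 and not pp_enabled
        )
        fully_shard(
            transformer_block,
            **fsdp_config,
            reshard_after_forward=reshard_after_forward,
        )
    fully_shard(model, **fsdp_config, reshard_after_forward=not pp_enabled)
    return model
```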
dp_mesh = world_mesh["dp"] if world_mesh.ndim > 1 else world_mesh
if parallel_dims.cp_enabled:
    # Manually create another device mesh for now as we don't support
    # submesh flattening/reshape yet.
cc @wz337
train.py
Outdated
del pred
loss.backward()
with context_parallel_ctx(
    buffers=[input_ids, labels, model.freqs_cis],
Context parallelism shards on the sequence dimension of the data. Other buffers that are sharded along the sequence dimension also need to be adjusted accordingly. In this case, freqs_cis has to be sharded along the sequence dimension.
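As a sketch of the idea (shapes and sequence dims below are illustrative assumptions, and the contiguous chunking stands in for whatever sharding scheme CP actually uses): every buffer laid out along the sequence dimension must be sliced consistently with the tokens, including freqs_cis.

```python
import torch

cp_degree, cp_rank = 2, 0
batch, seq_len, head_dim = 4, 16, 8

input_ids = torch.randint(0, 100, (batch, seq_len))   # sharded on dim 1
labels = torch.randint(0, 100, (batch, seq_len))      # sharded on dim 1
freqs_cis = torch.randn(seq_len, head_dim // 2)       # sharded on dim 0


def shard_along_seq(buf: torch.Tensor, seq_dim: int) -> torch.Tensor:
    # Keep only this CP rank's slice of the sequence dimension.
    return buf.chunk(cp_degree, dim=seq_dim)[cp_rank]


local_inputs = shard_along_seq(input_ids, seq_dim=1)
local_labels = shard_along_seq(labels, seq_dim=1)
local_freqs = shard_along_seq(freqs_cis, seq_dim=0)

# Positions and rotary frequencies stay aligned only if all three agree.
assert local_inputs.shape[1] == local_labels.shape[1] == local_freqs.shape[0]
```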
torchtitan/parallelisms/__init__.py
Outdated
    dp * tp * pp == self.world_size
), f"Invalid parallel dims: dp({dp}) * tp({tp}) * pp({pp}) != WORLD_SIZE({self.world_size})"
assert dp * cp * tp * pp == self.world_size, (
    f"Invalid parallel dims: dp({dp}) * cp ({cp}) * tp({tp}) * pp({pp}) "
nit: for consistency
f"Invalid parallel dims: dp({dp}) * cp ({cp}) * tp({tp}) * pp({pp}) "
f"Invalid parallel dims: dp({dp}) * cp({cp}) * tp({tp}) * pp({pp}) "
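For context, a small standalone sketch of the invariant this assert enforces. The dataclass is illustrative, not torchtitan's ParallelDims, and treating dp == -1 as "infer from world size" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Dims:
    dp: int
    cp: int
    tp: int
    pp: int
    world_size: int

    def validate(self) -> None:
        if self.dp == -1:
            # Infer the data-parallel degree from the remaining dimensions.
            self.dp = self.world_size // (self.cp * self.tp * self.pp)
        assert self.dp * self.cp * self.tp * self.pp == self.world_size, (
            f"Invalid parallel dims: dp({self.dp}) * cp({self.cp}) * tp({self.tp}) "
            f"* pp({self.pp}) != WORLD_SIZE({self.world_size})"
        )


dims = Dims(dp=-1, cp=2, tp=2, pp=1, world_size=8)
dims.validate()
assert dims.dp == 2  # inferred so that dp * cp * tp * pp == 8
```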
reshard_after_forward = (
    int(layer_id) < len(model.layers) - 1 and not parallel_dims.pp_enabled
)
fully_shard(
Could the application of fully_shard to the model be factored out, so that if we change the "wrapping" we do not need to make changes in two places?
model = fully_shard(
    model, **fsdp_config, reshard_after_forward=not parallel_dims.pp_enabled
)
nit (not from this PR): We should probably change this in the original place too:
torchtitan/torchtitan/parallelisms/parallelize_llama.py
Lines 537 to 539 in 81012d1
model = fully_shard(
    model, **fsdp_config, reshard_after_forward=not parallel_dims.pp_enabled
)
We do not need the model = fully_shard(model, ...) and can just call fully_shard(model, ...).
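A tiny sketch of that point, assuming FSDP2's fully_shard (the import path varies across PyTorch versions, and this would be launched under torchrun with a process group already initialized): fully_shard installs its state on the module it receives and hands the same object back, so the re-assignment is redundant.

```python
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard  # path may differ by version

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
returned = fully_shard(model)   # wraps `model` in place
assert returned is model        # so a bare `fully_shard(model, ...)` suffices
```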
train.py
Outdated
# need to free to before bwd to avoid peaking memory
del pred
loss.backward()
with context_parallel_ctx(
nit for discussion: These nested context managers lead to a lot of indentation and then, combined with the formatter, lead to a lot of verticality. I wonder if we should fold context_parallel_ctx into train_context, since it should not hurt to make train_context have a larger scope?
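One possible shape for that, sketched with contextlib. The names train_context and context_parallel_ctx follow the discussion above, but the exact torchtitan signatures are assumptions; an ExitStack-based train_context can absorb the optional CP context so the loop body keeps a single indentation level.

```python
import contextlib

from torch.distributed.tensor.parallel import loss_parallel


@contextlib.contextmanager
def train_context(enable_loss_parallel: bool, cp_context=None):
    """Single context for the training step; the CP context is optional."""
    with contextlib.ExitStack() as stack:
        if enable_loss_parallel:
            stack.enter_context(loss_parallel())
        if cp_context is not None:
            stack.enter_context(cp_context)  # e.g. a context_parallel_ctx(...) object
        yield


# Usage sketch:
# with train_context(parallel_dims.loss_parallel_enabled, cp_ctx):
#     pred = model(input_ids)
#     ...
```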
# submesh flattening/reshape yet.
dp_mesh = init_device_mesh(
    world_mesh.device_type,
    (parallel_dims.dp * parallel_dims.cp,),
I did not follow this exactly. What happens when we call init_device_mesh with a mesh shape that does not cover the global world? For example, suppose we are composing FSDP + TP + CP. How does this init_device_mesh call know to combine the ranks from the CP mesh and FSDP mesh while accounting for the existence of a TP mesh?
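To make the concern concrete, here is a single-process, pure-Python illustration (no real mesh is created, and the row-major ["dp", "cp", "tp"] layout is an assumption): the ranks that should form one flattened dp x cp group are strided by the TP degree, whereas init_device_mesh(device_type, (dp * cp,)) by itself only describes a shape, not which global ranks it should contain.

```python
import itertools

dp, cp, tp = 2, 2, 2  # 8-GPU world, FSDP + CP + TP


# Global rank of coordinate (d, c, t) under a row-major ["dp", "cp", "tp"] mesh.
def global_rank(d: int, c: int, t: int) -> int:
    return (d * cp + c) * tp + t


# For each TP index, the dp x cp group over which FSDP must reduce gradients.
for t in range(tp):
    group = [global_rank(d, c, t) for d, c in itertools.product(range(dp), range(cp))]
    print(f"tp={t}: flattened dp x cp ranks = {group}")
# tp=0: flattened dp x cp ranks = [0, 2, 4, 6]
# tp=1: flattened dp x cp ranks = [1, 3, 5, 7]
```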
This PR adds experimental flags and functions to enable context parallelism. We currently support only FSDP + CP and CP only. CP + TP is being tested. [ghstack-poisoned]
ad945df to bc51495
Closed in favor of #592.
Stack from ghstack (oldest at bottom):
This PR adds experimental flags and functions to enable context parallelism. We currently support only FSDP + CP and CP only. CP + TP is being tested.