Conversation

@zpcore (Contributor) commented Jul 24, 2025

(Split out from the larger PR #46)

Introduce the batch sharding strategy:

import functools

from torch.distributed.tensor._op_schema import RuntimeSchemaInfo
from autoparallel.dtensor_util.utils import batch_shard_strategy
from torch.distributed.tensor._ops.utils import register_op_strategy

# Create a strategy with input tensor 1 replicated and input tensor 2 sharded on dim 0;
# the output tensor is sharded on dim 0:
custom_shard_strategy = functools.partial(batch_shard_strategy, input_shard_dim=[None, 0], output_shard_dim=[0])
# Register the strategy:
register_op_strategy(new_op)(custom_shard_strategy)

For details, see the function docstring in autoparallel/dtensor_util/utils.py and the example usage in tests/test_dtensor.py.
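A rough sketch of where the RuntimeSchemaInfo import above comes in (my_custom_op is a placeholder op handle, and the schema_info argument is private DTensor API that may change): an op whose non-tensor args affect sharding can pass a RuntimeSchemaInfo so those args are hashed into the sharding-propagation cache key.

import functools

from torch.distributed.tensor._op_schema import RuntimeSchemaInfo
from torch.distributed.tensor._ops.utils import register_op_strategy

from autoparallel.dtensor_util.utils import batch_shard_strategy

# Sketch only: `my_custom_op` stands in for the op overload being registered.
# RuntimeSchemaInfo(static_argnum=2) marks positional args from index 2 onward as
# non-tensor args that must be part of the sharding-propagation cache key
# (private DTensor API, subject to change).
batched = functools.partial(
    batch_shard_strategy, input_shard_dim=[0, 0], output_shard_dim=[0]
)
register_op_strategy(my_custom_op, schema_info=RuntimeSchemaInfo(static_argnum=2))(batched)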

Stack from ghstack (oldest at bottom):

zpcore added a commit that referenced this pull request Jul 24, 2025
ghstack-source-id: dfdc089
Pull Request resolved: #50
@facebook-github-bot added the CLA Signed label Jul 24, 2025
zpcore added a commit that referenced this pull request Jul 25, 2025
ghstack-source-id: 696aa5e
Pull Request resolved: #50
zpcore added a commit that referenced this pull request Jul 25, 2025
ghstack-source-id: 6753840
Pull Request resolved: #50
@zpcore requested review from XilunWu, fmassa and wconstab July 25, 2025 00:14
zpcore added a commit that referenced this pull request Jul 25, 2025
ghstack-source-id: cc08134
Pull Request resolved: #50
def batch_shard_strategy(
op_schema: OpSchema,
input_shard_dim: list[Optional[int]],
output_shard_dim: list[Optional[int]],
Contributor

The terminology is confusing here. Is the shard dim the batch dim, or is it something else? If it is the batch dim, the comments below seem to imply there is only one batch dim, so why is this a list?

Contributor Author

It's the batch dim. There can be multiple tensor inputs; each element in input_shard_dim maps to one input tensor. The same applies to output_shard_dim and the output tensor(s).
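To make that mapping concrete, a minimal sketch (the op and the dims are hypothetical; one list entry per tensor, None meaning replicate):

import functools

from autoparallel.dtensor_util.utils import batch_shard_strategy

# Hypothetical op with three tensor inputs and two tensor outputs:
# shard input 0 on its dim 0, replicate input 1, shard input 2 on its dim 1;
# shard both outputs on their dim 0.
multi_tensor_strategy = functools.partial(
    batch_shard_strategy,
    input_shard_dim=[0, None, 1],
    output_shard_dim=[0, 0],
)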


# -------------define universal op strategy-------------
def batch_shard_strategy(
op_schema: OpSchema,
Contributor

Is there a reason this takes an OpSchema and not an OpStrategy?

Contributor Author

The type definition of OpSchema is quite unclear in DTensor. Here the OpSchema is a collection of OpStrategy objects, one for each input arg. For the output, most of the time there is just one output tensor, so a single OpStrategy is sufficient.
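For illustration, a minimal sketch of what such a strategy callback receives, assuming the private DTensor fields args_schema and strategies keep their current names:

from torch.distributed.tensor._op_schema import OpSchema, OpStrategy

def inspect_strategy_inputs(op_schema: OpSchema) -> None:
    # During sharding propagation each tensor argument shows up in
    # op_schema.args_schema as an OpStrategy listing its candidate placements;
    # non-tensor args (ints, scalars, ...) are passed through unchanged.
    for i, arg in enumerate(op_schema.args_schema):
        if isinstance(arg, OpStrategy):
            print(f"tensor arg {i}: {len(arg.strategies)} candidate placements")
        else:
            print(f"non-tensor arg {i}: {arg!r}")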

Note: It is the user's responsibility to make sure the sharded tensor for
processing is correct in shape.
"""
output_type = [str(ret.type) for ret in op_schema.op._schema.returns]
Contributor

Running str on the type, very suspicious!

@zpcore (Contributor Author) Jul 29, 2025

What is the recommended way to check the tensor output type?
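One possible alternative, sketched here only for reference and not settled in the thread (it assumes the private torch._C JIT type classes can be relied on), is to compare against TensorType instead of string-matching:

import torch

def returns_are_tensors(op: torch._ops.OpOverload) -> list[bool]:
    # Sketch: True for each return slot whose declared schema type is exactly Tensor.
    # Note: List[Tensor] or Optional[Tensor] returns would need extra handling.
    return [isinstance(ret.type, torch._C.TensorType) for ret in op._schema.returns]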

zpcore added a commit that referenced this pull request Jul 29, 2025
ghstack-source-id: 604f1b6
Pull Request resolved: #50
@zpcore changed the base branch from gh/zpcore/2/base to main July 29, 2025 17:36
@zpcore force-pushed the gh/zpcore/2/head branch 4 times, most recently from 01a8203 to b98c184 July 30, 2025 20:07
@zpcore (Contributor Author) commented Jul 30, 2025

Checking to see if there are any concerns about this PR. Or should we merge it and give it a try?

@zpcore force-pushed the gh/zpcore/2/head branch from b98c184 to bd5f109 July 30, 2025 21:07
@ezyang (Contributor) commented Jul 31, 2025

This is pretty reversible so I don't mind landing it and deciding what to do with it later. @zpcore I do hope this gets obsoleted by whatever we end up deciding to do with DTensor sharding though!

@wconstab (Contributor) left a comment

Let's land this and get some experience using the API for DeepSeek enablement.

@zpcore have you rebased? I'd like to at least kick off a job on llama3 mast to make sure nothing got broken.

@zpcore force-pushed the gh/zpcore/2/head branch from bd5f109 to cc28a95 July 31, 2025 23:00
@zpcore merged commit 385d06e into main Jul 31, 2025
6 checks passed
@zpcore deleted the gh/zpcore/2/head branch July 31, 2025 23:12