
Conversation


@zpcore commented Jul 24, 2025

(Split out from the large PR #46)
Support the implicit replication fallback strategy.

How to use the implicit replication fallback:

```python
from autoparallel.dtensor_util import strategy_pool

with strategy_pool.replicate_for_unsupported_operators():
    ...  # (missing ops will use the replicated strategy if possible)
```

Note: StrategyPool now reuses `_op_dispatcher.sharding_propagator.op_strategy_funcs` / `op_to_rules` / `op_to_schema_info` by reference.

Stack from ghstack (oldest at bottom):

[ghstack-poisoned]
zpcore added a commit that referenced this pull request Jul 24, 2025
ghstack-source-id: 22de7f1
Pull Request resolved: #49
@facebook-github-bot added the CLA Signed label Jul 24, 2025
@zpcore changed the title from "Support of explicit fallback" to "Support of implicit fallback" Jul 24, 2025

zpcore commented Jul 24, 2025

Had an offline discussion regarding #46 (comment): since op_strategy_context only exists in the upstream test code base, we will use replicate_for_unsupported_operators in this PR. We can remove it once op_strategy_context is available.

@zpcore requested review from XilunWu, ezyang, fmassa and wconstab July 25, 2025 00:14
replicate_op_strategy = torch.distributed.tensor._ops.utils.replicate_op_strategy


class StrategyPool:
Contributor

My question would be, if we have the context manager above, do we actually need a StrategyPool class that maintains copies of the dtensor registries? We should probably pick one approach or the other. If we use the context manager, then a way to keep track of it here could be to use an ExitStack as I mentioned in #46
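
A minimal sketch of the ExitStack idea mentioned above, assuming `register_op_strategy` and `replicate_op_strategy` from the `torch.distributed.tensor._ops.utils` module shown in the diff and the `_op_dispatcher.sharding_propagator` attributes named in the PR description; `_temporarily_replicate` and `missing_ops` are hypothetical names used only for illustration:

```python
from contextlib import ExitStack, contextmanager

import torch
from torch.distributed.tensor import DTensor
from torch.distributed.tensor._ops.utils import (
    register_op_strategy,
    replicate_op_strategy,
)


@contextmanager
def _temporarily_replicate(op):
    # Register the replicate strategy for `op`, then drop it from the
    # propagator's registry on exit so the registration does not leak.
    propagator = DTensor._op_dispatcher.sharding_propagator
    register_op_strategy(op)(replicate_op_strategy)
    try:
        yield
    finally:
        propagator.op_strategy_funcs.pop(op, None)


# The ExitStack accumulates one registration per missing op and unwinds
# them all when the block ends.
missing_ops = [torch.ops.aten.bucketize.Tensor]  # placeholder list
with ExitStack() as stack:
    for op in missing_ops:
        stack.enter_context(_temporarily_replicate(op))
    ...  # run whatever needs the fallback strategies here
```

With this shape the fallback registrations disappear on exit, matching the context-manager behavior described in the PR description, and no separate registry copy is needed.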

Contributor Author

Good point! I removed the StrategyPool, now the structure is simpler.



@contextmanager
Contributor

If we have this in the above utility file we can delete it from here right?

Contributor Author

Yes, if we can upstream `with_implicit_strategies`.

else:
    # No stack available, just register permanently
    register_op_strategy(op)(replicate_op_strategy)
Contributor

I'm confused. Won't this register the op into dtensor itself? But above we are checking whether the op is registered in our COPY of dtensor's registry, and I don't see us updating our copy. Should we just delete our copy and use this approach?

Contributor Author

`self.op_strategy_funcs` in StrategyPool is a reference to the upstream `op_strategy_funcs`, not a copy. Let me remove the reference and use the upstream `op_strategy_funcs` directly to make that clear.
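
A minimal sketch of the reference-versus-copy distinction being discussed, assuming the `_op_dispatcher.sharding_propagator.op_strategy_funcs` attribute path named in the PR description:

```python
from torch.distributed.tensor import DTensor

propagator = DTensor._op_dispatcher.sharding_propagator

# Reference: both names point at the same dict, so a strategy registered
# through either one is immediately visible to the other.
op_strategy_funcs_ref = propagator.op_strategy_funcs
assert op_strategy_funcs_ref is propagator.op_strategy_funcs

# Copy: a snapshot that silently drifts out of sync as soon as either
# side registers a new op strategy.
op_strategy_funcs_copy = dict(propagator.op_strategy_funcs)
assert op_strategy_funcs_copy is not propagator.op_strategy_funcs
```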

out_strat = get_op_strategy(op, op_schema)
Contributor

Do this in its own refactor and land it asap?

Contributor Author

Should I merge this PR first so that we can quickly play with batch sharding strategy?

# replication strategy fallback.
class CustomShardingPropagator(
    torch.distributed.tensor._sharding_prop.ShardingPropagator
):
Contributor

I'm generally down on out-of-core things like this that are very closely entwined with internal implementation details of another library we're relying on: it is unlikely that these APIs have any test coverage in pytorch, which means we're more likely to accidentally break autoparallel through otherwise safe refactoring changes. I haven't thought closely enough about what a good architecture looks like, but our default should be to make autoparallel rely only on public APIs and move anything that needs close coordination into pytorch core. (I'm OK with landing stuff in autoparallel on a temporary basis with a clear understanding that it needs to go into core.)

Contributor Author

Hah, this sounds like we need a test for the test. This is used to help quickly test strategy correctness under eager mode.

@fmassa merged commit 8df62c4 into gh/zpcore/1/base Jul 29, 2025
5 checks passed

fmassa commented Jul 29, 2025

Damn, I merged a gh-stack PR...

zpcore added a commit that referenced this pull request Jul 29, 2025
ghstack-source-id: f0db91b
Pull Request resolved: #49