
Conversation


@zpcore commented Jul 24, 2025

(Split out from the large PR #46)
Support the implicit replication fallback strategy.

How to use the implicit replication fallback:

```python
from autoparallel.dtensor_util import strategy_pool

with strategy_pool.replicate_for_unsupported_operators():
    ...  # (missing ops will use the replicated strategy if possible)
```

Note: StrategyPool now reuses `_op_dispatcher.sharding_propagator.op_strategy_funcs` / `op_to_rules` / `op_to_schema_info` by reference.

Stack from ghstack (oldest at bottom):

[ghstack-poisoned]
zpcore added a commit that referenced this pull request Jul 24, 2025
ghstack-source-id: 22de7f1
Pull Request resolved: #49
@facebook-github-bot added the CLA Signed label Jul 24, 2025
@zpcore changed the title from "Support of explicit fallback" to "Support of implicit fallback" Jul 24, 2025

zpcore commented Jul 24, 2025

Had an offline discussion regarding #46 (comment): since op_strategy_context only exists in the upstream test code base, we will use replicate_for_unsupported_operators in this PR. We can remove it once op_strategy_context is available.

@zpcore requested review from XilunWu, ezyang, fmassa and wconstab July 25, 2025 00:14
replicate_op_strategy = torch.distributed.tensor._ops.utils.replicate_op_strategy


class StrategyPool:
Contributor

My question would be, if we have the context manager above, do we actually need a StrategyPool class that maintains copies of the dtensor registries? We should probably pick one approach or the other. If we use the context manager, then a way to keep track of it here could be to use an ExitStack as I mentioned in #46
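
A minimal sketch of the ExitStack idea mentioned above, assuming `register_op_strategy` and `replicate_op_strategy` from the `torch.distributed.tensor._ops.utils` module shown in the diff and the `_op_dispatcher.sharding_propagator` attributes named in the PR description; `_temporarily_replicate` and `missing_ops` are hypothetical names used only for illustration:

```python
from contextlib import ExitStack, contextmanager

import torch
from torch.distributed.tensor import DTensor
from torch.distributed.tensor._ops.utils import (
    register_op_strategy,
    replicate_op_strategy,
)


@contextmanager
def _temporarily_replicate(op):
    # Register the replicate strategy for `op`, then drop it from the
    # propagator's registry on exit so the registration does not leak.
    propagator = DTensor._op_dispatcher.sharding_propagator
    register_op_strategy(op)(replicate_op_strategy)
    try:
        yield
    finally:
        propagator.op_strategy_funcs.pop(op, None)


# The ExitStack accumulates one registration per missing op and unwinds
# them all when the block ends.
missing_ops = [torch.ops.aten.bucketize.Tensor]  # placeholder list
with ExitStack() as stack:
    for op in missing_ops:
        stack.enter_context(_temporarily_replicate(op))
    ...  # run whatever needs the fallback strategies here
```

With this shape the fallback registrations disappear on exit, matching the context-manager behavior described in the PR description, and no separate registry copy is needed.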

Contributor Author

Good point! I removed the StrategyPool, now the structure is simpler.



@contextmanager
Contributor

If we have this in the above utility file we can delete it from here right?

Contributor Author

Yes, if we can upstream `with_implicit_strategies`.

else:
    # No stack available, just register permanently
    register_op_strategy(op)(replicate_op_strategy)
Contributor

I'm confused. Won't this register the op into dtensor itself? But above we are checking whether the op is registered in our COPY of dtensor's registry, and I don't see us updating our copy. Should we just delete our copy and use this approach?

Contributor Author

`self.op_strategy_funcs` in StrategyPool is a reference to the upstream `op_strategy_funcs`, not a copy. Let me remove the reference and use the upstream `op_strategy_funcs` directly to make that clear.
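
A minimal sketch of the reference-versus-copy distinction being discussed, assuming the `_op_dispatcher.sharding_propagator.op_strategy_funcs` attribute path named in the PR description:

```python
from torch.distributed.tensor import DTensor

propagator = DTensor._op_dispatcher.sharding_propagator

# Reference: both names point at the same dict, so a strategy registered
# through either one is immediately visible to the other.
op_strategy_funcs_ref = propagator.op_strategy_funcs
assert op_strategy_funcs_ref is propagator.op_strategy_funcs

# Copy: a snapshot that silently drifts out of sync as soon as either
# side registers a new op strategy.
op_strategy_funcs_copy = dict(propagator.op_strategy_funcs)
assert op_strategy_funcs_copy is not propagator.op_strategy_funcs
```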

out_strat = get_op_strategy(op, op_schema)
Contributor

Do this in its own refactor and land it asap?

Contributor Author

Should I merge this PR first so that we can quickly play with batch sharding strategy?

# replication strategy fallback.
class CustomShardingPropagator(
    torch.distributed.tensor._sharding_prop.ShardingPropagator
):
Contributor

I'm generally down on out-of-core things like this that are very closely entwined with internal implementation details of another library we're relying on: it is unlikely that these APIs have any test coverage in pytorch, which means we're more likely to accidentally break autoparallel through otherwise safe refactoring changes. I haven't thought closely enough about what a good architecture looks like, but our default should be to make autoparallel rely only on public APIs and move anything that needs close coordination into pytorch core. (I'm OK with landing stuff in autoparallel on a temporary basis with a clear understanding that it needs to go into core.)

Contributor Author

Hah, this sounds like we need a test for the test. This is used to help quickly test strategy correctness under eager mode.

@fmassa merged commit 8df62c4 into gh/zpcore/1/base Jul 29, 2025
5 checks passed

fmassa commented Jul 29, 2025

Damn, I merged a gh-stack PR...

zpcore added a commit that referenced this pull request Jul 29, 2025
ghstack-source-id: f0db91b
Pull Request resolved: #49