[asynctp] Async_tp pass and ops fork + changes; Solver addition to incentivize async_tp fusable redistributions #151
Forked `torch/_inductor/fx_passes/micro_pipeline_tp.py` and `torch/distributed/_symmetric_memory/__init__.py` for fast experimentation. PRs with the changes on top of the base version, against the PyTorch repo:
pytorch/pytorch#162794
pytorch/pytorch#163068
pytorch/pytorch#163069
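
For reference, a minimal sketch of how the stock micro-pipeline TP (async-TP) pass is enabled upstream; the forked files above may add or rename knobs, and the toy model at the end is an illustrative assumption:

```python
# Minimal sketch: enabling the inductor micro-pipeline TP (async-TP) pass
# on top of symmetric memory. Run under torchrun.
import torch
import torch.distributed as dist
import torch._inductor.config as inductor_config
from torch.distributed._symmetric_memory import enable_symm_mem_for_group

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Symmetric memory backs the fused all_gather/reduce_scatter kernels.
enable_symm_mem_for_group(dist.group.WORLD.group_name)

# Ask inductor to run the micro-pipeline TP fx pass on compiled graphs.
inductor_config._micro_pipeline_tp = True

# Toy model (assumption): fusions only fire when the compiled graph
# actually contains all_gather/reduce_scatter + matmul patterns.
model = torch.nn.Linear(32, 16, device="cuda")
out = torch.compile(model)(torch.randn(64, 32, device="cuda"))
```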
2.1 matmul + reduce_scatter: `Partial -> Shard(dim)`, where `dim` is not the last dim of the matmul output (the last dim is also supported, but requires an additional restride / `.contiguous()` inside).
2.2 all_gather + matmul: `Shard(dim) -> Replicate` for argument A of the matmul, where `dim` is not A's last dim (the contraction dim that will be reduced). Both patterns are illustrated in the sketch below.
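
A hedged DTensor sketch of the two patterns above; the mesh size, shapes, and variable names are illustrative assumptions (run under torchrun with 8 ranks):

```python
# Sketch of the two fusable redistribution patterns, expressed via DTensor.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor

mesh = init_device_mesh("cuda", (8,))

# 2.1 matmul + reduce_scatter: contracting two operands sharded on the
# inner dim yields a Partial result; Partial -> Shard(dim) lowers to a
# reduce_scatter the pass can fuse with the matmul. Here dim=0, i.e. not
# the last dim of the matmul output.
a = distribute_tensor(torch.randn(64, 32), mesh, [Shard(1)])
b = distribute_tensor(torch.randn(32, 16), mesh, [Shard(0)])
partial = a @ b                                # placement: Partial
out = partial.redistribute(mesh, [Shard(0)])   # reduce_scatter on dim 0

# 2.2 all_gather + matmul: argument A goes Shard(dim) -> Replicate, which
# lowers to an all_gather fusable with the following matmul. Here dim=0,
# i.e. not A's last (contraction) dim.
a2 = distribute_tensor(torch.randn(64, 32), mesh, [Shard(0)])
b2 = distribute_tensor(torch.randn(32, 16), mesh, [Replicate()])
out2 = a2.redistribute(mesh, [Replicate()]) @ b2
```

In eager mode these just run as separate collectives; under `torch.compile` with the pass enabled, these redistribute + matmul pairs are the graph shapes the fusion targets.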