Nonaffine swizzle formulation ep.2: Loop swizzle variant. #1826

shmsong · 2022-07-15T04:07:00Z

This PR introduces the concept of loop swizzle, along with initial support to unblock matmul use cases. More details are in the inlined comments.

Remaining items:

Rebase and incorporate loop swizzle in matmul scheduler.
Add validation checks for loop swizzle.
(After merging fragment iteration to support fully unrolled mma ops #1823, add block swizzle to the matmul parameters).

(In a follow up) pipe through loop swizzle in replay logic.

csarofeen

Nice cleanup in the tests.
LGTM

torch/csrc/jit/codegen/cuda/lower_validation.cpp

csarofeen · 2022-07-31T01:53:14Z

torch/csrc/jit/codegen/cuda/ir_internal_nodes.h

+ // `Data` mode swizzling is a swizzle that will change the
+ // data layout in shared memory, likely in global memory buffers
+ // as well in the future and consumers of data-mode-swizzled buffers
+ // will have to swizzle the read index accordingly. Most important


Just having trouble parsing "and consumers of data-mode-swizzled buffers will have to swizzle the read index accordingly."

Removed this sentence; the comment now points to IndexSwizzle instead.

The sentence just says that data stored in shared memory has a swizzled layout, which changes the indexing for next-stage consumers. The difference in the index math on Tshared between the data swizzle and loop swizzle examples in the comment illustrates this.
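To make the distinction concrete, here is a minimal host-side sketch, not the nvfuser implementation. It assumes a hypothetical XOR swizzle (`xorSwizzle`) on a small `kN x kN` tile; every name below is made up for illustration. In the data-mode variant the stored layout changes, so the reader has to apply the same swizzle to its index. In the loop-mode variant only the visiting order changes and the layout stays the same.

```cpp
#include <array>
#include <cassert>

// Illustrative tile size (assumption); must be a power of two so that
// x ^ y stays inside [0, kN).
constexpr int kN = 4;
using Tile = std::array<int, kN * kN>;

// Hypothetical XOR swizzle: within row x, column y maps to column x ^ y.
constexpr int xorSwizzle(int x, int y) { return x ^ y; }

// Data-mode sketch: element (x, y) is stored at the swizzled column,
// so the buffer layout differs from the input layout.
Tile writeDataSwizzled(const Tile& in) {
  Tile smem{};
  for (int x = 0; x < kN; ++x) {
    for (int y = 0; y < kN; ++y) {
      smem[x * kN + xorSwizzle(x, y)] = in[x * kN + y];
    }
  }
  return smem;
}

// A consumer of the data-swizzled buffer applies the same swizzle to
// its read index to get back the logical element (x, y).
int readDataSwizzled(const Tile& smem, int x, int y) {
  return smem[x * kN + xorSwizzle(x, y)];
}

// Loop-mode sketch: the swizzled index is used on both sides of the copy,
// so only the iteration order is permuted and the layout is unchanged.
Tile writeLoopSwizzled(const Tile& in) {
  Tile smem{};
  for (int x = 0; x < kN; ++x) {
    for (int y = 0; y < kN; ++y) {
      const int ys = xorSwizzle(x, y);
      smem[x * kN + ys] = in[x * kN + ys];
    }
  }
  return smem;
}
```

With this toy setup, a consumer of the loop-swizzled buffer can index it as `smem[x * kN + y]` directly, while a consumer of the data-swizzled buffer has to go through `readDataSwizzled`.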

torch/csrc/jit/codegen/cuda/test/test_gpu.cpp

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes includes:

- codegen improvements:
  1. removes un-necessary sync from redundant thread compute analysis
  2. symmetric API for BestEffortReplay
  3. support merge on trivial reductions
  4. Ampere async copy improvements
- bug fixes:
  1. vectorization bug fixes
  2. type inference patch : fixes upstream pytorch#81725
  3. segmenter bug fix with deterministic iteration ordering
- parser update
  1. added leaky_relu
- scheduler
  1. normalization scheduler clean up.
  2. simplifies matmul scheduling with new transform propagator
  3. merge all dimensions in PW scheduler
  4. various gemm related improvements
- debuggability
  1. nsight compute support
  2. debug dump for InlinePropagator
  3. Add `UnaryOpType::Print`

Squashed commits to WAR github API

Commits that's actually in this PR from the devel branch:

```
dfe02f3 Merge remote-tracking branch 'csarofeen/devel' into HEAD
1617373 Add `TensorViewBuilder::shape(std::vector<Val*> shape)` (#1884)
7cfb779 Merge pull request #1887 from csarofeen/upstream_merge_0803
3399f6d Merge remote-tracking branch 'origin/viable/strict' into HEAD
01208f5 Add `UnaryOpType::Print` which can be helpful for debugging (#1878)
0646522 Remove redundant TORCH_INTERNAL_ASSERT in lower_magic_zero.cpp (#1881)
7bc76aa Fix most inlined propagator for mismatched dims (#1875)
501f4aa Nonaffine swizzle formulation ep.2: Loop swizzle variant. (#1826)
d863d69 Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load (#1827)
e0ae11a Larger sized mma instructions to support full vectorization (#1824)
9bb4cf7 fragment iteration to support fully unrolled mma ops (#1823)
a48270a Merge all dims in pointwise scheduler (#1872)
172fb36 Make MostInlined and BestEffort inline propagation no longer assert replayed (#1868)
a64462a Allow trivial reduction to be merged (#1871)
440102b Symmetric API for BestEffortReplay (#1870)
d1caf33 Some misc cleanups/refactor split out from #1854 (#1867)
1013eda Remove some welford specific logic. (#1864)
51589d3 Some cleanups on tests and heuristics params (#1866)
a6b3e70 Segmenter bug fix, and deterministic iteration ordering. (#1865)
1b665b9 Add nullptr checks to IrBuilder (#1861)
1cd9451 Simplify matmul scheduling with the new transform propagator. (#1817)
bbc1fb9 Add leaky_relu operation (#1852)
e842a9b Minor cleanup in pointwise scheduler (#1858)
9ee850c Fix stringstream usage (#1857)
20a36c1 Improve nsight compute support (#1855)
4059103 Remove debugging `true ||` from getPointwiseHeuristics (#1822)
01117bf Misc cleanup (#1853)
5cc6494 Apply the magic-zero protection to each indexed domain individually for predicate indexing (#1846)
92e6f02 Cleanup normalization scheduler (#1845)
db89c65 Type inference patch (#1848)
102fe93 Add debug dump for InlinePropagator (#1847)
b7a4d93 Redundant thread compute analysis to avoid un-necessary sync insertion (#1687)
942be5b Upstream ci build fixes (#1842)
0b83645 Fix vectorization bug introduced in #1831 (#1840)
63630f1 Move MaxProducerPosUpdater into InlinePropagator::tearDown (#1825)
9135a96 Fix transpose benchmark dtype (#1839)
2c9a6c0 Add extra configurability to `parallelizeAllLike` (#1831)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38543000](https://our.internmc.facebook.com/intern/diff/D38543000)

Pull Request resolved: pytorch#83067

Approved by: https://github.com/davidberard98

shmsong added 11 commits July 14, 2022 21:04

loop swizzle definition

86345c6

Merge remote-tracking branch 'origin/devel' into loop_swizzle

9ecf34f

Merge remote-tracking branch 'origin/devel' into loop_swizzle

2f5e8d1

rebase fix

55055a8

format and validation

3512623

add swizzle checks

b47ac32

unify matmul tests

bfa3061

rebase fix

878434d

(TO REBASE) enable block swizzle by default

1a0eb0f

comment

7a72d44

more comment

a4e67ca

shmsong changed the title ~~WIP: [Not ready for review] Nonaffine swizzle formulation ep.2: Loop swizzle variant.~~ Nonaffine swizzle formulation ep.2: Loop swizzle variant. Jul 26, 2022

shmsong requested a review from csarofeen July 26, 2022 06:55

csarofeen approved these changes Jul 31, 2022

shmsong added 4 commits July 30, 2022 22:41

Merge remote-tracking branch 'origin/devel' into loop_swizzle

7456114

cleanup tests

c8618eb

cleanup

9dcf45d

comment

4fabfa4

shmsong merged commit 501f4aa into devel Jul 31, 2022

shmsong deleted the loop_swizzle branch July 31, 2022 19:26
