Larger sized mma instructions to support full vectorization #1824

shmsong · 2022-07-14T05:02:26Z

This is a continuation of #1823.

This PR introduces a 16x16x16 macro tile for Ampere and Turing so that ldmatrix.x4 can be used for both operands. The macro tile will not be necessary long term and is actually sub-optimal, although it is not a perf blocker today. It should still get us to a reasonable state that unblocks schedule development.
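As a rough illustration of why the larger tile helps, the sketch below counts how many 8x8 tiles of 16-bit elements (the unit that `ldmatrix` loads, with `.x1`/`.x2`/`.x4` loading one, two or four of them) make up each operand. It assumes fp16 operands and the native `m16n8k8` (Turing) and `m16n8k16` (Ampere) mma shapes; the helper names `operand_fragments` and `widest_ldmatrix` are hypothetical and are not part of nvfuser.

```python
# ldmatrix moves 8x8 tiles of 16-bit elements; .x1/.x2/.x4 load 1, 2 or 4 tiles.
LDMATRIX_TILE_ELEMS = 8 * 8
LDMATRIX_VARIANTS = (1, 2, 4)


def operand_fragments(m, n, k):
    """Number of 8x8 fp16 tiles in the A (m x k) and B (k x n) operands."""
    return {
        "A": (m * k) // LDMATRIX_TILE_ELEMS,
        "B": (k * n) // LDMATRIX_TILE_ELEMS,
    }


def widest_ldmatrix(num_fragments):
    """Widest ldmatrix variant that evenly covers the given number of tiles."""
    return max(
        v for v in LDMATRIX_VARIANTS
        if v <= num_fragments and num_fragments % v == 0
    )


# Shapes are (M, N, K). The last entry is the macro tile from this PR.
shapes = {
    "m16n8k8 (Turing)": (16, 8, 8),
    "m16n8k16 (Ampere)": (16, 8, 16),
    "16x16x16 macro tile": (16, 16, 16),
}

for name, (m, n, k) in shapes.items():
    frags = operand_fragments(m, n, k)
    loads = {op: f"ldmatrix.x{widest_ldmatrix(c)}" for op, c in frags.items()}
    print(f"{name}: fragments={frags}, loads={loads}")
```

Under these assumptions the B operand of a single `m16n8k16` instruction covers only two 8x8 tiles, so it can use at most `ldmatrix.x2`, while both operands of the 16x16x16 tile cover four tiles each.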

Further optimizations and cleanups for both this PR and #1823 will come in follow-ups, but they are lower priority than the next few PRs.


csarofeen

LGTM

Quick question: why do you allclose? Do our tolerances not work here in validator?

shmsong · 2022-07-29T18:35:57Z

> Quick question: why do you allclose? Do our tolerances not work here in validator?

Yes, the validator tries to use an atol of around 1e-5, which makes the test flaky, so I am temporarily relaxing it to 1e-4 for simplicity.
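For reference, a minimal sketch of the allclose criterion, `|actual - expected| <= atol + rtol * |expected|`, showing how the same output can fail at an atol of 1e-5 and pass at 1e-4. The data, the 5e-5 error, and `rtol=0.0` are made up for illustration and are not taken from the actual test or validator.

```python
def allclose(actual, expected, rtol, atol):
    """Elementwise |a - e| <= atol + rtol * |e|, as in torch.allclose."""
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


# Hypothetical reference values and a kernel output that is off by
# an accumulated rounding error of about 5e-5 in every element.
expected = [0.25, -1.5, 3.0]
actual = [e + 5e-5 for e in expected]

tight = allclose(actual, expected, rtol=0.0, atol=1e-5)
relaxed = allclose(actual, expected, rtol=0.0, atol=1e-4)
print(f"atol=1e-5 passes: {tight}")
print(f"atol=1e-4 passes: {relaxed}")
```

An error that sits between the two thresholds is exactly the kind of case that would make a test pass or fail depending on the input data.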

We would need to do a deeper dive into precision in follow-ups. It looks like serial reduction over a large K generally loses quite a bit of precision.
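The effect of reduction order can be sketched in a few lines. The example below is an illustration only: it emulates half-precision accumulation (chosen as an assumption so that the loss shows up at a small K; the same mechanism applies to float32 accumulation at much larger K) and compares a serial sum with a pairwise (tree) sum. It is not the reduction that the generated kernels perform.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def serial_sum(values):
    """Accumulate left to right, rounding to half after every add."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + v)
    return acc


def pairwise_sum(values):
    """Tree reduction, rounding to half after every add."""
    if len(values) == 1:
        return to_half(values[0])
    mid = len(values) // 2
    return to_half(pairwise_sum(values[:mid]) + pairwise_sum(values[mid:]))


K = 4096
ones = [1.0] * K  # the exact sum is K

serial = serial_sum(ones)
pairwise = pairwise_sum(ones)
print(f"exact={K} serial={serial} pairwise={pairwise}")
```

In half precision the serial accumulator stalls once it reaches 2048, because 2048 + 1 is not representable and rounds back to 2048, whereas the pairwise sum only ever adds partial sums of equal magnitude and stays exact for this input.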

shmsong · 2022-07-29T18:36:41Z

Going to test this one on both Ampere and Turing before merging.

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- codegen improvements:
  1. removes unnecessary sync from redundant thread compute analysis
  2. symmetric API for BestEffortReplay
  3. support merge on trivial reductions
  4. Ampere async copy improvements
- bug fixes:
  1. vectorization bug fixes
  2. type inference patch: fixes upstream pytorch#81725
  3. segmenter bug fix with deterministic iteration ordering
- parser update
  1. added leaky_relu
- scheduler
  1. normalization scheduler clean up
  2. simplifies matmul scheduling with new transform propagator
  3. merge all dimensions in PW scheduler
  4. various gemm related improvements
- debuggability
  1. nsight compute support
  2. debug dump for InlinePropagator
  3. Add `UnaryOpType::Print`

Squashed commits to WAR github API

Commits that are actually in this PR from the devel branch:

```
dfe02f3 Merge remote-tracking branch 'csarofeen/devel' into HEAD
1617373 Add `TensorViewBuilder::shape(std::vector<Val*> shape)` (#1884)
7cfb779 Merge pull request #1887 from csarofeen/upstream_merge_0803
3399f6d Merge remote-tracking branch 'origin/viable/strict' into HEAD
01208f5 Add `UnaryOpType::Print` which can be helpful for debugging (#1878)
0646522 Remove redundant TORCH_INTERNAL_ASSERT in lower_magic_zero.cpp (#1881)
7bc76aa Fix most inlined propagator for mismatched dims (#1875)
501f4aa Nonaffine swizzle formulation ep.2: Loop swizzle variant. (#1826)
d863d69 Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load (#1827)
e0ae11a Larger sized mma instructions to support full vectorization (#1824)
9bb4cf7 fragment iteration to support fully unrolled mma ops (#1823)
a48270a Merge all dims in pointwise scheduler (#1872)
172fb36 Make MostInlined and BestEffort inline propagation no longer assert replayed (#1868)
a64462a Allow trivial reduction to be merged (#1871)
440102b Symmetric API for BestEffortReplay (#1870)
d1caf33 Some misc cleanups/refactor split out from #1854 (#1867)
1013eda Remove some welford specific logic. (#1864)
51589d3 Some cleanups on tests and heuristics params (#1866)
a6b3e70 Segmenter bug fix, and deterministic iteration ordering. (#1865)
1b665b9 Add nullptr checks to IrBuilder (#1861)
1cd9451 Simplify matmul scheduling with the new transform propagator. (#1817)
bbc1fb9 Add leaky_relu operation (#1852)
e842a9b Minor cleanup in pointwise scheduler (#1858)
9ee850c Fix stringstream usage (#1857)
20a36c1 Improve nsight compute support (#1855)
4059103 Remove debugging `true ||` from getPointwiseHeuristics (#1822)
01117bf Misc cleanup (#1853)
5cc6494 Apply the magic-zero protection to each indexed domain individually for predicate indexing (#1846)
92e6f02 Cleanup normalization scheduler (#1845)
db89c65 Type inference patch (#1848)
102fe93 Add debug dump for InlinePropagator (#1847)
b7a4d93 Redundant thread compute analysis to avoid un-necessary sync insertion (#1687)
942be5b Upstream ci build fixes (#1842)
0b83645 Fix vectorization bug introduced in #1831 (#1840)
63630f1 Move MaxProducerPosUpdater into InlinePropagator::tearDown (#1825)
9135a96 Fix transpose benchmark dtype (#1839)
2c9a6c0 Add extra configurability to `parallelizeAllLike` (#1831)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38543000](https://our.internmc.facebook.com/intern/diff/D38543000)

Pull Request resolved: pytorch#83067

Approved by: https://github.com/davidberard98

shmsong and others added 30 commits July 11, 2022 22:15

use custom propagator in ampere TN

6f5ba21

add tile ordering utilities

2329caf

initial matmul scheduler implementation

121af43

use matmul scheduler prototype on ampere and turing test cases

f958c53

extend to support Volta

397f74c

minor cleanup

00d9a57

comment cleanup

d7035aa

minor fix

9ffc61d

add fragment iteration and use it in matmul scheduler

ed0f525

use scheduler params for tests

c972116

fragment support in double buffer

d12a90f

add register double buffering test cases

c306b9b

add ampere large mma op

dead4ba

add large Turing mma op

50d9444

clean up custom transform propagator

63f561f

Merge remote-tracking branch 'origin/devel' into matmul_propagator

3d47c1f

rebase fix

29f88c7

comment

d029b9f

move bounded selector to common area

5ac053f

Add logic to handle fake boundary tensors in selection.

b51d247

naming and comment

aba5087

remove unused parameters from mma node

426c381

remove unnecessary parameters from mma ir node

6d4f377

rename scheduling variables

5e1f41f

change accumulator tv interface

1960da9

Update torch/csrc/jit/codegen/cuda/scheduler/utils.h

3a411c2

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>

PR feedback

8f2e4da

Merge branch 'matmul_propagator' of https://github.com/csarofeen/pytorch into matmul_propagator

eef3a97

pipe through parallel type position

6ad2967

Merge remote-tracking branch 'origin/devel' into matmul_propagator

65c8f0a

shmsong added 8 commits July 20, 2022 16:38

Revert "fragment support in double buffer"

cd03b00

This reverts commit d12a90f.

Merge branch 'matmul_propagator' into fragment_iter

380dd66

use cache op to handle double buffer input

6ce6ff6

add more comment in matmul scheduler

62f09fc

more comments

538aa8b

comment fix

91f44fd

Merge branch 'fragment_iter' into large_tile_mma_op

2668493

naming and comments

7a27e84

shmsong changed the title ~~WIP: [Not ready for review] Larger sized mma instructions to support full vectorization~~ Larger sized mma instructions to support full vectorization Jul 21, 2022

shmsong requested a review from csarofeen July 21, 2022 07:19

csarofeen approved these changes Jul 29, 2022


Base automatically changed from fragment_iter to devel July 29, 2022 18:29

Merge remote-tracking branch 'origin/devel' into large_tile_mma_op

1bc25fd

shmsong merged commit e0ae11a into devel Jul 30, 2022

shmsong deleted the large_tile_mma_op branch July 30, 2022 01:47
