
Improve view support on pointwise and transpose scheduler #1906

Merged
merged 13 commits into devel from transpose-view
Aug 16, 2022

Conversation

@zasdfgbnm (Collaborator) commented Aug 12, 2022

  • Fix a bug in getInputsOutputsWithInnerDim that made it unable to compute the inner-most dim of consumers of a view (schedulePointwise does not vectorize tensors correctly when there is a view #1905).
  • Use getInputsOutputsWithInnerDim for group-finding in the transpose scheduler. It is better than the current approach because it can correctly map the innermost dim through a view.
  • The transpose scheduler is still likely to fail when a view exists in the graph, because the transpose scheduler often uses an input tensor as the reference tensor, and this requires producer->consumer transform replay through the view. Fortunately, view merges and splits are reversible, so producer->consumer transform replay can be achieved by applying cancelling transforms on the consumer's rfactor domain to get back to its root domain, then applying transformations based on that (see the sketch after this list). This is not done in this PR, but I will write follow-up PRs for it.
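A minimal sketch of the "cancelling transforms" idea in the last bullet, using a toy record of split/merge ops rather than nvfuser's actual TensorDomain/IterDomain machinery (the `Transform` struct, `invert`, and `apply` below are illustrative assumptions, not project APIs): a split is undone by merging the two axes it produced, a merge is undone by splitting by the recorded inner extent, and replaying the inverted list on the consumer's rfactor extents recovers the root extents.

```cpp
// Toy model of a view's transform history. The idea: Split(axis, factor) is
// undone by merging the two resulting axes, and Merge(axis) is undone by
// splitting by the recorded inner extent, so replaying the inverted list on
// the rfactor domain recovers the root domain.
#include <cstdint>
#include <iostream>
#include <vector>

struct Transform {
  enum Kind { Split, Merge } kind;
  int axis;        // axis the op was applied at
  int64_t factor;  // split factor, or inner extent of the merged pair
};

// Invert a forward (root -> rfactor) transform list into a list that maps the
// rfactor domain back to the root domain.
std::vector<Transform> invert(const std::vector<Transform>& forward) {
  std::vector<Transform> backward;
  for (auto it = forward.rbegin(); it != forward.rend(); ++it) {
    Transform inv = *it;
    inv.kind = (it->kind == Transform::Split) ? Transform::Merge : Transform::Split;
    backward.push_back(inv);
  }
  return backward;
}

// Apply transforms to a vector of extents, purely for illustration.
std::vector<int64_t> apply(std::vector<int64_t> dims, const std::vector<Transform>& ts) {
  for (const auto& t : ts) {
    if (t.kind == Transform::Split) {
      dims[t.axis] = dims[t.axis] / t.factor;             // outer extent
      dims.insert(dims.begin() + t.axis + 1, t.factor);   // inner extent
    } else {                                              // merge axis with axis+1
      dims[t.axis] *= dims[t.axis + 1];
      dims.erase(dims.begin() + t.axis + 1);
    }
  }
  return dims;
}

int main() {
  // view: [6, 20] -> split(1, 5) -> [6, 4, 5] -> merge(0) -> [24, 5]
  std::vector<Transform> forward = {{Transform::Split, 1, 5}, {Transform::Merge, 0, 4}};
  std::vector<int64_t> root = {6, 20};
  auto rfactor = apply(root, forward);                // {24, 5}
  auto recovered = apply(rfactor, invert(forward));   // back to {6, 20}
  std::cout << recovered[0] << " " << recovered[1] << "\n";  // prints "6 20"
  return 0;
}
```

In the real scheduler this would have to operate on IterDomain transformations rather than plain extents, but the order reversal and the split/merge duality are the part being illustrated.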

@zasdfgbnm changed the title View support on pointwise and transpose scheduler Improve view support on pointwise and transpose scheduler Aug 12, 2022
@zasdfgbnm marked this pull request as draft August 12, 2022 05:36
@zasdfgbnm changed the title Improve view support on pointwise and transpose scheduler [WIP] Improve view support on pointwise and transpose scheduler Aug 12, 2022
@zasdfgbnm changed the title [WIP] Improve view support on pointwise and transpose scheduler Improve view support on pointwise and transpose scheduler Aug 12, 2022
@zasdfgbnm marked this pull request as ready for review August 12, 2022 07:56
Comment on lines +291 to +294
// TODO: Since group2 only has global->shared and shared->global set op, we
// can have fine-grained control of unroll/vectorization at per tensor level.
// We should not be using a single global vectorize factor for the entire
// group 2
@zasdfgbnm (Collaborator, Author) commented:
Note for future work: This also means, group2 does not need a reference tensor. Every tensor is its own reference tensor.
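A sketch of what the per-tensor control mentioned in the TODO could look like, using hypothetical tensor metadata (`TensorInfo`, `vectorizeFactorFor`, and the 16-byte-access heuristic are assumptions for illustration, not nvfuser's API): since group 2 only moves data between global and shared memory, each tensor could derive its own vectorize factor from its own dtype and innermost extent instead of sharing one global factor.

```cpp
// Hypothetical per-tensor vectorize-factor selection, assuming the innermost
// dimension is contiguous and 128-bit accesses are the widest useful width.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct TensorInfo {
  const char* name;
  int64_t inner_extent;  // extent of the contiguous innermost dimension
  int64_t dtype_bytes;   // element size in bytes
};

int64_t vectorizeFactorFor(const TensorInfo& t) {
  int64_t factor = 16 / t.dtype_bytes;  // start from a 128-bit access
  while (factor > 1 && t.inner_extent % factor != 0) {
    factor /= 2;  // fall back until the access divides the innermost extent
  }
  return std::max<int64_t>(factor, 1);
}

int main() {
  std::vector<TensorInfo> group2 = {
      {"t_float", 1024, 4},  // -> factor 4
      {"t_half", 6, 2},      // -> factor 2; no need to drag the whole group down
  };
  for (const auto& t : group2) {
    std::cout << t.name << ": vectorize by " << vectorizeFactorFor(t) << "\n";
  }
  return 0;
}
```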

@zasdfgbnm mentioned this pull request Aug 12, 2022
@csarofeen (Owner) left a comment:
LGTM generally
The only thing that makes me nervous about tracking inner dimensions through view is: what happens if the view splits the innermost dimension so that the resulting split->inner() is actually smaller than the vector size? We'd mark the dimension as a shared inner dimension, but if we propagate the transformations, the split wouldn't strictly be able to hold the vectorized dimension.
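A concrete instance of this concern, with made-up extents (the numbers below are illustrative, not taken from the PR): the pre-view innermost extent is 128 and a vector width of 4 looks fine, but the view splits that dimension as 128 -> (64, 2), so the post-view split->inner() extent of 2 cannot hold the vectorized access.

```cpp
// Illustration only: detecting that a view-introduced inner split is too small
// for the vector width chosen from the pre-view tensor.
#include <cstdint>
#include <iostream>

int main() {
  const int64_t vector_width = 4;            // factor chosen from the pre-view tensor
  const int64_t inner_after_view_split = 2;  // split->inner() extent introduced by the view
  if (inner_after_view_split < vector_width) {
    std::cout << "inner extent " << inner_after_view_split
              << " cannot hold vector width " << vector_width
              << "; marking the shared inner dim alone is not enough\n";
  }
  return 0;
}
```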

// TODO: Since group2 only has global->shared and shared->global set op, we
// can have fine-grained control of unroll/vectorization at per tensor level.
// We should not be using a single global vectorize factor for the entire
// group 2
@csarofeen (Owner) replied:
Yeah, we could consider doing this for all our schedulers, but it hasn't seemed like a pragmatic concern. We should revisit (more generically) if we see any cases where this would be beneficial.

torch/csrc/jit/codegen/cuda/ir_nodes.cpp (review thread resolved)
@csarofeen (Owner) commented:
> Fortunately, view merges and splits are reversible, so producer->consumer transform replay can be achieved by applying cancelling transforms on the consumer's rfactor domain to get back to its root domain, then applying transformations based on that. This is not done in this PR, but I will write follow-up PRs for it.

Was thinking more about this. Although the splits/merges are reversible, the way transformation propagation works (and best-effort replay generally) would prevent this from working. If we wanted to transform the output of a view so the resulting tensor view would match the input, we would then need logic that understands these transformations are reversible, so that when we forward propagate the transformations from the view, it replays them on top of the "reversed" transformations in the output. This would require understanding that the transforms were reversed and that the input domains can map directly to the top of the output domains.

This seems plausible in theory to me, but it seems more like what we're trying to do is move the view further down the DAG, as opposed to being able to "cancel out" the view. Views could be moved along the DAG, but we could actually hit some issues if, for example, a view involves a split and one of the split dimensions is reduced but not the other.

We could think about "propagating" the view towards the inputs of the fusion, but this seems more or less equivalent to just replaying towards the inputs from the output of the view.

I think if we want to consider how to schedule view with the 2D pointwise scheduler or the transpose scheduler, what we probably want to consider is whether we can support the merges required by them. I.e., the merges make up disjoint sets of iter domains, and those sets have to stay disjoint when mapped through views. In other words, if you have 2 disjoint sets making up the initial merges of your 2D scheduler, like:
i0, i1, i2 -> i0*i1, i2
then your view cannot have any merges that combine portions of [i0, i1] with any portion of [i2]. If that's the case, then we should be able to support the merges through the view. So we should be able to propagate in a way consistent with the view to the rest of the DAG, then perform the remaining merges required to get the base form the scheduler is expecting.
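A toy version of that disjointness check, using plain integers as stand-ins for iter domains (`viewRespectsGrouping` and the `Id` alias are illustrative assumptions, not nvfuser code): given the two groups the 2D scheduler intends to merge, reject any view whose merges combine a member of one group with a member of the other.

```cpp
// Reject views whose merges cross the 2D scheduler's two iter-domain groups.
#include <iostream>
#include <set>
#include <utility>
#include <vector>

using Id = int;  // i0 = 0, i1 = 1, i2 = 2, ...

bool viewRespectsGrouping(const std::set<Id>& group_a,
                          const std::set<Id>& group_b,
                          const std::vector<std::pair<Id, Id>>& view_merges) {
  for (const auto& m : view_merges) {
    bool first_in_a = group_a.count(m.first) > 0;
    bool second_in_a = group_a.count(m.second) > 0;
    bool first_in_b = group_b.count(m.first) > 0;
    bool second_in_b = group_b.count(m.second) > 0;
    // A merge that straddles the two groups makes the 2D grouping impossible.
    if ((first_in_a && second_in_b) || (first_in_b && second_in_a)) {
      return false;
    }
  }
  return true;
}

int main() {
  std::set<Id> a = {0, 1};  // {i0, i1} -> merged into the outer 2D dimension
  std::set<Id> b = {2};     // {i2}     -> the inner 2D dimension
  // ok: the view merges i0 with i1, which stays inside group a
  std::cout << viewRespectsGrouping(a, b, {{0, 1}}) << "\n";  // 1
  // not ok: the view merges i1 with i2, crossing the two groups
  std::cout << viewRespectsGrouping(a, b, {{1, 2}}) << "\n";  // 0
  return 0;
}
```

If the check passes, the scheduler's i0*i1 and i2 merges can be carried through the view; if it fails, the 2D base form the scheduler expects cannot be formed on the view's output.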

@zasdfgbnm merged commit d0d106a into devel Aug 16, 2022
@zasdfgbnm deleted the transpose-view branch August 16, 2022 23:15
jjsjann123 added a commit that referenced this pull request Oct 2, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvement:
i. improved view support on pointwise and transpose scheduler
ii. grouped grid welford added for better outer-norm grid persistence in normalization

- misc:
i. new composite ops added: variance_mean, arange
ii. fixes misaligned address for transpose scheduler
iii. refactor on separation of compilation API from execution API to prepare us for async compilation
iv. double type support on expression evaluator
v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits that are in this PR from the devel branch:
```
89330aa Tensor factories must set the output shape as its input (#1939)
b2fd01e arange support (#1933)
56c00fd Double support on all expression evaluators (#1937)
371f282 Improve trivial reduction merge support (#1931)
1d0c267 Test `rand` in a fusion with zero tensor input (#1932)
0dab160 Fix softmax bwd sizes. (#1890)
ef98f36 Fix a bug (#1936)
63132a0 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c8 Map IterationDomains through view operations. (#1919)
c0a187a do not use deprecated functions (#1935)
88de85e Upstream cherry pick fixes 0811 (#1934)
b247dcf Separate kernel compilation API from kernel execution API (#1914)
b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6 Nullary RNGOp (#1892)
3c3c89e Misc fixes/tuning for transpose scheduler (#1912)
20cf109 Grouped grid welford (#1921)
6cf7eb0 Transpose scheduler small dim sizes better support (#1910)
9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f Fix CUDA driver error: misaligned address for transpose scheduler  (#1918)
3fb3d80 Add variance_mean function using Welford (#1907)
98febf6 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a dopt is only available since nvrtc 11.7 (#1915)
2ec8fc7 Kill computeAtBetween (#1911)
d0d106a Improve view support on pointwise and transpose scheduler (#1906)
e71e1ec Fix name clash of RNG with shared memory (#1904)
3381793 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: pytorch#84626
Approved by: https://github.com/malfet
Successfully merging this pull request may close these issues.

schedulePointwise does not vectorize tensors correctly when there is a view