
Improve view support on pointwise and transpose scheduler #1906

Merged
merged 13 commits into devel from transpose-view
Aug 16, 2022

Conversation

@zasdfgbnm (Collaborator) commented Aug 12, 2022

  • Fix a bug in getInputsOutputsWithInnerDim that made it unable to compute the inner-most dim of consumers of a view (schedulePointwise does not vectorize tensors correctly when there is a view #1905).
  • Use getInputsOutputsWithInnerDim for group-finding in the transpose scheduler. It is better than the current approach because it can correctly map the innermost dim through a view.
  • The transpose scheduler is still likely to fail when a view exists in the graph, because the transpose scheduler often uses an input tensor as the reference tensor, and this requires producer->consumer transform replay through the view. Fortunately, view merges and splits are reversible, so producer->consumer transform replay can be achieved by applying cancelling transforms on the consumer's rfactor domain to get back to its root domain, then applying transformations based on that (see the sketch after this list). This is not done in this PR, but I will write follow-up PRs for it.
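A minimal sketch of the "cancelling transforms" idea in the last bullet, using a toy record of split/merge ops rather than nvfuser's actual TensorDomain/IterDomain machinery (the `Transform` struct, `invert`, and `apply` below are illustrative assumptions, not project APIs): a split is undone by merging the two axes it produced, a merge is undone by splitting by the recorded inner extent, and replaying the inverted list on the consumer's rfactor extents recovers the root extents.

```cpp
// Toy model of a view's transform history. The idea: Split(axis, factor) is
// undone by merging the two resulting axes, and Merge(axis) is undone by
// splitting by the recorded inner extent, so replaying the inverted list on
// the rfactor domain recovers the root domain.
#include <cstdint>
#include <iostream>
#include <vector>

struct Transform {
  enum Kind { Split, Merge } kind;
  int axis;        // axis the op was applied at
  int64_t factor;  // split factor, or inner extent of the merged pair
};

// Invert a forward (root -> rfactor) transform list into a list that maps the
// rfactor domain back to the root domain.
std::vector<Transform> invert(const std::vector<Transform>& forward) {
  std::vector<Transform> backward;
  for (auto it = forward.rbegin(); it != forward.rend(); ++it) {
    Transform inv = *it;
    inv.kind = (it->kind == Transform::Split) ? Transform::Merge : Transform::Split;
    backward.push_back(inv);
  }
  return backward;
}

// Apply transforms to a vector of extents, purely for illustration.
std::vector<int64_t> apply(std::vector<int64_t> dims, const std::vector<Transform>& ts) {
  for (const auto& t : ts) {
    if (t.kind == Transform::Split) {
      dims[t.axis] = dims[t.axis] / t.factor;             // outer extent
      dims.insert(dims.begin() + t.axis + 1, t.factor);   // inner extent
    } else {                                              // merge axis with axis+1
      dims[t.axis] *= dims[t.axis + 1];
      dims.erase(dims.begin() + t.axis + 1);
    }
  }
  return dims;
}

int main() {
  // view: [6, 20] -> split(1, 5) -> [6, 4, 5] -> merge(0) -> [24, 5]
  std::vector<Transform> forward = {{Transform::Split, 1, 5}, {Transform::Merge, 0, 4}};
  std::vector<int64_t> root = {6, 20};
  auto rfactor = apply(root, forward);                // {24, 5}
  auto recovered = apply(rfactor, invert(forward));   // back to {6, 20}
  std::cout << recovered[0] << " " << recovered[1] << "\n";  // prints "6 20"
  return 0;
}
```

In the real scheduler this would have to operate on IterDomain transformations rather than plain extents, but the order reversal and the split/merge duality are the part being illustrated.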

@zasdfgbnm changed the title View support on pointwise and transpose scheduler Improve view support on pointwise and transpose scheduler Aug 12, 2022
@zasdfgbnm marked this pull request as draft August 12, 2022 05:36
@zasdfgbnm changed the title Improve view support on pointwise and transpose scheduler [WIP] Improve view support on pointwise and transpose scheduler Aug 12, 2022
@zasdfgbnm changed the title [WIP] Improve view support on pointwise and transpose scheduler Improve view support on pointwise and transpose scheduler Aug 12, 2022
@zasdfgbnm marked this pull request as ready for review August 12, 2022 07:56
Comment on lines +291 to +294
// TODO: Since group2 only has global->shared and shared->global set op, we
// can have fine-grained control of unroll/vectorization at per tensor level.
// We should not be using a single global vectorize factor for the entire
// group 2
@zasdfgbnm (Collaborator, Author) commented:
Note for future work: This also means, group2 does not need a reference tensor. Every tensor is its own reference tensor.
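A sketch of what the per-tensor control mentioned in the TODO could look like, using hypothetical tensor metadata (`TensorInfo`, `vectorizeFactorFor`, and the 16-byte-access heuristic are assumptions for illustration, not nvfuser's API): since group 2 only moves data between global and shared memory, each tensor could derive its own vectorize factor from its own dtype and innermost extent instead of sharing one global factor.

```cpp
// Hypothetical per-tensor vectorize-factor selection, assuming the innermost
// dimension is contiguous and 128-bit accesses are the widest useful width.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct TensorInfo {
  const char* name;
  int64_t inner_extent;  // extent of the contiguous innermost dimension
  int64_t dtype_bytes;   // element size in bytes
};

int64_t vectorizeFactorFor(const TensorInfo& t) {
  int64_t factor = 16 / t.dtype_bytes;  // start from a 128-bit access
  while (factor > 1 && t.inner_extent % factor != 0) {
    factor /= 2;  // fall back until the access divides the innermost extent
  }
  return std::max<int64_t>(factor, 1);
}

int main() {
  std::vector<TensorInfo> group2 = {
      {"t_float", 1024, 4},  // -> factor 4
      {"t_half", 6, 2},      // -> factor 2; no need to drag the whole group down
  };
  for (const auto& t : group2) {
    std::cout << t.name << ": vectorize by " << vectorizeFactorFor(t) << "\n";
  }
  return 0;
}
```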

@zasdfgbnm mentioned this pull request Aug 12, 2022
@csarofeen (Owner) left a comment:
LGTM generally
The only thing that makes me nervous about tracking inner dimensions through view is: what happens if the view splits the innermost dimension so that the resulting split->inner() is actually smaller than the vector size? We'd mark the dimension as a shared inner dimension, but if we propagate the transformations, the split wouldn't strictly be able to hold the vectorized dimension.
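A concrete instance of this concern, with made-up extents (the numbers below are illustrative, not taken from the PR): the pre-view innermost extent is 128 and a vector width of 4 looks fine, but the view splits that dimension as 128 -> (64, 2), so the post-view split->inner() extent of 2 cannot hold the vectorized access.

```cpp
// Illustration only: detecting that a view-introduced inner split is too small
// for the vector width chosen from the pre-view tensor.
#include <cstdint>
#include <iostream>

int main() {
  const int64_t vector_width = 4;            // factor chosen from the pre-view tensor
  const int64_t inner_after_view_split = 2;  // split->inner() extent introduced by the view
  if (inner_after_view_split < vector_width) {
    std::cout << "inner extent " << inner_after_view_split
              << " cannot hold vector width " << vector_width
              << "; marking the shared inner dim alone is not enough\n";
  }
  return 0;
}
```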

// TODO: Since group2 only has global->shared and shared->global set op, we
// can have fine-grained control of unroll/vectorization at per tensor level.
// We should not be using a single global vectorize factor for the entire
// group 2
@csarofeen (Owner) replied:
Yeah, we could consider doing this for all our schedulers, but it hasn't seemed like a pragmatic concern. We should revisit (more generically) if we see any cases where this would be beneficial.

torch/csrc/jit/codegen/cuda/ir_nodes.cpp (review thread resolved)
@csarofeen (Owner) commented:
> Fortunately, view merges and splits are reversible, so producer->consumer transform replay can be achieved by applying cancelling transforms on the consumer's rfactor domain to get back to its root domain, then applying transformations based on that. This is not done in this PR, but I will write follow-up PRs for it.

Was thinking more about this. Although the splits/merges are reversible, the way transformation propagation works (and best-effort replay generally) would prevent this from working. If we wanted to transform the output of a view so the resulting tensor view would match the input, we would then need logic that understands these transformations are reversible, so that when we forward propagate the transformations from the view, it replays them on top of the "reversed" transformations in the output. This would require understanding that the transforms were reversed and that the input domains can map directly to the top of the output domains.

This seems plausible in theory to me, but it seems more like what we're trying to do is move the view further down the DAG, as opposed to being able to "cancel out" the view. Views could be moved along the DAG, but we could actually hit some issues if, for example, a view involves a split and one of the split dimensions is reduced but not the other.

We could think about "propagating" the view towards the inputs of the fusion, but this seems more or less equivalent to just replaying towards the inputs from the output of the view.

I think if we want to consider how to schedule view with the 2D pointwise scheduler or the transpose scheduler, what we probably want to consider is whether we can support the merges required by them. I.e., the merges make up disjoint sets of iter domains, and those sets have to stay disjoint when mapped through views. In other words, if you have 2 disjoint sets making up the initial merges of your 2D scheduler, like:
i0, i1, i2 -> i0*i1, i2
then your view cannot have any merges that combine portions of [i0, i1] with any portion of [i2]. If that's the case, then we should be able to support the merges through the view. So we should be able to propagate in a way consistent with the view to the rest of the DAG, then perform the remaining merges required to get the base form the scheduler is expecting.
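A toy version of that disjointness check, using plain integers as stand-ins for iter domains (`viewRespectsGrouping` and the `Id` alias are illustrative assumptions, not nvfuser code): given the two groups the 2D scheduler intends to merge, reject any view whose merges combine a member of one group with a member of the other.

```cpp
// Reject views whose merges cross the 2D scheduler's two iter-domain groups.
#include <iostream>
#include <set>
#include <utility>
#include <vector>

using Id = int;  // i0 = 0, i1 = 1, i2 = 2, ...

bool viewRespectsGrouping(const std::set<Id>& group_a,
                          const std::set<Id>& group_b,
                          const std::vector<std::pair<Id, Id>>& view_merges) {
  for (const auto& m : view_merges) {
    bool first_in_a = group_a.count(m.first) > 0;
    bool second_in_a = group_a.count(m.second) > 0;
    bool first_in_b = group_b.count(m.first) > 0;
    bool second_in_b = group_b.count(m.second) > 0;
    // A merge that straddles the two groups makes the 2D grouping impossible.
    if ((first_in_a && second_in_b) || (first_in_b && second_in_a)) {
      return false;
    }
  }
  return true;
}

int main() {
  std::set<Id> a = {0, 1};  // {i0, i1} -> merged into the outer 2D dimension
  std::set<Id> b = {2};     // {i2}     -> the inner 2D dimension
  // ok: the view merges i0 with i1, which stays inside group a
  std::cout << viewRespectsGrouping(a, b, {{0, 1}}) << "\n";  // 1
  // not ok: the view merges i1 with i2, crossing the two groups
  std::cout << viewRespectsGrouping(a, b, {{1, 2}}) << "\n";  // 0
  return 0;
}
```

If the check passes, the scheduler's i0*i1 and i2 merges can be carried through the view; if it fails, the 2D base form the scheduler expects cannot be formed on the view's output.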

@zasdfgbnm merged commit d0d106a into devel Aug 16, 2022
@zasdfgbnm deleted the transpose-view branch August 16, 2022 23:15
jjsjann123 added a commit that referenced this pull request Oct 2, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvement:
i. improved view support on pointwise and transpose scheduler
ii. grouped grid welford added for better outer-norm grid persistence in normalization

- misc:
i. new composite ops added: variance_mean, arange
ii. fixes misaligned address for transpose scheduler
iii. refactor on separation of compilation API from execution API to prepare us for async compilation
iv. double type support on expression evaluator
v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits that are in this PR from the devel branch:
```
89330aa Tensor factories must set the output shape as its input (#1939)
b2fd01e arange support (#1933)
56c00fd Double support on all expression evaluators (#1937)
371f282 Improve trivial reduction merge support (#1931)
1d0c267 Test `rand` in a fusion with zero tensor input (#1932)
0dab160 Fix softmax bwd sizes. (#1890)
ef98f36 Fix a bug (#1936)
63132a0 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c8 Map IterationDomains through view operations. (#1919)
c0a187a do not use deprecated functions (#1935)
88de85e Upstream cherry pick fixes 0811 (#1934)
b247dcf Separate kernel compilation API from kernel execution API (#1914)
b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6 Nullary RNGOp (#1892)
3c3c89e Misc fixes/tuning for transpose scheduler (#1912)
20cf109 Grouped grid welford (#1921)
6cf7eb0 Transpose scheduler small dim sizes better support (#1910)
9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f Fix CUDA driver error: misaligned address for transpose scheduler  (#1918)
3fb3d80 Add variance_mean function using Welford (#1907)
98febf6 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a dopt is only available since nvrtc 11.7 (#1915)
2ec8fc7 Kill computeAtBetween (#1911)
d0d106a Improve view support on pointwise and transpose scheduler (#1906)
e71e1ec Fix name clash of RNG with shared memory (#1904)
3381793 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: pytorch#84626
Approved by: https://github.com/malfet
Successfully merging this pull request may close these issues.

schedulePointwise does not vectorize tensors correctly when there is a view