Improve view support on pointwise and transpose scheduler #1906
Conversation
// TODO: Since group2 only has global->shared and shared->global set op, we
// can have fine-grained control of unroll/vectorization at per tensor level.
// We should not be using a single global vectorize factor for the entire
// group 2
Note for future work: this also means group2 does not need a reference tensor; every tensor is its own reference tensor.
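To make the TODO concrete, here is a minimal, self-contained sketch (not nvfuser's actual scheduler code; `TensorInfo` and `perTensorVectorFactor` are hypothetical names) of how per-tensor vectorization factors would differ from a single global factor that is forced down to the worst tensor in the group:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <vector>

// Hypothetical per-tensor description: innermost extent and element size.
struct TensorInfo {
  const char* name;
  int64_t inner_extent; // extent of the innermost (contiguous) dimension
  int64_t dtype_bytes;  // element size in bytes
};

// Largest power-of-two vector width (in elements) that divides the innermost
// extent and stays within a 16-byte (128-bit) access.
int64_t perTensorVectorFactor(const TensorInfo& t) {
  int64_t vec = 16 / t.dtype_bytes;             // start from the 128-bit limit
  while (vec > 1 && t.inner_extent % vec != 0)  // shrink until it divides evenly
    vec /= 2;
  return vec;
}

int main() {
  // group2 in this discussion only moves data global<->shared, so each tensor
  // could, in principle, pick its own factor instead of sharing one.
  std::vector<TensorInfo> group2 = {
      {"tv_half", 1024, 2}, // fp16 with a friendly extent: factor 8
      {"tv_float", 6, 4},   // fp32 with an awkward extent: only factor 2
  };

  int64_t global_factor = std::numeric_limits<int64_t>::max();
  for (const auto& t : group2) {
    int64_t vec = perTensorVectorFactor(t);
    global_factor = std::min(global_factor, vec);
    std::cout << t.name << ": per-tensor factor " << vec << "\n";
  }
  // A single global factor is dragged down to the worst tensor in the group.
  std::cout << "single global factor: " << global_factor << "\n";
}
```

The point is only that a per-tensor decision keeps the well-shaped tensor at its full vector width, whereas one shared factor penalizes every tensor in the group.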
LGTM generally
The only thing that makes me nervous about tracking inner dimensions through view is what happens if the view splits the innermost dimension so that the resulting split->inner() is actually smaller than the vector size. We'd mark the dimension as a shared inner dimension, but if we propagate the transformations, the split wouldn't strictly be able to hold the vectorized dimension.
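A small self-contained sketch of that hazard, assuming an illustrative reshape of [8, 16] to [8, 8, 2] and a vectorization factor of 4 (this is not nvfuser code; it only shows how the split's inner extent can end up too small to hold the vectorized dimension):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// A producer with innermost extent 16 is reshaped by a view that splits that
// dimension into {8, 2}, so the consumer's innermost extent (split->inner())
// drops to 2 and can no longer hold a factor-4 vectorized access.
int main() {
  const int64_t vec_factor = 4; // e.g. 4 x fp32 = 128-bit access

  std::vector<int64_t> in_shape = {8, 16};    // innermost extent 16
  std::vector<int64_t> out_shape = {8, 8, 2}; // view splits 16 -> {8, 2}

  int64_t producer_inner = in_shape.back();
  int64_t consumer_inner = out_shape.back();

  std::cout << "producer inner extent " << producer_inner << ": "
            << (producer_inner % vec_factor == 0 ? "can" : "cannot")
            << " hold vectorize factor " << vec_factor << "\n";
  std::cout << "consumer inner extent " << consumer_inner << ": "
            << (consumer_inner % vec_factor == 0 ? "can" : "cannot")
            << " hold vectorize factor " << vec_factor << "\n";
  // Both tensors would still be mapped as sharing an inner dimension, but
  // propagating the vectorized split onto an extent-2 domain would be invalid.
}
```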
// TODO: Since group2 only has global->shared and shared->global set op, we
// can have fine-grained control of unroll/vectorization at per tensor level.
// We should not be using a single global vectorize factor for the entire
// group 2
Yeah, we could consider doing this for all our schedulers, but it hasn't seemed like a pragmatic concern. We should revisit (more generically) if we see any cases where this would be beneficial.
Was thinking more about this. Although the split/merges are reversible, the way transformation propagation works (and, more generally, best-effort replay) would prevent this from working. If we wanted to transform the output of a view so the resulting tensor view would match the input, we would need logic that understands these transformations are reversible, so that when we forward-propagate the transformations from the view, it replays them onto the top of the "reversed" transformations in the output. This would require understanding that the transforms were reversed and that the input domains map directly to the top of the output domains. That seems plausible in theory to me, but it seems more like what we're really trying to do is move the view further down the DAG, as opposed to being able to "cancel out" the view.

Views could be moved along the DAG, but we could hit some issues if, for example, a view involves a split and one of the split dimensions is reduced but not the other. We could also think about "propagating" the view towards the inputs of the fusion, but this seems more or less equivalent to just replaying towards the inputs from the output of the view.

I think if we want to consider how to schedule view with the 2D pointwise scheduler or the transpose scheduler, what we probably want to consider is whether we can support the merges they require. I.e., the merges make up disjoint sets of iter domains, and those sets have to remain disjoint when mapped through views. In other words, if you have 2 disjoint sets making up the initial merges of your 2D scheduler, like:
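As a rough, made-up illustration of that disjoint-set condition (the iter-domain names and the `crossesPartition` helper below are purely illustrative, not the original example or nvfuser's representation): two disjoint sets such as {A, B} and {C, D} stay schedulable only if no view combines domains from both sides.

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

using IdSet = std::set<std::string>;

// The 2D scheduler's initial merges partition the iter domains into two sets;
// a fusion with view is schedulable this way only if every group of root
// domains the view combines stays within one of those sets.
bool crossesPartition(const IdSet& lhs, const IdSet& rhs,
                      const std::vector<IdSet>& view_groups) {
  for (const auto& g : view_groups) {
    bool touches_lhs = false, touches_rhs = false;
    for (const auto& id : g) {
      touches_lhs |= lhs.count(id) > 0;
      touches_rhs |= rhs.count(id) > 0;
    }
    if (touches_lhs && touches_rhs)
      return true; // this view group straddles the 2D split: not disjoint
  }
  return false;
}

int main() {
  // 2D scheduler merges {A, B} into one dimension and {C, D} into the other.
  IdSet lhs = {"A", "B"}, rhs = {"C", "D"};

  // A view that merges B and C combines domains from both sides.
  std::vector<IdSet> bad_view = {{"B", "C"}};
  // A view that only merges C and D stays within one side.
  std::vector<IdSet> ok_view = {{"C", "D"}};

  std::cout << std::boolalpha
            << "view merging B,C crosses the 2D split: "
            << crossesPartition(lhs, rhs, bad_view) << "\n"
            << "view merging C,D crosses the 2D split: "
            << crossesPartition(lhs, rhs, ok_view) << "\n";
}
```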
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvement:
  i. improved view support on pointwise and transpose scheduler
  ii. grouped grid welford added for better outer-norm grid persistence in normalization
- misc:
  i. new composite ops added: variance_mean, arange
  ii. fixes misaligned address for transpose scheduler
  iii. refactor on separation of compilation API from execution API to prepare us for async compilation
  iv. double type support on expression evaluator
  v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits that are in this PR from the devel branch:

```
89330aa Tensor factories must set the output shape as its input (#1939)
b2fd01e arange support (#1933)
56c00fd Double support on all expression evaluators (#1937)
371f282 Improve trivial reduction merge support (#1931)
1d0c267 Test `rand` in a fusion with zero tensor input (#1932)
0dab160 Fix softmax bwd sizes. (#1890)
ef98f36 Fix a bug (#1936)
63132a0 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c8 Map IterationDomains through view operations. (#1919)
c0a187a do not use deprecated functions (#1935)
88de85e Upstream cherry pick fixes 0811 (#1934)
b247dcf Separate kernel compilation API from kernel execution API (#1914)
b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6 Nullary RNGOp (#1892)
3c3c89e Misc fixes/tuning for transpose scheduler (#1912)
20cf109 Grouped grid welford (#1921)
6cf7eb0 Transpose scheduler small dim sizes better support (#1910)
9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f Fix CUDA driver error: misaligned address for transpose scheduler (#1918)
3fb3d80 Add variance_mean function using Welford (#1907)
98febf6 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a dopt is only available since nvrtc 11.7 (#1915)
2ec8fc7 Kill computeAtBetween (#1911)
d0d106a Improve view support on pointwise and transpose scheduler (#1906)
e71e1ec Fix name clash of RNG with shared memory (#1904)
3381793 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)

Pull Request resolved: pytorch#84626
Approved by: https://github.com/malfet
getInputsOutputsWithInnerDim is currently unable to calculate the inner-most dim of consumers of a view (schedulePointwise does not vectorize tensors correctly when there is a view, #1905). This PR also reworks the getInputsOutputsWithInnerDim-based group-finding in the transpose scheduler; the new approach is better than the current one because it can correctly map the innermost dim through view.
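As a rough sketch of why the innermost dimension has to be mapped through the view rather than compared by position (the `Origin` map below is an illustrative stand-in, not nvfuser's actual data structure):

```cpp
#include <iostream>
#include <map>
#include <string>

// view: producer [N, HW] -> consumer [N, H, W], where HW is split into (H, W).
// To decide whether the consumer shares its innermost dimension with the
// producer, the consumer's innermost root domain is traced back through the
// view's split/merge history instead of comparing last positions directly.
int main() {
  struct Origin {
    std::string producer_id; // producer root domain this consumer domain came from
    bool is_inner;           // whether it is the inner output of the transform
  };
  std::map<std::string, Origin> view_map = {
      {"N", {"N", true}},   // untouched by the reshape; maps to itself
      {"H", {"HW", false}}, // outer of the split
      {"W", {"HW", true}},  // inner of the split
  };

  std::string consumer_innermost = "W";
  std::string producer_innermost = "HW";

  const Origin& o = view_map.at(consumer_innermost);
  bool shares_inner = (o.producer_id == producer_innermost) && o.is_inner;
  std::cout << "consumer innermost maps to producer innermost: "
            << std::boolalpha << shares_inner << "\n";
}
```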