
Propagate permissive mapping information into indexing pass #1929

Merged
shmsong merged 6 commits into devel from local_index_expr on Aug 26, 2022

Conversation


@shmsong shmsong commented Aug 24, 2022

fixes #1925
fixes #1873

The indexing path today uses permissive mapping only once, to find the indexing information to the right of the CA axes of the consumer tensors. This limits how information can propagate once the compute-at position is set.

This PR generalizes that step and propagates the "locally computed index" through permissive mapping.

More details:

We have had a permissive binding step when constructing loop domains since reference replay:
https://github.com/csarofeen/pytorch/blob/devel/torch/csrc/jit/codegen/cuda/lower_index_compute.cpp#L716-L721

IIUC, when we do the concrete index expression traversal, we prioritize iterdomains coming from the outer positions to resolve broadcasts, see:
https://github.com/csarofeen/pytorch/blob/devel/torch/csrc/jit/codegen/cuda/lower_index_compute.cpp#L597-L600

Along that path we need to bind index math from inner dimensions that are not shared loops, and today, in the case where a broadcast was merged, this binding can only be done through permissive mapping.

The current permissive binding step was enough when computeAt was the final scheduling step, which ensured there was no "local" indexing computation that had to be done before permissive binding.

Today's new transform propagation offers more flexibility here, so permissive binding now needs to happen not only at the loop domain construction step but also at intermediate nodes.
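
To illustrate the idea, here is a toy model (not nvfuser's actual IndexCompute classes or data structures; all names are made up): "binding through the permissive map" simply means copying an already-computed index expression onto every domain the source domain permissively maps to, so a later traversal rooted at the loop domains can find it.

```
#include <iostream>
#include <map>
#include <set>
#include <string>

// Toy stand-ins: domains are plain strings, index expressions are strings.
using Domain = std::string;

struct ToyIndexCompute {
  std::map<Domain, std::string> index_map;            // domain -> index expression
  std::map<Domain, std::set<Domain>> permissive_map;  // domain -> permissively mapped domains

  // The gist of the PR in miniature: once a domain has a "locally computed"
  // index, also bind that index to every domain it permissively maps to.
  void bindThroughPermissiveMap(const Domain& d) {
    auto it = index_map.find(d);
    if (it == index_map.end()) {
      return;
    }
    for (const auto& mapped : permissive_map[d]) {
      index_map.emplace(mapped, it->second);  // emplace: never overwrite an exact binding
    }
  }
};

int main() {
  ToyIndexCompute ic;
  // Names borrowed from the example below: iS15 gets an index computed from
  // T2's own inner loops and is permissively mapped to iS13 / iS17.
  ic.index_map["iS15"] = "i18 * 8 + i19";
  ic.permissive_map["iS15"] = {"iS13", "iS17"};
  ic.bindThroughPermissiveMap("iS15");
  std::cout << ic.index_map["iS13"] << std::endl;  // prints: i18 * 8 + i19
  return 0;
}
```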

As an example (from FusionInlineBroadcastIndexing0):

```
%kernel {
T2_l[ iS14{( ceilDiv(i1, 32) )}, iS18{( ceilDiv(32, 8) )}, iS19{8} ] ca_pos( 1 )
   = T0_g[ iS16{( ceilDiv(i1, 32) )}, iS17{32} ];
T3_l[ iS12{( ceilDiv(( 1 * i1 ), 32) )}, iS13{32} ] ca_pos( 1 ) produce_pos( 1)
   = broadcast( T2_l[ iS14{( ceilDiv(i1, 32) )}, iS18{( ceilDiv(32, 8) )}, iS19{8} ] ca_pos( 1 ) )
T4_g[ iS9{( ceilDiv(( i2 * i1 ), 32) )}, iS10{32} ] produce_pos( 1)
   = T3_l[ iS12{( ceilDiv(( 1 * i1 ), 32) )}, iS13{32} ] ca_pos( 1 ) produce_pos( 1)
   + T1_g[ iS1{i2}, iS2{i3} ];

TransformPrinter : 
T0_g[ iS16{( ceilDiv(i1, 32) )}, iS17{32} ]
 root domain : (iS0{i1})
  Split: iS0{i1} by factor 32 -> iS16{( ceilDiv(i1, 32) )}, iS17{32}, start offset: 0, stop offset: 0
T2_l[ iS14{( ceilDiv(i1, 32) )}, iS18{( ceilDiv(32, 8) )}, iS19{8} ] ca_pos( 1 )
 root domain : (iS3{i1})
  Split: iS3{i1} by factor 32 -> iS14{( ceilDiv(i1, 32) )}, iS15{32}, start offset: 0, stop offset: 0
  Split: iS15{32} by factor 8 -> iS18{( ceilDiv(32, 8) )}, iS19{8}, start offset: 0, stop offset: 0
T3_l[ iS12{( ceilDiv(( 1 * i1 ), 32) )}, iS13{32} ] ca_pos( 1 ) produce_pos( 1)
 root domain : (bS4{1},iS5{i1})
  Merge: bS4{1} and iS5{i1} -> iS11{( 1 * i1 )}
  Split: iS11{( 1 * i1 )} by factor 32 -> iS12{( ceilDiv(( 1 * i1 ), 32) )}, iS13{32}, start offset: 0, stop offset: 0
T1_g[ iS1{i2}, iS2{i3} ]
 root domain : (iS1{i2},iS2{i3})
T4_g[ iS9{( ceilDiv(( i2 * i1 ), 32) )}, iS10{32} ] produce_pos( 1)
 root domain : (iS6{i2},iS7{i1})
  Merge: iS6{i2} and iS7{i1} -> iS8{( i2 * i1 )}
  Split: iS8{( i2 * i1 )} by factor 32 -> iS9{( ceilDiv(( i2 * i1 ), 32) )}, iS10{32}, start offset: 0, stop offset: 0
}
```
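
For orientation, here is a rough sketch of how a fusion with this shape could be put together with the usual nvfuser test helpers (makeSymbolicTensor, set, broadcast, add). This is an illustration only, not the verbatim FusionInlineBroadcastIndexing0 test: the real test presumably relies on transform propagation rather than the per-tensor calls spelled out here, and the inlining call is not shown.

```
Fusion fusion;
FusionGuard fg(&fusion);

auto tv0 = makeSymbolicTensor(1);  // T0_g, root iS0{i1}
auto tv1 = makeSymbolicTensor(2);  // T1_g, root iS1{i2}, iS2{i3}
fusion.addInput(tv0);
fusion.addInput(tv1);

auto tv2 = set(tv0);                       // T2_l
auto tv3 = broadcast(tv2, {true, false});  // T3_l, root bS4{1}, iS5{i1}
auto tv4 = add(tv3, tv1);                  // T4_g
fusion.addOutput(tv4);

// Transforms written out per tensor to mirror the TransformPrinter dump above:
tv4->merge(0);      // iS6{i2} x iS7{i1} -> iS8{i2 * i1}
tv4->split(0, 32);  // iS8 -> iS9, iS10
tv3->merge(0);      // bS4{1} x iS5{i1} -> iS11{1 * i1}
tv3->split(0, 32);  // iS11 -> iS12, iS13
tv2->split(0, 32);  // iS3{i1} -> iS14, iS15{32}
tv2->split(1, 8);   // iS15{32} -> iS18{ceilDiv(32, 8)}, iS19{8}
tv0->split(0, 32);  // iS0{i1} -> iS16, iS17{32}

// T2/T3 are then inlined at position 1, which produces the ca_pos(1) /
// produce_pos(1) markers in the printout above.
```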

This is a case today's permissive binding can already handle:
When we try to index T3_l, traversing from iS9{( ceilDiv(( i2 * i1 ), 32) )} as the loop concrete id, we'd need to permissively bind the index of iS13{32}, which corresponds to the indexing math that is "local" to T3_l (because ca_pos == 1).

This is the case the current PR fixes:
When we try to index T2_l, traversing from iS9{( ceilDiv(( i2 * i1 ), 32) )} as the loop concrete id, we'd need to permissively bind the index of iS15{32} for the same reason as iS13{32}. But iS15{32} is not a leaf node, so we'd first need to resolve the "local" indexing math from iS18{( ceilDiv(32, 8) )} and iS19{8} to iS15{32}, and then permissively bind the result to iS15{32}.

(hope this makes some sense ... : ) ).
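
To spell out that "local" index math with numbers (illustrative arithmetic only, not nvfuser's IndexCompute; splitBackward is a made-up helper): a Split of X by factor F into (Xo, Xi) implies index(X) = index(Xo) * F + index(Xi), so iS15's index is recovered from T2's own leaf loops iS18 and iS19 and can then be permissively bound.

```
#include <cstdint>
#include <iostream>

// Backward index rule implied by a Split (illustrative only):
// Split(X, factor) -> (Xo, Xi)  =>  index(X) = index(Xo) * factor + index(Xi)
int64_t splitBackward(int64_t outer_idx, int64_t inner_idx, int64_t factor) {
  return outer_idx * factor + inner_idx;
}

int main() {
  // T2_l's loops to the right of its ca_pos(1): iS18{ceilDiv(32, 8)} and iS19{8}.
  int64_t i_iS18 = 2;
  int64_t i_iS19 = 5;

  // "Local" indexing math from iS18/iS19 back to iS15{32}: 2 * 8 + 5 = 21.
  int64_t i_iS15 = splitBackward(i_iS18, i_iS19, /*factor=*/8);

  // It is this value that the PR permissively binds to the mapped domains
  // (e.g. iS13{32} / iS17{32}) so the traversal rooted at iS9 can reuse it.
  std::cout << "index(iS15) = " << i_iS15 << std::endl;
  return 0;
}
```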

@shmsong shmsong changed the title WIP: Propagate permissive mapping information into indexing pass Propagate permissive mapping information into indexing pass Aug 25, 2022
@shmsong shmsong requested a review from csarofeen August 25, 2022 06:04
@csarofeen csarofeen (Owner) left a comment

Seems strictly better, though a bit scary. Approving.

@@ -669,13 +671,80 @@ void IndexCompute::run(const LoopIndexing& loop_indexing) {
}
}

// Resolve the index vals that could be resolved with only
csarofeen (Owner):
Is this supposed to be that could only be resolved with the loops...?

csarofeen (Owner):
What you're saying is you're resolving anything you can with the loops that consumer_tv doesn't share
What I'm saying is you're resolving loops that cannot be resolved without the loops that consumer_tv doesn't share with its consumers

shmsong (Author):
Yes. Both only with and with only are true here. Will think about formalizing in follow ups.

}
}

void IndexCompute::updateIndexMapFromPermissiveMap(const Expr* id_expr) {
csarofeen (Owner):
Scary, but seemingly better than what exists today.

@shmsong shmsong merged commit 63132a0 into devel Aug 26, 2022
@shmsong shmsong deleted the local_index_expr branch August 26, 2022 19:16
@@ -36,6 +37,10 @@ class NVFuserTest : public ::testing::Test {
GTEST_SKIP() << "skipping tests on pre-PASCAL GPUs";
}
}

void TearDown() override {
c10::cuda::CUDACachingAllocator::emptyCache();
Collaborator:
Why do we need to do this every time a test is done? Does it involve cudaFree? If so, wouldn't this make running the tests slower?

@naoyam naoyam mentioned this pull request Sep 28, 2022
jjsjann123 added a commit that referenced this pull request Oct 2, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvement:
i. improved view support on pointwise and transpose scheduler
ii. grouped grid welford added for better outer-norm grid persistence in normalization

- misc:
i. new composite ops added: variance_mean, arange
ii. fixes misaligned address for transpose scheduler
iii. refactor on separation of compilation API from execution API to prepare us for async compilation
iv. double type support on expression evaluator
v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits in this PR from the devel branch:
```
89330aa Tensor factories must set the output shape as its input (#1939)
b2fd01e arange support (#1933)
56c00fd Double support on all expression evaluators (#1937)
371f282 Improve trivial reduction merge support (#1931)
1d0c267 Test `rand` in a fusion with zero tensor input (#1932)
0dab160 Fix softmax bwd sizes. (#1890)
ef98f36 Fix a bug (#1936)
63132a0 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c8 Map IterationDomains through view operations. (#1919)
c0a187a do not use deprecated functions (#1935)
88de85e Upstream cherry pick fixes 0811 (#1934)
b247dcf Separate kernel compilation API from kernel execution API (#1914)
b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6 Nullary RNGOp (#1892)
3c3c89e Misc fixes/tuning for transpose scheduler (#1912)
20cf109 Grouped grid welford (#1921)
6cf7eb0 Transpose scheduler small dim sizes better support (#1910)
9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f Fix CUDA driver error: misaligned address for transpose scheduler  (#1918)
3fb3d80 Add variance_mean function using Welford (#1907)
98febf6 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a dopt is only available since nvrtc 11.7 (#1915)
2ec8fc7 Kill computeAtBetween (#1911)
d0d106a Improve view support on pointwise and transpose scheduler (#1906)
e71e1ec Fix name clash of RNG with shared memory (#1904)
3381793 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: pytorch#84626
Approved by: https://github.com/malfet