Add extra configurability to parallelizeAllLike #1831

Merged: 7 commits into devel on Jul 16, 2022
Conversation

@zasdfgbnm (Collaborator) commented Jul 15, 2022:

Fixes #1828 #1836

New interface:

```cpp
TORCH_CUDA_CU_API void parallelizeAllLike(
    TensorView* reference_tv,
    int64_t pos = -1,
    std::unordered_set<TensorView*> selected_tvs = {}, // empty means all tvs
    const std::unordered_set<ParallelType>& selected_parallel_types =
        {}, // empty means all parallel types
    bool propagate_padding = true);
```

Also added `allTvsExcept` and `allParallelTypesExcept`.

@zasdfgbnm zasdfgbnm requested a review from naoyam July 15, 2022 06:24
@zasdfgbnm zasdfgbnm marked this pull request as draft July 15, 2022 20:02
@zasdfgbnm (Collaborator, Author) commented:
Marking this as WIP; I will add more features as described in #1836.

@zasdfgbnm zasdfgbnm changed the title Restrict parallelizeAllLike to selected TVs Add extra configurability to parallelizeAllLike Jul 15, 2022
@zasdfgbnm zasdfgbnm marked this pull request as ready for review July 15, 2022 23:09
@zasdfgbnm zasdfgbnm requested a review from shmsong July 15, 2022 23:09
Comment on lines 831 to 844
```diff
-  // Going to move inputs to consumers of inputs, need a copy as we'll modify
-  // the original.
-  {
-    auto vectorized_tvs_copy = vectorized_tvs;
-    for (auto inp : vectorized_tvs_copy) {
-      if (!inp->isFusionInput()) {
-        continue;
-      }
-      vectorized_tvs.erase(
-          std::find(vectorized_tvs.begin(), vectorized_tvs.end(), inp));
-      auto consumer_tvs = ir_utils::consumerTvsOf(inp);
-      vectorized_tvs.insert(
-          vectorized_tvs.end(), consumer_tvs.begin(), consumer_tvs.end());
-    }
-  }
+  std::unordered_set<TensorView*> vectorized_tvs;
+  for (auto tv : inputs_outputs) {
+    if (!tv->isFusionInput()) {
+      vectorized_tvs.emplace(tv);
+      continue;
+    }
+    // move inputs to consumers of inputs
+    auto consumer_tvs = ir_utils::consumerTvsOf(tv);
+    vectorized_tvs.insert(consumer_tvs.begin(), consumer_tvs.end());
+  }
-  // Clear vectorize on tensors that shouldn't have it
-  for (auto tv : all_tvs) {
-    if (std::find(vectorized_tvs.begin(), vectorized_tvs.end(), tv) ==
-        vectorized_tvs.end()) {
-      for (auto id : tv->domain()->domain()) {
-        if (id->getParallelType() == ParallelType::Vectorize) {
-          id->parallelize(ParallelType::Serial);
-        }
-      }
-    }
-  }
   vectorize_id->parallelize(ParallelType::Vectorize);
+  scheduler_utils::parallelizeAllLike(
+      reference_tv, vectorized_tvs, {ParallelType::Vectorize});
+  if (vectorized_tvs.count(reference_tv) == 0) {
+    vectorize_id->parallelize(ParallelType::Serial);
+  }
```
@zasdfgbnm (Collaborator, Author) commented:
This is just a cleanup of the code using the new functionality. There should be no behavioral change.

@@ -51,7 +51,24 @@ size_t mergeNonReduction(

```diff
 TORCH_CUDA_CU_API void parallelizeAllLike(
     TensorView* reference_tv,
-    const std::vector<TensorView*>& all_tvs);
+    int64_t pos = -1,
```
@naoyam (Collaborator) commented Jul 16, 2022:
Please add a comment. Propagation is done only for the first pos domains, right?

@zasdfgbnm (Collaborator, Author) replied:
Added a comment on top of `parallelizeAllLike`. And you are right: `pos` selects the first `pos` IDs.

@naoyam (Collaborator) approved:

LGTM

@zasdfgbnm zasdfgbnm merged commit 2c9a6c0 into devel Jul 16, 2022
@zasdfgbnm zasdfgbnm deleted the parallelizeAllLike branch July 16, 2022 01:17
zasdfgbnm added a commit that referenced this pull request Jul 17, 2022
csarofeen pushed a commit that referenced this pull request Jul 18, 2022
jjsjann123 added a commit that referenced this pull request Aug 29, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- codegen improvements:
  1. removes unnecessary syncs from redundant thread compute analysis
  2. symmetric API for BestEffortReplay
  3. support merge on trivial reductions
  4. Ampere async copy improvements
- bug fixes:
  1. vectorization bug fixes
  2. type inference patch: fixes upstream pytorch#81725
  3. segmenter bug fix with deterministic iteration ordering
- parser updates:
  1. added leaky_relu
- scheduler:
  1. normalization scheduler cleanup
  2. simplified matmul scheduling with the new transform propagator
  3. merge all dimensions in the pointwise scheduler
  4. various gemm-related improvements
- debuggability:
  1. nsight compute support
  2. debug dump for InlinePropagator
  3. added `UnaryOpType::Print`

Commits were squashed to work around the GitHub API. Commits actually in this PR from the devel branch:

```
dfe02f3 Merge remote-tracking branch 'csarofeen/devel' into HEAD
1617373 Add `TensorViewBuilder::shape(std::vector<Val*> shape)` (#1884)
7cfb779 Merge pull request #1887 from csarofeen/upstream_merge_0803
3399f6d Merge remote-tracking branch 'origin/viable/strict' into HEAD
01208f5 Add `UnaryOpType::Print` which can be helpful for debugging (#1878)
0646522 Remove redundant TORCH_INTERNAL_ASSERT in lower_magic_zero.cpp (#1881)
7bc76aa Fix most inlined propagator for mismatched dims (#1875)
501f4aa Nonaffine swizzle formulation ep.2: Loop swizzle variant. (#1826)
d863d69 Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load (#1827)
e0ae11a Larger sized mma instructions to support full vectorization (#1824)
9bb4cf7 fragment iteration to support fully unrolled mma ops (#1823)
a48270a Merge all dims in pointwise scheduler (#1872)
172fb36 Make MostInlined and BestEffort inline propagation no longer assert replayed (#1868)
a64462a Allow trivial reduction to be merged (#1871)
440102b Symmetric API for BestEffortReplay (#1870)
d1caf33 Some misc cleanups/refactor split out from #1854 (#1867)
1013eda Remove some welford specific logic. (#1864)
51589d3 Some cleanups on tests and heuristics params (#1866)
a6b3e70 Segmenter bug fix, and deterministic iteration ordering.  (#1865)
1b665b9 Add nullptr checks to IrBuilder (#1861)
1cd9451 Simplify matmul scheduling with the new transform propagator.  (#1817)
bbc1fb9 Add leaky_relu operation (#1852)
e842a9b Minor cleanup in pointwise scheduler (#1858)
9ee850c Fix stringstream usage (#1857)
20a36c1 Improve nsight compute support (#1855)
4059103 Remove debugging `true ||` from getPointwiseHeuristics (#1822)
01117bf Misc cleanup (#1853)
5cc6494 Apply the magic-zero protection to each indexed domain individually for predicate indexing (#1846)
92e6f02 Cleanup normalization scheduler (#1845)
db89c65 Type inference patch (#1848)
102fe93 Add debug dump for InlinePropagator (#1847)
b7a4d93 Redundant thread compute analysis to avoid un-necessary sync insertion (#1687)
942be5b Upstream ci build fixes (#1842)
0b83645 Fix vectorization bug introduced in #1831 (#1840)
63630f1 Move MaxProducerPosUpdater into InlinePropagator::tearDown (#1825)
9135a96 Fix transpose benchmark dtype (#1839)
2c9a6c0 Add extra configurability to `parallelizeAllLike` (#1831)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38543000](https://our.internmc.facebook.com/intern/diff/D38543000)
Pull Request resolved: pytorch#83067
Approved by: https://github.com/davidberard98
Successfully merging this pull request may close these issues.

scheduler_utils::parallelizeAllLike is changing parallelization that it should not change