
Some misc cleanups/refactor split out from #1854 (#1867)

Merged: 12 commits into devel from refactor-pointwise, Jul 27, 2022

Conversation

zasdfgbnm (Collaborator)

Split out from #1854

  • The InlinePropagatorSelector seems to be less generally useful than BoundedPropagationSelector, so I made InlinePropagatorSelector a private class of compute_at.cpp and renamed it to ComputeAtSelector. I also moved BoundedPropagationSelector to maxinfo_propagator.h and renamed it to SetSelector.
  • Split DomainMap from pointwise.cpp into pointwise_utils.cpp, and renamed some functions.
  • Add two cache entries, DOMAIN_MAP and REFERENCE_TENSORS, and use them in the pointwise scheduler (a sketch of the caching pattern follows this list).
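
Below is a minimal sketch of the compile-time caching pattern these entries follow, assuming a generic type-erased cache; the names HeuristicCache, EntryType, and getOrCompute are illustrative placeholders, not the actual nvFuser API. The shape mirrors the real usage quoted later in this thread, where a factory lambda builds the cached value and `entry.get()` reads it back.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Illustrative names only: a compile-time entry is built once by a factory
// lambda, then reused on later heuristic lookups.
enum class EntryType { DOMAIN_MAP, REFERENCE_TENSORS };

class HeuristicCache {
 public:
  template <typename T>
  T& getOrCompute(EntryType type, std::function<std::unique_ptr<T>()> factory) {
    auto it = entries_.find(type);
    if (it == entries_.end()) {
      // First request: run the factory and store the result type-erased.
      it = entries_.emplace(type, std::shared_ptr<void>(factory())).first;
    }
    return *static_cast<T*>(it->second.get());
  }

 private:
  std::map<EntryType, std::shared_ptr<void>> entries_;
};

int main() {
  HeuristicCache cache;
  // First call builds the vector; repeated calls return the cached one.
  auto& refs = cache.getOrCompute<std::vector<std::string>>(
      EntryType::REFERENCE_TENSORS, [] {
        auto v = std::make_unique<std::vector<std::string>>();
        v->push_back("reference_tv");
        return v;
      });
  assert(refs.size() == 1);
  (void)refs;
  return 0;
}
```

The point of caching here is that expensive analyses such as building the domain map run once per fusion rather than on every heuristic lookup.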

}

// Determine if all IterDomains in input are mapped to output
bool DomainMap::areAllInputIdsMappedToOutput(
Collaborator Author

Renamed from areAllMapped, see #1854 (comment)

@zasdfgbnm zasdfgbnm mentioned this pull request Jul 26, 2022
Comment on lines +58 to +60
// Currently this function only allows having one view on the path from input to
// output. If there are multiple views, the pointwise scheduler will likely
// reject the fusion because we cannot correctly find a reference tensor.
Collaborator Author

This comment is newly added, see #1854 (comment)

// Currently this function only allows having one view on the path from input to
// output. If there are multiple views, the pointwise scheduler will likely
// reject the fusion because we cannot correctly find a reference tensor.
void DomainMap::eraseIfInputMappedThroughViewToOutput(
Collaborator Author

Renamed from eraseIfMappedThroughView

class DomainMap {
public:
DomainMap(Fusion* fusion);
virtual ~DomainMap() = default;
Reviewer

This would be a very useful cache entry, and I expect it to be useful in many upcoming scheduler variants.

Reviewer

Also, just wondering: why does this need to be virtual? Meanwhile, could you add an interface function that exposes the underlying ca_map_? That'd be very helpful in many scenarios.

Owner

The compute-at map seems a bit dangerous to expose directly, as it will become out of date during scheduling. Though if we need the unscheduled ca_map_, that could make sense.

Reviewer

Yes, I was hoping that this would cover the unscheduled ca_map usage for the heuristic cache entry. If this domain map is also used in the scheduling phase, it'd need to be guarded.

Reviewer

We could just revisit this when other schedulers need to use ca_map_ as well.

Collaborator Author

I think data caching is not available in the scheduling phase, so whenever you need it in scheduling, you rebuild it.

Collaborator Author

Added a

  const ComputeAtMap &getComputeAtMap() const {
    return ca_map_;
  }

Collaborator Author

> Also just wondering, why does this need to be virtual?

I would expect different schedulers to subclass this but use the same cache entry for caching. For this to work, I would need dynamic_cast, which requires the base class to be polymorphic.
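
A small self-contained sketch of that point, with made-up class names (DomainMapBase, PointwiseDomainMap) standing in for the real types: dynamic_cast from a base pointer only works if the base class is polymorphic, which the virtual destructor provides.

```cpp
#include <cassert>
#include <memory>

struct DomainMapBase {
  virtual ~DomainMapBase() = default;  // makes the type polymorphic
};

struct PointwiseDomainMap : DomainMapBase {
  int reference_pos = 0;
};

int main() {
  // A cache that only knows about the base type hands back DomainMapBase*.
  std::unique_ptr<DomainMapBase> cached = std::make_unique<PointwiseDomainMap>();

  // The scheduler recovers its concrete subclass via dynamic_cast; this would
  // be ill-formed if DomainMapBase had no virtual members.
  auto* pw = dynamic_cast<PointwiseDomainMap*>(cached.get());
  assert(pw != nullptr);
  pw->reference_pos = 2;
  return 0;
}
```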

@@ -24,6 +25,8 @@ namespace HeuristicCompileTime {

//! Enum for all possible types of cached entries of compile-time info.
enum class CompileTimeEntryType {
DOMAIN_MAP,
REFERENCE_TENSORS,
Reviewer

Would appreciate more specific naming for the new entries.

DOMAIN_MAP is probably OK if it can expose both ca_map and root_map to all schedulers.

REFERENCE_TENSORS sounds quite likely to have a naming collision with other schedulers.

Owner

Seems fine to me for now, we should mark this as a todo.

Collaborator Author

REFERENCE_TENSORS is a vector of tensors, and it is intended to be shared by many schedulers. For example, the pointwise scheduler can cache {reference_tv} while the transpose scheduler can cache {reference1, reference2}.
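
As a purely illustrative sketch (TensorView below is a toy stand-in for the real class), the shared entry is just a vector whose length each scheduler decides for itself:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for the real TensorView class; only the shape of the cached
// data matters here.
struct TensorView {
  std::string name;
};

using ReferenceTensors = std::vector<TensorView*>;

int main() {
  TensorView pointwise_ref{"reference_tv"};
  TensorView transpose_ref1{"reference1"};
  TensorView transpose_ref2{"reference2"};

  // The pointwise scheduler caches a single reference tensor...
  ReferenceTensors pointwise_entry{&pointwise_ref};
  // ...while a transpose scheduler can cache two under the same entry type.
  ReferenceTensors transpose_entry{&transpose_ref1, &transpose_ref2};

  assert(pointwise_entry.size() == 1 && transpose_entry.size() == 2);
  return 0;
}
```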

std::vector<TensorView*> data{domain_map.findReferenceTensorView()};
return std::make_unique<std::vector<TensorView*>>(std::move(data));
});
TensorView* largest_out = largest_out_entry.get()[0];

Reviewer

Thanks for caching these entries here. This path looks lightweight enough to me now.

@csarofeen (Owner) left a comment

This seems good enough to me as is; it's definitely an improvement on the current state. I think we should take it so we can keep moving forward with the transpose scheduler.

@zasdfgbnm zasdfgbnm merged commit d1caf33 into devel Jul 27, 2022
@zasdfgbnm zasdfgbnm deleted the refactor-pointwise branch July 27, 2022 18:19
jjsjann123 added a commit that referenced this pull request Aug 29, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- codegen improvements:
  1. removes unnecessary sync from redundant thread compute analysis
  2. symmetric API for BestEffortReplay
  3. support merge on trivial reductions
  4. Ampere async copy improvements
- bug fixes:
  1. vectorization bug fixes
  2. type inference patch: fixes upstream pytorch#81725
  3. segmenter bug fix with deterministic iteration ordering
- parser update
  1. added leaky_relu
- scheduler
  1. normalization scheduler clean up.
  2. simplifies matmul scheduling with new transform propagator
  3. merge all dimensions in PW scheduler
  4. various gemm related improvements
- debuggability
  1. nsight compute support
  2. debug dump for InlinePropagator
  3. Add `UnaryOpType::Print`

Squashed commits to WAR the GitHub API.
Commits that are actually in this PR from the devel branch:

```
dfe02f3 Merge remote-tracking branch 'csarofeen/devel' into HEAD
1617373 Add `TensorViewBuilder::shape(std::vector<Val*> shape)` (#1884)
7cfb779 Merge pull request #1887 from csarofeen/upstream_merge_0803
3399f6d Merge remote-tracking branch 'origin/viable/strict' into HEAD
01208f5 Add `UnaryOpType::Print` which can be helpful for debugging (#1878)
0646522 Remove redundant TORCH_INTERNAL_ASSERT in lower_magic_zero.cpp (#1881)
7bc76aa Fix most inlined propagator for mismatched dims (#1875)
501f4aa Nonaffine swizzle formulation ep.2: Loop swizzle variant. (#1826)
d863d69 Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load (#1827)
e0ae11a Larger sized mma instructions to support full vectorization (#1824)
9bb4cf7 fragment iteration to support fully unrolled mma ops (#1823)
a48270a Merge all dims in pointwise scheduler (#1872)
172fb36 Make MostInlined and BestEffort inline propagation no longer assert replayed (#1868)
a64462a Allow trivial reduction to be merged (#1871)
440102b Symmetric API for BestEffortReplay (#1870)
d1caf33 Some misc cleanups/refactor split out from #1854 (#1867)
1013eda Remove some welford specific logic. (#1864)
51589d3 Some cleanups on tests and heuristics params (#1866)
a6b3e70 Segmenter bug fix, and deterministic iteration ordering.  (#1865)
1b665b9 Add nullptr checks to IrBuilder (#1861)
1cd9451 Simplify matmul scheduling with the new transform propagator.  (#1817)
bbc1fb9 Add leaky_relu operation (#1852)
e842a9b Minor cleanup in pointwise scheduler (#1858)
9ee850c Fix stringstream usage (#1857)
20a36c1 Improve nsight compute support (#1855)
4059103 Remove debugging `true ||` from getPointwiseHeuristics (#1822)
01117bf Misc cleanup (#1853)
5cc6494 Apply the magic-zero protection to each indexed domain individually for predicate indexing (#1846)
92e6f02 Cleanup normalization scheduler (#1845)
db89c65 Type inference patch (#1848)
102fe93 Add debug dump for InlinePropagator (#1847)
b7a4d93 Redundant thread compute analysis to avoid un-necessary sync insertion (#1687)
942be5b Upstream ci build fixes (#1842)
0b83645 Fix vectorization bug introduced in #1831 (#1840)
63630f1 Move MaxProducerPosUpdater into InlinePropagator::tearDown (#1825)
9135a96 Fix transpose benchmark dtype (#1839)
2c9a6c0 Add extra configurability to `parallelizeAllLike` (#1831)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38543000](https://our.internmc.facebook.com/intern/diff/D38543000)
Pull Request resolved: pytorch#83067
Approved by: https://github.com/davidberard98