Full indexing rework #260

csarofeen · 2020-08-01T15:50:28Z

Doing a lot of indexing rework.

Loop construction can be wrong when a TV doesn't have a dimension that its ComputeAt TV has.
Loop construction can be wrong when a ComputeAt TV merged an axis that another TV doesn't have that we're indexing into.
Unroll predicate can be wrong if consumers don't have the same broadcast patterns when they're in the same unroll loop. (Unsure whether this is still true; it may already be fixed.)
smem and lmem tensors are indexed incorrectly if their access isn't exactly the same when they're a consumer as when they're a producer.
Collapsing needs a slight modification: I believe a merge is contiguous if all merges below it are contiguous. Checking only that merges are on ordered root domains can hit this issue:

TV0 = makeTensor(nDims=3)
TV0->merge(0, 2)
TV0->merge(0, 1)

Right now we'd say this is contiguous, but linear indexing on this is not the same as:

TV0 = makeTensor(nDims=3)
TV0->merge(0, 1)
TV0->merge(0, 1)

The latter should be contiguous merges, the former not.
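
The proposed rule can be sketched outside the fuser. This is a minimal model, not the code in index_compute.cpp: `ToyAxis` and `MergeModel` are hypothetical names, the root dimensions are assumed to be contiguous in memory, and `merge(outer, inner)` is assumed to put the merged axis at `outer` and erase `inner`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an axis: the range of root dimensions it covers
// and whether it was built purely from contiguous merges.
struct ToyAxis {
  std::size_t lo;
  std::size_t hi;
  bool contiguous;
};

// Hypothetical model of a tensor's axes; this is not the fuser's TensorDomain.
class MergeModel {
 public:
  explicit MergeModel(std::size_t n_dims) {
    for (std::size_t i = 0; i < n_dims; ++i) {
      axes_.push_back(ToyAxis{i, i, true});
    }
  }

  // Assumed semantics: the merged axis replaces `outer`; `inner` is erased.
  void merge(std::size_t outer, std::size_t inner) {
    assert(outer < inner && inner < axes_.size());
    const ToyAxis o = axes_[outer];
    const ToyAxis i = axes_[inner];
    ToyAxis merged;
    merged.lo = std::min(o.lo, i.lo);
    merged.hi = std::max(o.hi, i.hi);
    // Contiguous only if both inputs are contiguous and the inner range
    // starts right after the outer range ends.
    merged.contiguous = o.contiguous && i.contiguous && o.hi + 1 == i.lo;
    axes_[outer] = merged;
    axes_.erase(axes_.begin() + static_cast<std::ptrdiff_t>(inner));
  }

  const ToyAxis& axis(std::size_t pos) const { return axes_.at(pos); }

 private:
  std::vector<ToyAxis> axes_;
};

// merge(0, 2) then merge(0, 1): the first example above.
bool formerIsContiguous() {
  MergeModel tv0(3);
  tv0.merge(0, 2);
  tv0.merge(0, 1);
  return tv0.axis(0).contiguous;
}

// merge(0, 1) then merge(0, 1): the second example above.
bool latterIsContiguous() {
  MergeModel tv0(3);
  tv0.merge(0, 1);
  tv0.merge(0, 1);
  return tv0.axis(0).contiguous;
}
```

Under this model `formerIsContiguous()` returns false, because the first merge joins root dimensions 0 and 2 and that taints every merge built on top of it, while `latterIsContiguous()` returns true.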

Clean up unused functions
Clean up interface to tensorview/tensor domain around rfactor
Predicate modification when we have rfactor domain.
Fix disabled tests (python and cpp)

jjsjann123 · 2020-08-10T19:57:10Z

I'm merging #277, which temporarily hides all the problems this PR is supposed to fix. We need to revert/undo those changes when testing the correctness of this PR.

Sorry for the inconvenience.

tlemo · 2020-08-11T22:10:45Z

test/cpp/jit/test_gpu.cpp


-    at::Tensor cg_output = at::empty({x, y, z}, options);
+void testGPU_FusionComplexBCast() {
+  {


can we split this into separate test functions?

Sure, would you please push a commit for it to this PR?

tlemo · 2020-08-11T22:13:11Z

torch/csrc/jit/codegen/cuda/ir_interface_nodes.h

+//
+// The reason we need both TensorView and TensorDomain is that we need to have a
+// record of both what is being computed and how it is being computed. For
+// Example we may have the operation: TV3[I, J, K] = TV2[I, J, K] + TV1[I, J, K]


nit: example

tlemo · 2020-08-11T22:14:30Z

torch/csrc/jit/codegen/cuda/ir_internal_nodes.h

    return rfactor_domain_;
  };
+  // If rfactor domain exists in domain() return it, otherwise return root


nit: add empty line

tlemo · 2020-08-11T22:16:22Z

torch/csrc/jit/codegen/cuda/ir_iostream.cpp

@@ -342,8 +342,18 @@ void IRPrinter::handle(const kir::NamedScalar* i) {
  os << i->name();
 }

-void IRPrinter::handle(const kir::IterDomain*) {
-  TORCH_INTERNAL_ASSERT(false, "Unreachable");
+void IRPrinter::handle(const kir::IterDomain* id) {


where is this needed? I hit this a few times while working on the refactoring, but every time it indicated that something else needed to be updated

Debugging matching between the IterDomain of kir and fusion was pretty challenging when I couldn't print what they were. No remaining code relies on it, but I rely on it for debugging. I had to update Type.cpp

I think we should now be able to fill the rest out marked as Unreachable

tlemo · 2020-08-11T22:17:21Z

torch/csrc/jit/codegen/cuda/iter_visitor.cpp

+}
+
+std::vector<Expr*> UnsortedExprs::getFrom(std::vector<Val*> outputs) {
+  if (outputs.empty())


tlemo · 2020-08-11T22:27:12Z

torch/csrc/jit/codegen/cuda/lower_utils.h

+// the life of this context guard.
+class TVDomainGuard {
+ public:
+  TensorView* tv_;


if public, no need for _ suffix

= nullptr (same for prev_domain)

Should be private, thanks.
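
For reference, a guard shaped the way the review asks (private members, `_` suffix kept, defaulted to `nullptr`) might look like the sketch below. `ToyTensorView`, `GuardedDomain` and `ToyDomainGuard` are hypothetical stand-ins; the real `TVDomainGuard` in lower_utils.h may differ in what it swaps and how.

```cpp
#include <cassert>

// Hypothetical stand-ins for TensorDomain and TensorView.
struct GuardedDomain {
  int id;
};

struct ToyTensorView {
  GuardedDomain* domain = nullptr;
};

// Swaps in a temporary domain for the life of the guard and restores the
// previous one on destruction.
class ToyDomainGuard {
 public:
  ToyDomainGuard(ToyTensorView* tv, GuardedDomain* temporary)
      : tv_(tv), prev_domain_(tv->domain) {
    assert(tv_ != nullptr);
    tv_->domain = temporary;
  }

  ~ToyDomainGuard() {
    tv_->domain = prev_domain_;
  }

  // Copying a guard would restore the domain twice, so forbid it.
  ToyDomainGuard(const ToyDomainGuard&) = delete;
  ToyDomainGuard& operator=(const ToyDomainGuard&) = delete;

 private:
  ToyTensorView* tv_ = nullptr;
  GuardedDomain* prev_domain_ = nullptr;
};

// Returns the id of the domain that is visible while the guard is alive.
int domainIdInsideGuard(ToyTensorView* tv, GuardedDomain* temporary) {
  ToyDomainGuard guard(tv, temporary);
  return tv->domain->id;
}
```

Once `domainIdInsideGuard` returns, the tensor view points at its original domain again.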

tlemo · 2020-08-11T22:30:28Z

torch/csrc/jit/codegen/cuda/predicate_compute.cpp

@@ -25,18 +29,21 @@ std::vector<kir::Bool*> PredicateCompute::computePredicates(
    return {};
  }

+  std::vector<kir::Bool*> preds(root.size(), new kir::Bool(true));


clang-tidy?

I just wanted to pre-initialize before the loop below.
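
Whatever the lint question was about, one property of the fill constructor is worth keeping in mind: the `new` expression is evaluated once, so every slot holds the same pointer. Whether that matters here depends on whether the predicate nodes are ever mutated in place, which this thread doesn't say. A small sketch with a hypothetical `ToyBool` in place of `kir::Bool`:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for kir::Bool; only the pointer semantics matter.
struct ToyBool {
  explicit ToyBool(bool v) : value(v) {}
  bool value;
};

// Fill-constructs n pointers from one object, writes through the first slot,
// and reports whether the write is visible through the last slot.
bool writeThroughFirstSlotReachesLast(std::size_t n) {
  assert(n > 0);
  std::unique_ptr<ToyBool> node(new ToyBool(true));
  std::vector<ToyBool*> preds(n, node.get());  // n copies of one pointer
  preds.front()->value = false;
  return preds.back()->value == false;
}

// Same experiment with one object per slot.
bool writeWithDistinctNodesReachesLast(std::size_t n) {
  assert(n > 1);
  std::vector<std::unique_ptr<ToyBool>> preds;
  for (std::size_t i = 0; i < n; ++i) {
    preds.emplace_back(new ToyBool(true));
  }
  preds.front()->value = false;
  return preds.back()->value == false;
}
```

The first function returns true (all slots alias one node); the second returns false. If each slot is overwritten in the loop that follows, as the reply suggests, the aliasing is harmless.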

tlemo · 2020-08-11T22:30:37Z

torch/csrc/jit/codegen/cuda/predicate_compute.cpp

+    const std::vector<kir::ForLoop*>& loops,
+    kir::Bool* thread_pred) {
+  if (loops.empty())
+    return new kir::Bool(true);


tlemo · 2020-08-11T22:31:27Z

torch/csrc/jit/codegen/cuda/predicate_compute.h

+
+  void openLoop(kir::ForLoop*);
+
+  std::unordered_map<IterDomain*, kir::Bool*> predicates;


private: to separate private methods from private state

They're ordered, that may be the best you get from me.

tlemo · 2020-08-11T22:32:12Z

torch/csrc/jit/codegen/cuda/type.cpp

@@ -64,6 +64,16 @@ static const char* val_type2string(ValType t) {
      return "Scalar";
    case ValType::NamedScalar:
      return "NamedScalar";
+    case ValType::KirIterDomain:


is this really needed?

Yes, for ir_iostream, an IR that isn't printable isn't debug-able.

naoyam · 2020-08-12T00:50:16Z

torch/csrc/jit/codegen/cuda/iter_visitor.h

+// Simply grabs all exprs needed to produce provided outputs, only in dependency
+// order, not computeAtOrder like ExprSort in fusion.h
+class UnsortedExprs : public IterVisitor {


ExprSort doesn't do anything for computeAt anymore as the logic is moved to lower_loops.cpp. Perhaps, we don't need this class and can just use ExprSort.
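
The behavior described in the class comment (collect every expression needed for the given outputs, in dependency order) can be sketched as a post-order walk from the outputs. `ToyVal`, `ToyExpr` and `exprsNeededFor` are hypothetical names; this is not the `IterVisitor`-based implementation of either `UnsortedExprs` or `ExprSort`.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

struct ToyExpr;

// Hypothetical value node: `origin` is the expression that defines it, or
// nullptr for a fusion input.
struct ToyVal {
  ToyExpr* origin = nullptr;
};

// Hypothetical expression node with the values it reads.
struct ToyExpr {
  std::string name;
  std::vector<ToyVal*> inputs;
};

// Post-order visit: an expression is emitted only after all of its producers.
void visitProducers(
    ToyVal* val,
    std::unordered_set<ToyExpr*>& seen,
    std::vector<ToyExpr*>& order) {
  ToyExpr* expr = val->origin;
  if (expr == nullptr || !seen.insert(expr).second) {
    return;  // fusion input, or already emitted
  }
  for (ToyVal* in : expr->inputs) {
    visitProducers(in, seen, order);
  }
  order.push_back(expr);
}

// Returns the expressions needed to produce `outputs`, producers first.
std::vector<ToyExpr*> exprsNeededFor(const std::vector<ToyVal*>& outputs) {
  std::vector<ToyExpr*> order;
  if (outputs.empty()) {
    return order;
  }
  std::unordered_set<ToyExpr*> seen;
  for (ToyVal* out : outputs) {
    assert(out != nullptr);
    visitProducers(out, seen, order);
  }
  return order;
}
```

Expressions that no requested output depends on are never visited, so they do not appear in the result.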

naoyam · 2020-08-12T01:02:55Z

torch/csrc/jit/codegen/cuda/lower_unroll.h

@@ -98,7 +100,6 @@ class TORCH_CUDA_API UnrollPass : public OptOutDispatch {
  static std::vector<Expr*> runPass(
      Fusion* fusion,
      const std::vector<Expr*>& exprs,
-      const std::unordered_set<Expr*>& init_exprs,


This seems to remove a recent fix on predicates for initializing local buffers. See #246 and #255.

We can re-enable that fix without carrying init_exprs. We can detect initialization expressions by the consumer pattern and the current loop nest.

Could you open a new issue and I can do a follow up PR? I believe we still have correctness now.

You're right. The previous failing test still works fine, but I'm not sure whether it's intended behavior. I re-opened the issue (#64).

naoyam

As @csarofeen mentioned, predicate generation for initializing reduction buffers seems broken again. The original fix consists of #246 and #255.

naoyam · 2020-08-12T01:08:08Z

I did a quick review of all changes and left two comments. One is regarding the predicate generation for reduction buffers. Another is the newly added class, UnsortedExprs, which I think is just equivalent to ExprSort. I'll do a more thorough review later.

naoyam

Looks good to me for now. #64 may be an issue again.

csarofeen added 10 commits July 29, 2020 07:41

Add lower validation pass to make sure root broadcast dims aren't split on tensors.

3db0417

Change global producer indexing so it's not based on consumer. Fixes issue mentioned in comments of: getBCastMergedIndices in index_compute.cpp

cfa6596

Post cherry-pick cleanup.

30f1cc3

Merge branch '20_7_6_devel' of https://www.github.com/csarofeen/pytorch into contiguity_cherry_pick

d6e9ecf

Fix loop nest structure for broadcast ops.

ef9cc9f

Merge branch '20_7_6_devel' of https://www.github.com/csarofeen/pytorch into contiguity_cherry_pick

d7002ce

Clang.

19e9913

Move logic out of lower_loops to lower utils, add some utility functions we will use for index_compute.

2d08221

Fix bug in ir build size map when we have strided bcast and mix of const/non-const tensor dims.

7b33820

Rework global indexing. Still need to work on shared/local.

0b015ef

csarofeen mentioned this pull request Aug 1, 2020

[WIP] Indexing on contiguity information #225

Closed

3 tasks

csarofeen and others added 19 commits August 2, 2020 12:02

Fix allocation point for unrolling.

f69cfa0

Add option to iter visitor to stop traversal when a particular node is hit.

27bd56f

Make softmax examples more strictly correct.

5011cb4

Rework indexing, still need to rework predicates.

824c8ee

Split inline predicate generation from unroll predicate generation.

d49569a

Move inline predicate function to predicate_compute.

c829271

Remove thread predicate from unroll loops, error if they exist.

fa03401

Rework unrolling, cpp tests passing with contig and non-contig tensors.

36e6457

Some safety addition, prevent potential circular reference.

26cd9ce

Minor predicate generation fix.

bb458fc

Disable failing tests.

2eea59e

Merge branch '20_7_6_devel' of https://www.github.com/csarofeen/pytorch into contiguity_v3

919dcd9

Merge branch '20_7_6_devel' into contiguity_v3

a2cc44d

quickly fixing tests scripts (looks like legit test failure)

1e1e90a

Minor fix to computeAt.

e3fb971

Remove outdated asserts.

04b60cc

Minor cleanup.

73013c5

Add some more complexity to simple bcast test.

85ed7aa

refactoring

d2fd38d

csarofeen added 4 commits August 10, 2020 12:01

Rework local/smem producer indexing.

cf2ec69

Rework consumer indexing for smem/lmem.

06a82cc

Rework contiguous indexing.

eb9579f

Fix ifdef in tests.

b84189d

csarofeen changed the title from "Reworking indexing v3" to "Full indexing rework" Aug 10, 2020

jjsjann123 mentioned this pull request Aug 10, 2020

For-Loop optimized path's predication appears broken with multiple operations and a broadcast first #273

Closed

csarofeen added 12 commits August 11, 2020 08:17

Remove RangeCompute as it's dead code.

1fe2380

Change inline predicate generation to new indexing method.

68c1f39

Move unroll predicate generation to new indexing method.

689ec00

Cleanup utility functions no longer needed.

92c9749

Cleanup routing of unused variable.

9603a48

Remove explicit tracking of initialization expressions.

1538a9d

Remove dead code in index_compute.

17cf6de

Remove print functions in index_compute.

03b629b

Improve error msg in index compute.

e1ea956

Merge branch '20_7_6_devel' of https://www.github.com/csarofeen/pytorch into contiguity_merged

f7dc7ff

Last minute hot-fix for predicate generation.

07417c5

Clang and Flake.

8b40ef2

tlemo approved these changes Aug 11, 2020

naoyam reviewed Aug 12, 2020

naoyam requested changes Aug 12, 2020

review

688b016

naoyam mentioned this pull request Aug 12, 2020

Temporary array not initialized #64

Closed

naoyam approved these changes Aug 12, 2020

csarofeen merged commit 2dec1a5 into 20_7_6_devel Aug 12, 2020

csarofeen mentioned this pull request Aug 12, 2020

Initialization and predicate generation #288

Merged

csarofeen deleted the contiguity_v3 branch June 9, 2021 13:37
