Refactor lower_alias_memory.cpp #2170

naoyam · 2022-11-09T00:11:00Z

Just refactored the alias analysis pass without change the underlying logic.

Changed the analysis to use the visitor and mutator to reduce duplicated traversal logic. The pass was written before kir::IrVisistor and kir::ExprMutator were added.

Also, previous classes such as BufferUseDefInfo does two separate traversals, which are now two separate classes (ScopeMap and AllocationInfoMap).

There should be no logic change except for the fix of #2117, which was because the aliasing modification pass was done twice when the debug print was enabled. I just dumped all the alias pairs from all the C++ and Python tests, and confirmed every test has the same alias pairs before and after this PR.

I'm skeptical if this aliasing pass could be actually helpful for local tensors as they should be always statically sized and statically indexed, so nvcc should be able to have the same liveness information. We still need to do this for shared memory tensors, though.

This PR will be followed up with a fix for #2163.

torch/csrc/jit/codegen/cuda/kernel_ir.h

torch/csrc/jit/codegen/cuda/lower2device.cpp

torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp

naoyam · 2022-11-09T05:55:12Z

torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp

-  if (debug_print) {
-    AllocateReuseModifier::debugPrint(exprs);
-  }
-  AllocateReuseModifier::modify(exprs);


Not sure why, but previously when the debug print is enabled, the analysis is executed twice. That's actually not supported as the analysis assumes anything already aliased is an output that is aliased to an input (https://github.com/csarofeen/pytorch/pull/2170/files#diff-37be3f6d58243db8255519e2eb487975a21cfe8c030c937ae169e4367f5ef6e1R682)

torch/csrc/jit/codegen/cuda/lower_insert_syncs.cpp

torch/csrc/jit/codegen/cuda/runtime/grid_sync.cu

zasdfgbnm

I have not finished reading yet. Just posting my comments so far. They are minor comments, I don't have any major concerns so far.

torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp

zasdfgbnm · 2022-11-09T23:59:02Z

torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp


 //! Reuse Allocation nodes via pointer aliasing
-class AllocateReuseModifier {
+class ReusableAllocationFinder : private kir::IrVisitor {


Just to check my understanding:

The detection of reusability is based on the line number. Except for checking visibility in the scope, we are just ignoring the loops. So this approach assumes that the allocation in an outer scope is only used by the first iteration of an inner loop? This assumption sounds strange to me. For example, if I have:

float T1[2]; // compute T1[2] for i in range(2): something = T1[i]; // no more read of T1 below this point. float T2[2]; // this will reuse T1??? Is it correct?

See an issue here: #1785, but I am still concerned. I don't think broadcasting resolution is the only case we should worry about. For example, I can have code:

float T1[2]; compute T1[2] for i in range(2): something = T1[i]; // no more read of T1 below this point. float T2[2]; // this will reuse T1 read T2 from fusion input something_else = reduction(T2) fusion_output[index] = something + something_else

I think this case would be covered by isSerialBroadcastResolution, but not 100% sure. Can you create a repro and file an issue? I want to focus just refactoring in this PR. There's one bug I'm fixing in #2174.

Yeah, you didn't change the algorithm in this PR, so this is not a blocking comment. I am just generally worried about the algorithm and will try to think if I can break it.

OK, I think the above case I mentioned is covered in

//! Careful heavy check on inner sharing candidates, //! current enforced conditions are: //! //! 1. The two buffers have producer-consumer relationship

zasdfgbnm

I finished reading. Thanks for refactoring! I only have some minor suggestions, and except for that, it looks good to me.

zasdfgbnm · 2022-11-10T19:31:01Z

torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp

  //! Realize alias of two buffers through inner alias analysis and
  //!  keep track of the re-use.
-  void useInnerAlias(AllocationInfoPtr from, AllocationInfoPtr to) {
+  void useInnerAlias(AllocationInfo* from, AllocationInfo* to) {


nit: I personally feel that from and to are confusing. I don't know who is aliasing who, and I need to look around to figure it out everytime. Would it make sense to reorder the parameter and name it as something like (x, alias_of_x)?

Added comments. Hope it's better now.

Adding comments looks good to me

zasdfgbnm · 2022-11-10T19:44:41Z

torch/csrc/jit/codegen/cuda/lower_alias_memory.cpp

+    // The alias-to Allocate may be replaced with a new
+    // Allocate. Use the new one in that case.


Is this for supporting 2-hop aliasing? I believe we have a comment

// Avoid 2-hop aliasing for simplicity. Can support if really need in // extreme cases.

If we don't need to support 2-hop aliasing, then we can just remove the old2new_ map from this class?

Looks like that shouldn't happen. The previous version of this code modifies kir::Allocate, but we are now replacing the pointers, so the link from AllocationInfo to kir::Allocate may be broken. If we never alias an aliased tensor, that should be fine. I'd rather keep the map and assert that condition. Will update the code.

naoyam added 7 commits November 4, 2022 22:37

Refactoring of lower_alias_memory

0732580

Error msg

dd7d352

WIP: further refactoring using ExprMutaor

6cd8a9d

clang-format

b0ce9dc

cleanup

78eef72

format

664c907

cleanup

d1fa87a

naoyam force-pushed the fix_mem_reuse branch from 4972c14 to d1fa87a Compare November 9, 2022 05:37