[TIR] Add software pipelining #10066
Conversation
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Xiyou Zhou <xiyou@octoml.ai>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
```cpp
  return std::move(load);
}

int GetWmmaFragmentSize(const Buffer& buffer) {
```
Can we decouple this logic from TensorCore? i.e., let software pipelining work for all possible backends.
The logic is backend independent. However, it needs to analyze and rewrite buffer accesses, including regular ones via `BufferLoad` and `BufferStore`, as well as opaque accesses. For opaque accesses, we need to add a specific rule like this one. If there are new backends or new intrinsics, the only thing needed is to add another rule here.
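The rule-per-intrinsic approach described above can be sketched as a small dispatch table. This is a hypothetical illustration, not TVM's actual implementation: the names `OPAQUE_ACCESS_RULES`, `register_rule`, and `extract_access` are invented for this sketch, and only the intrinsic name `tvm_load_matrix_sync` comes from the code under review.

```python
# Hypothetical sketch: each opaque intrinsic registers one rule that tells
# the pipeline rewriter which buffer a call touches and how. Supporting a
# new backend/intrinsic then only means registering another rule; the core
# rewriting logic stays untouched.

OPAQUE_ACCESS_RULES = {}

def register_rule(intrin_name):
    def deco(fn):
        OPAQUE_ACCESS_RULES[intrin_name] = fn
        return fn
    return deco

@register_rule("tvm_load_matrix_sync")
def _load_matrix_sync(args):
    # args[0] is the destination fragment buffer: the call reads global
    # memory and writes the fragment.
    return {"buffer": args[0], "kind": "write"}

def extract_access(intrin_name, args):
    # Look up the access rule for an opaque intrinsic call.
    rule = OPAQUE_ACCESS_RULES.get(intrin_name)
    if rule is None:
        raise KeyError(f"no access rule for {intrin_name}")
    return rule(args)
```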
Just came here to ask the exact same question :) I really want to see the decoupled implementation here if possible. I can imagine things becoming quickly messy as we add more backends.
How about making each specific backend inherit from this class?
While the logic is admittedly backend independent, I believe that backend-specific logic should be split out as separate functionality. For example, in this particular case, the refactoring we need is:
- Move this method into `FragmentInfo::GetSize`
- The caller side uses `fragment_info_.at(old_buffer->data.get()).GetSize()`
Furthermore, we might want to consolidate all this special-handling logic into dedicated classes, where WMMA could be one such instance.
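The consolidation suggested here could look roughly like the following: backend-specific handling lives behind a common interface, and the core pass talks only to that interface, with WMMA as one concrete instance. All class and method names below (`OpaqueBufferHandler`, `WmmaHandler`, `find_handler`) are illustrative, not TVM's actual API.

```python
# Hypothetical sketch of class-based consolidation: the pipeline pass
# asks the interface, never a specific backend.

class OpaqueBufferHandler:
    """Interface for backend-specific opaque-buffer handling."""
    def matches(self, intrin_name):
        raise NotImplementedError
    def fragment_size(self, shape):
        raise NotImplementedError

class WmmaHandler(OpaqueBufferHandler):
    # WMMA is just one instance of the interface; other backends would
    # add their own subclass instead of branching inside the pass.
    def matches(self, intrin_name):
        return "matrix_sync" in intrin_name
    def fragment_size(self, shape):
        m, n = shape  # fragment shape recorded during analysis
        return m * n

HANDLERS = [WmmaHandler()]

def find_handler(intrin_name):
    # Return the first handler claiming this intrinsic, if any.
    for h in HANDLERS:
        if h.matches(intrin_name):
            return h
    return None
```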
```python
T.reads(A[tx, i, 0:16])
T.writes(C[tx, i, 0:16])
A_shared = T.alloc_buffer((16, 1, 16), dtype="float32", scope="shared")
for j in T.serial(0, 16):
```
In this example as well as the last two (`nested_pipeline_interleaving` and `nested_pipeline_double_buffer`), the same index variable is used in multiple loops. While not incorrect, it can make it hard to compare the pre- and post-transformed TIR, because the variable (`j` in the examples) could belong to multiple source loops. It might make the mapping clearer to use unique index vars throughout.
```python
T.reads([A_shared[tx, 0, j]])
T.writes([A_local[(i + 1) % 2, 0, 0, j]])
T.block_attr({"double_buffer_scope": 0})
A_local[(i + 1) % 2, 0, 0, j] = A_shared[tx, i + 1, j]
```
Shouldn't the annotation `"double_buffer_scope": 0` force buffer accesses to use index 0? Here it's using the normal result of the transform, `(i + 1) % 2`.
`double_buffer_scope` is a hint for buffer resizing inside the software pipeline. In some cases the buffer doesn't need to be resized; this annotation forces it to be resized (double-buffered). `double_buffer_scope: 0` means double buffering should be applied to the 0-th buffer written by this block, i.e. `A_local`. The buffer index is always `(i + loop_offset) % num_versions`.
I see; I misunderstood the "override" that was happening. The annotation stops the transform from "optimizing" away the rewrite that makes it double-buffered. The argument refers to the i-th buffer.
Thank you!
@JosephTheOctonaut do you think we could improve our docs to make them less misleading? If so, would you like to suggest some changes? Thanks!
@junrushao1994 Sure, I'd be happy to! I think I need a bit more time to finish understanding everything going on in the examples, but afterwards I'll try to put together some coherent suggestions or edits.
Sounds great! Let’s work together to make our doc really slick :-)
```cpp
PrimExpr VisitExpr_(const CallNode* op) final {
  // Intrinsic calls should be handled explicitly here as they are opaque
  // accesses to the buffer.
  static const auto& load_matrix_sync = builtin::tvm_load_matrix_sync();
```
Looks like we need to handle a few opaque intrinsics to make our analysis possible. Given that the set of intrinsics could expand, do you think it's possible to generalize this mechanism and make the logic clearer?
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Comments are addressed.
Hey, thanks @vinx13 for the huge effort! The PR overall is in pretty good shape, but there is only one thing we need to further improve: I noticed that there are a few interesting special handlings for:
The good news is that all of the above are fundamentally used in only one method. Let me know if it could work! Thanks a lot!
@junrushao1994 the comment has been addressed. I've refactored all WMMA-related logic into a class.
Thanks @vinx13! This is a huge effort!!
* [TIR] Add software pipelining
* fix
* fix
* lint
* fix
* format
* doc
* remove print
* lint
* lint
* doc
* Apply suggestions from code review
* address comments
* address comments
* refactor FragmentInfo::GetSize
* remove unused
* refactor
* address comments

Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Xiyou Zhou <xiyou@octoml.ai>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
This PR adds a pass `InjectSoftwarePipeline` that transforms an annotated loop into a pipelined one.

Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Xiyou Zhou <xiyou@octoml.ai>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
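For readers unfamiliar with the transform, here is a conceptual sketch of what software pipelining does, in plain Python rather than TIR (this is an illustration of the general technique, not the pass's actual code): a loop whose body has two stages (load, then compute) is split so that the load for iteration `i + 1` is issued alongside the compute for iteration `i`, with a prologue issuing the first load.

```python
# Conceptual two-stage software pipeline: loads of iteration i + 1
# overlap with computes of iteration i (on real hardware they would
# run concurrently; here we only show the reordering).

def pipelined(n, load, compute):
    results = []
    buf = load(0)                   # prologue: issue the first load
    for i in range(n):
        if i + 1 < n:
            next_buf = load(i + 1)  # prefetch for the next iteration
        results.append(compute(buf))
        if i + 1 < n:
            buf = next_buf          # rotate buffers
    return results

# Equivalent to [compute(load(i)) for i in range(n)], but adjacent
# iterations' loads and computes can be overlapped.
```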
@junrushao1994 @masahi @JosephTheOctonaut @Hzfengsy @spectrometerHBH @jinhongyii @MasterJH5574