[TIR] Add software pipelining #10066
Conversation
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Xiyou Zhou <xiyou@octoml.ai>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
```cpp
  return std::move(load);
}

int GetWmmaFragmentSize(const Buffer& buffer) {
```
Can we decouple this logic from TensorCore? i.e., let software pipelining work for all possible backends.
The logic is backend independent. However, it needs to analyze and rewrite buffer accesses, including regular ones via `BufferLoad` and `BufferStore`, as well as opaque accesses. For opaque accesses, we need to add a specific rule like this one. If there are new backends or new intrinsics, the only thing needed is to add another rule here.
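The rule-per-intrinsic approach described above can be sketched as a small dispatch table. This is a hypothetical illustration, not TVM's actual implementation: the names `OPAQUE_ACCESS_RULES`, `register_rule`, and `extract_access` are invented for this sketch, and only the intrinsic name `tvm_load_matrix_sync` comes from the code under review.

```python
# Hypothetical sketch: each opaque intrinsic registers one rule that tells
# the pipeline rewriter which buffer a call touches and how. Supporting a
# new backend/intrinsic then only means registering another rule; the core
# rewriting logic stays untouched.

OPAQUE_ACCESS_RULES = {}

def register_rule(intrin_name):
    def deco(fn):
        OPAQUE_ACCESS_RULES[intrin_name] = fn
        return fn
    return deco

@register_rule("tvm_load_matrix_sync")
def _load_matrix_sync(args):
    # args[0] is the destination fragment buffer: the call reads global
    # memory and writes the fragment.
    return {"buffer": args[0], "kind": "write"}

def extract_access(intrin_name, args):
    # Look up the access rule for an opaque intrinsic call.
    rule = OPAQUE_ACCESS_RULES.get(intrin_name)
    if rule is None:
        raise KeyError(f"no access rule for {intrin_name}")
    return rule(args)
```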
Just came here to ask the exact same question :) I really want to see the decoupled implementation here if possible. I can imagine things becoming quickly messy as we add more backends.
How about making each specific backend inherit from this class?
While the logic is admittedly backend independent, I believe that backend-specific logic should be split out as separate functionality. For example, in this particular case, the refactoring we need is:
- Move this method into `FragmentInfo::GetSize`
- The caller side uses `fragment_info_.at(old_buffer->data.get()).GetSize()`
Furthermore, we might want to consolidate all this special-handling logic into dedicated classes, where WMMA could be one such instance.
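The consolidation suggested here could look roughly like the following: backend-specific handling lives behind a common interface, and the core pass talks only to that interface, with WMMA as one concrete instance. All class and method names below (`OpaqueBufferHandler`, `WmmaHandler`, `find_handler`) are illustrative, not TVM's actual API.

```python
# Hypothetical sketch of class-based consolidation: the pipeline pass
# asks the interface, never a specific backend.

class OpaqueBufferHandler:
    """Interface for backend-specific opaque-buffer handling."""
    def matches(self, intrin_name):
        raise NotImplementedError
    def fragment_size(self, shape):
        raise NotImplementedError

class WmmaHandler(OpaqueBufferHandler):
    # WMMA is just one instance of the interface; other backends would
    # add their own subclass instead of branching inside the pass.
    def matches(self, intrin_name):
        return "matrix_sync" in intrin_name
    def fragment_size(self, shape):
        m, n = shape  # fragment shape recorded during analysis
        return m * n

HANDLERS = [WmmaHandler()]

def find_handler(intrin_name):
    # Return the first handler claiming this intrinsic, if any.
    for h in HANDLERS:
        if h.matches(intrin_name):
            return h
    return None
```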
```python
T.reads(A[tx, i, 0:16])
T.writes(C[tx, i, 0:16])
A_shared = T.alloc_buffer((16, 1, 16), dtype="float32", scope="shared")
for j in T.serial(0, 16):
```
In this example as well as the last two (`nested_pipeline_interleaving` and `nested_pipeline_double_buffer`), the same index variable is used in multiple loops. While not incorrect, it can make it hard to compare the pre- and post-transformed TIR, because the variable (`j` in the examples) could belong to multiple source loops. It might make the mapping clearer to use unique index vars throughout.
```python
T.reads([A_shared[tx, 0, j]])
T.writes([A_local[(i + 1) % 2, 0, 0, j]])
T.block_attr({"double_buffer_scope": 0})
A_local[(i + 1) % 2, 0, 0, j] = A_shared[tx, i + 1, j]
```
Shouldn't the annotation `"double_buffer_scope": 0` force buffer accesses to use index 0? Here it's using the normal result of the transform, `(i + 1) % 2`.
`double_buffer_scope` is a hint for buffer resizing inside the software pipeline. In some cases the buffer doesn't need to be resized; this annotation forces it to be resized (double-buffered). `double_buffer_scope: 0` means double buffering should be applied to the 0-th buffer written by this block, i.e. `A_local`. The buffer index is always `(i + loop_offset) % num_versions`.
I see; I misunderstood the "override" that was happening. The annotation stops the transform from "optimizing" away the rewrite that makes it double-buffered. The argument refers to the i-th buffer.
Thank you!
@JosephTheOctonaut do you think we could improve our docs to make them less misleading? If so, would you like to suggest some changes? Thanks!
@junrushao1994 Sure, I'd be happy to! I think I need a bit more time to finish understanding everything going on in the examples, but afterwards I'll try to put together some coherent suggestions or edits.
Sounds great! Let’s work together to make our doc really slick :-)
```cpp
PrimExpr VisitExpr_(const CallNode* op) final {
  // Intrinsic calls should be handled explicitly here as they are opaque
  // accesses to the buffer.
  static const auto& load_matrix_sync = builtin::tvm_load_matrix_sync();
```
Looks like we need to handle a few opaque intrinsics to make our analysis possible. Given that the set of intrinsics could expand, do you think it's possible to generalize this mechanism and make the logic clearer?
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Comments are addressed.
Hey, thanks @vinx13 for the huge effort! The PR overall is in pretty good shape, but there is only one thing we need to further improve: I noticed that there are a few interesting special handlings for:
The good news is that all of the above are fundamentally used in only one method. Let me know if it could work! Thanks a lot!
@junrushao1994 the comment has been addressed. I've refactored all WMMA-related logic into a class.
Thanks @vinx13! This is a huge effort!!
* [TIR] Add software pipelining
* fix
* fix
* lint
* fix
* format
* doc
* remove print
* lint
* lint
* doc
* Apply suggestions from code review
* address comments
* address comments
* refactor FragmentInfo::GetSize
* remove unused
* refactor
* address comments

Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Xiyou Zhou <xiyou@octoml.ai>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
This PR adds a pass `InjectSoftwarePipeline` that transforms an annotated loop into a pipelined one.

Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Xiyou Zhou <xiyou@octoml.ai>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
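For readers unfamiliar with the transform, here is a conceptual sketch of what software pipelining does, in plain Python rather than TIR (this is an illustration of the general technique, not the pass's actual code): a loop whose body has two stages (load, then compute) is split so that the load for iteration `i + 1` is issued alongside the compute for iteration `i`, with a prologue issuing the first load.

```python
# Conceptual two-stage software pipeline: loads of iteration i + 1
# overlap with computes of iteration i (on real hardware they would
# run concurrently; here we only show the reordering).

def pipelined(n, load, compute):
    results = []
    buf = load(0)                   # prologue: issue the first load
    for i in range(n):
        if i + 1 < n:
            next_buf = load(i + 1)  # prefetch for the next iteration
        results.append(compute(buf))
        if i + 1 < n:
            buf = next_buf          # rotate buffers
    return results

# Equivalent to [compute(load(i)) for i in range(n)], but adjacent
# iterations' loads and computes can be overlapped.
```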
@junrushao1994 @masahi @JosephTheOctonaut @Hzfengsy @spectrometerHBH @jinhongyii @MasterJH5574