[Metaschedule] Auto tensorization for CPU / GPU dot product #11088
Conversation
if (Optional<String> intrin_name =
        tir::GetAnn<String>(block_sref, tir::attr::meta_schedule_auto_tensorize)) {
  std::string block_name = block_sref->StmtAs<tir::BlockNode>()->name_hint;
  if (block_name.find("init") == std::string::npos) {
DecomposeReduction, applied before this postproc, copies the meta_schedule_auto_tensorize attribute to the init block as well. So we need to make sure that we won't try to tensorize the init block even though it carries the meta_schedule_auto_tensorize annotation.
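To make this concrete, here is a minimal stand-in sketch (a toy 16x16x16 matmul, not the PR's code; the intrin name "some_intrin" is made up) showing the annotation being copied to the init block:

import tvm
from tvm.script import tir as T


@T.prim_func
def matmul(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (16, 16), "float32")
    B = T.match_buffer(b, (16, 16), "float32")
    C = T.match_buffer(c, (16, 16), "float32")
    for i, j, k in T.grid(16, 16, 16):
        with T.block("update"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]


sch = tvm.tir.Schedule(matmul)
update = sch.get_block("update")
# Pretend a schedule rule marked this block for tensorization.
sch.annotate(update, "meta_schedule_auto_tensorize", "some_intrin")
i, j, k = sch.get_loops(update)
# Decomposing before loop j gives the init block its own fresh loop.
init = sch.decompose_reduction(update, j)
# Per the discussion above, the init block inherits the annotation too,
# hence the name-based "init" guard in the C++ snippet.
print(sch.get(init).annotations)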
There is target-specific handling here; ideally we could make the init block behavior configurable in the meta schedule rule. It is fine for now.
ICHECK(child_blocks.size() == 1);
Array<LoopRV> init_loops = sch->GetLoops(child_blocks[0]);
ICHECK(init_loops.size() == 1);
sch->Vectorize(init_loops[0]);
Related to the above: since DecomposeReduction introduces a new loop that should be vectorized on CPU, for now I'm applying vectorization to the decomposed init loop here. This can also be done in RewriteReductionBlock.
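Continuing the stand-in matmul sketch from the earlier comment (again an illustration, not the PR's code), the manual step corresponds roughly to:

# The init block produced by decompose_reduction has a fresh loop of its
# own; vectorizing it mirrors what the C++ snippet above does.
init_i, init_j = sch.get_loops(init)
sch.vectorize(init_j)    # vectorize the decomposed init loop
print(sch.mod.script())  # the init loop now shows up as T.vectorized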
Does postproc::RewriteParallelVectorizeUnroll work for this case?
I hope it would, but it doesn't. Also, since parallelization etc. is supposed to be applied before DecomposeReduction, I don't think running RewriteParallelVectorizeUnroll after RewriteReductionBlock() is a good idea. So vectorization of the init loop has to be done manually somehow.
I'd prefer vectorizing the init loop right after we run DecomposeReduction during RewriteReductionBlock, since vectorization of the init loop should be done on CPU regardless of tensorization. cc @MasterJH5574
Interesting! What’s the order of post-processors being applied now? Perhaps we should reflect this order by adding this post-processor to tune.py (tvm/python/tvm/meta_schedule/tune.py, Lines 159 to 170 in effc23d):
@staticmethod
def _postproc() -> List[Postproc]:
    from tvm.meta_schedule import postproc as M

    return [
        M.DisallowDynamicLoop(),
        M.RewriteCooperativeFetch(),
        M.RewriteUnboundBlock(),
        M.RewriteParallelVectorizeUnroll(),
        M.RewriteReductionBlock(),
        M.VerifyGPUCode(),
    ]
The issue in question is vectorization for CPU targets. I'm using the default postprocs in tvm/python/tvm/meta_schedule/tune.py (Lines 96 to 103 in effc23d):
def _postproc() -> List[Postproc]:
    from tvm.meta_schedule import postproc as M

    return [
        M.DisallowDynamicLoop(),
        M.RewriteParallelVectorizeUnroll(),
        M.RewriteReductionBlock(),
    ]
Since loop parallelization and vectorization check the "compact dataflow" constraint (tvm/src/tir/schedule/primitive/for_kind.cc, Line 160 in 0ddaaa6: CheckSubtreeCompactDataflow(self, loop_sref);), they need to be applied before DecomposeReduction in RewriteReductionBlock(). So having RewriteParallelVectorizeUnroll before RewriteReductionBlock() in the default postprocs makes sense.
However, this is not sufficient to vectorize the init loop of a reduction block, since that loop is generated during RewriteReductionBlock(). I don't think we should run RewriteParallelVectorizeUnroll again after RewriteReductionBlock() (and it doesn't work anyway), so we need to manually vectorize the decomposed init loop in RewriteReductionBlock or the new RewriteTensorize postproc I added. I prefer the former.
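For concreteness, a hedged sketch of the CPU postproc ordering under discussion (not the exact upstream list; RewriteTensorize is the postproc added in this PR and its constructor arguments may differ):

from typing import List

from tvm.meta_schedule import postproc as M
from tvm.meta_schedule.postproc import Postproc


def _postproc() -> List[Postproc]:
    return [
        M.DisallowDynamicLoop(),
        M.RewriteParallelVectorizeUnroll(),  # must run before decomposition
        M.RewriteReductionBlock(),           # decomposes reductions, creating init blocks
        M.RewriteTensorize(),                # new: tensorizes annotated blocks
    ]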
In this case I want to tensorize the reduction block. So before DecomposeReduction is called, the loop kind of the reduction is serial, which makes the decomposed init loop serial as well.
I see. So the schedule rule ParallelVectorizeUnroll wasn’t applied to the block we want to tensorize either 🤔?
Ah yes (otherwise tensorize pattern matching fails, because an intrin desc is always serial). I'm not exactly sure what prevents ParallelVectorizeUnroll from tampering with the block we want to tensorize (which is a good thing); maybe the Blockize I do at

tir::BlockRV outer_block = sch->Blockize(tiled_loop_rv.value());

(after tiling the inner loop nests to be tensorized) is helping?
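A hedged sketch of that Blockize step, on a fresh schedule of the stand-in matmul from the earlier comments (tile sizes are arbitrary, and the annotation mirrors what the rule is described as doing, so treat it as an assumption):

# Tile the loop nest, then wrap the inner tile in its own block; the
# isolated inner block is plausibly what keeps ParallelVectorizeUnroll
# away from the loops we intend to tensorize.
sch2 = tvm.tir.Schedule(matmul)
blk = sch2.get_block("update")
i, j, k = sch2.get_loops(blk)
i0, i1 = sch2.split(i, factors=[None, 4])
j0, j1 = sch2.split(j, factors=[None, 4])
k0, k1 = sch2.split(k, factors=[None, 4])
sch2.reorder(i0, j0, k0, i1, j1, k1)
outer = sch2.blockize(i1)  # the (i1, j1, k1) tile becomes a separate block
sch2.annotate(outer, "meta_schedule_auto_tensorize", "some_intrin")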
Quite interesting.. So here the case is: on one hand we don’t want the block to be annotated by the rule ParallelVectorizeUnroll, but on the other hand we do want its init block to be vectorized after the decomposition. Am I right?
Since before decomposition the block wasn’t annotated by ParallelVectorizeUnroll, the decomposed init block isn’t vectorized, which makes sense. In addition, the decomposed init block doesn’t carry any information indicating that it’s supposed to be vectorized (e.g., it doesn’t have a “needs vectorization” annotation). In this case, whether we vectorize the init block loop in RewriteReductionBlock or RewriteTensorize, it all relies on our human knowledge, which I don’t think is perfect.
For upstreaming, it might be okay to do manual vectorization in RewriteTensorize (how does the vectorization in RewriteTensorize bypass the compact dataflow issue BTW?). But in the long term I suppose we should enhance the compact dataflow check to allow such vectorization. After all, such vectorization won’t incur any incorrectness.
“Quite interesting.. So here the case is, on one hand we don’t want the block being annotated by rule ParallelVectorizeUnroll, but on the other hand we do want its init block to be vectorized after the decomposition. Am I right?”

Exactly.

“How does the vectorization in RewriteTensorize bypass the compact dataflow issue BTW?”

That's a great question! Until recently, vectorization of the init loop after DecomposeReduction was rejected by the compact dataflow check. I brought this topic to @Hzfengsy, and the team came up with a relaxation of the constraint that allows vectorizing the init loop; that is PR #10705.
Yeah, ideally all outer loop parallelization and inner loop vectorization could be done by one pass of ParallelVectorizeUnroll, meaning we would run it after DecomposeReduction. Currently, outer loop parallelization after DecomposeReduction would be rejected by the compact dataflow check, but I think this is still too restrictive.
I'm super excited to see this PR!! Would love to have some helping hands review this PR :-) CC: @vinx13 @spectrometerHBH
Some perf numbers on int8 VNNI, Rocket Lake 6 core
RTX 3070 with DP4A (FP32 peak around 16 TFLOPS)
AMDGPU RX 6600 XT with DP4A (FP32 peak around 10 TFLOPS)
Thanks for the efforts! Excited to see auto-tensorization happening!
Should we update the list of post-processors here as well?
tvm/include/tvm/meta_schedule/postproc.h, Line 110 in effc23d:
class Postproc : public runtime::ObjectRef {
[Metaschedule] Auto tensorization for CPU / GPU dot product (#11088)

* [Metaschedule] Auto-tensorization for CPU / GPU dot product
* doc update
* add vnni conv2d test
* add dp4a test
* adding tests for rewrite_tensorize
* add rewrite_tensorize test
* add missing pydoc
* black
* more doc
* adding auto tensorize integration test
* add dp4a test
* fix target name
* fix dtype in test
* skip bert test
* replace hard-coded llvm intrinsic id in test with look up
* remove unnecessary include, add doc for the rest of params
* update postproc.h
* update doc
* fix shape in te matmul workload
* fix newline in cppdoc

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>
Building on #11075, this adds the MultiLevelTilingWithIntrin schedule rule and the RewriteTensorize postproc, which together can be used for auto-tensorization with a single intrinsic, such as a CPU / GPU dot product. This is the simplest but non-trivial use of auto-tensorization. The diff looks large, but most of it is boilerplate from tests; the actual change to enable auto-tensorization is about 300 lines.

MultiLevelTilingWithIntrin can be used to auto-tensorize schedules with the following intrinsics, and we should be able to deprecate the corresponding manual templates in AutoTVM, although detailed perf analysis is yet to be done:

* sdot (cc @tkonolige)
* dp4a for cuda, SPIRV integer dot product for vulkan, and AMDGPU gfx10 sdot4 for rocm

As a demonstration, I've added integration tests in tests/python/integration/test_meta_schedule_auto_tensorize.py, one of which is E2E auto-tensorization on quantized bert-base x {VNNI, DP4A}. DP4A tests can also run on AMDGPU via the vulkan or rocm backends (@mei-ye @tmoreau89). A hedged sketch of how these pieces fit together follows the mentions below.

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>
@junrushao1994 @vinx13 @comaniac @mbrookhart @spectrometerHBH @Hzfengsy @MasterJH5574 @jinhongyii
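A hedged sketch of how the new pieces fit together (the intrin name "dot_16x4_vnni" and the exact constructor signatures are assumptions based on TVM's APIs around this PR, not verbatim from it):

from tvm.meta_schedule import postproc as M
from tvm.meta_schedule import schedule_rule as SR

# Tile with the standard CPU structure, but match the inner tile against a
# registered tensor intrin and annotate the blockized outer block with
# meta_schedule_auto_tensorize for the postproc to pick up.
rule = SR.MultiLevelTilingWithIntrin(
    intrin_name="dot_16x4_vnni",  # assumed name of a registered VNNI intrin
    structure="SSRSRS",           # standard CPU tiling structure
)

# RewriteTensorize then finds annotated blocks and calls tensorize on them,
# after RewriteReductionBlock has decomposed the reductions.
postprocs = [
    M.DisallowDynamicLoop(),
    M.RewriteParallelVectorizeUnroll(),
    M.RewriteReductionBlock(),
    M.RewriteTensorize(),
]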