[MetaSchedule] Improve inlining and VerifyGPUCode for quantized model workload #13334

Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
(branch force-pushed from 6e62f2a to 337c1c1)
Hey, thanks for the contribution! I was a bit uncertain if we really want to do name checking to determine constants from the compile engine, because it relies on the assumption that relay exists and that relay always uses the "compile_engine_const" name. There is an alternative I could come up with, and please let me know if it makes sense: add a block annotation in tvm/src/relay/backend/te_compiler_cache.cc (line 275 in fbe174b), so that the constant block carries `T.block_attr({"schedule_rule": "compute_inline"})`. Then register a PackedFunc for that schedule rule. Let me know if it makes sense!
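As a minimal sketch of that suggestion (assuming the convention that the `schedule_rule` annotation value names a globally registered PackedFunc which PostOrderApply dispatches to for the annotated block; the registered name and the signature below are assumptions for illustration, not code from this PR):

```python
# Hypothetical sketch: register a PackedFunc that could be dispatched to when a
# block carries T.block_attr({"schedule_rule": "compute_inline"}).
# The registered name and the (Schedule, BlockRV) -> [Schedule] signature are
# assumptions for illustration only.
import tvm
from tvm.tir import Schedule
from tvm.tir.schedule import BlockRV


@tvm.register_func("compute_inline")
def _schedule_rule_compute_inline(sch: Schedule, block: BlockRV):
    # Inline the annotated constant block so it no longer blocks other rules.
    sch.compute_inline(block)
    return [sch]
```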
@junrushao I like your idea, I'll rework this.
@junrushao I realized that an easier way would be to check the content of the block to determine if it is a constant block, rather than relying on the block name.
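As a rough illustration of that structural check (not the PR's actual implementation), a block could be treated as a constant block when it reads no buffers and merely stores a scalar constant; `is_constant_block` below is a hypothetical helper:

```python
# Hedged sketch: identify a "constant block" by its structure rather than by name.
from tvm import tir


def is_constant_block(block: tir.Block) -> bool:
    # A constant block reads nothing and writes a single buffer.
    if len(block.reads) != 0 or len(block.writes) != 1:
        return False
    body = block.body
    # The body should be a plain store of a scalar immediate value.
    return isinstance(body, tir.BufferStore) and isinstance(
        body.value, (tir.IntImm, tir.FloatImm)
    )
```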
(branch force-pushed from 337c1c1 to f398453)
Removed the identification of constant blocks by name, and replaced it with a more robust method based on the block structure.
cc @vinx13 @junrushao please take a look.
LGTM!
[MetaSchedule] Improve inlining and VerifyGPUCode for quantized model workload (apache#13334)

* [MetaSchedule] Add a new schedule rule to inline all scalar constants
* add doc
* reorg
* identify constant block by its structure, not by name
These "compile_engine_const" blocks can be inlined by the existing `AutoInline` rule, but depending on the order in which spatial blocks are processed by `AutoInline`, they can get in the way of `ReverseComputeInline` on other blocks, since the constant blocks are also counted as producer blocks. `PostOrderApply` currently processes the constant blocks at the very end, so `ReverseComputeInline` on blocks that consume such constants always fails to inline. So in practice, we are not generating a fused kernel for quantized conv2d today.

I added a simple inlining rule that inlines only such constant blocks. This rule is supposed to run before `AutoInline`, to unblock `ReverseComputeInline`. This lets us generate a fused kernel. On the int8 resnet50 model from PyTorch, the end-to-end perf improved from 6.8 to 5.2 msec, using batch size 16 and the same number of trials.
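As a rough sketch of how such a rule would be ordered relative to `AutoInline` (the `InlineConstantScalars` name and the argument list below are assumptions for illustration, not necessarily the final interface of this PR):

```python
# Hedged sketch: place the constant-inlining rule before AutoInline in the
# schedule-rule list, so constant blocks stop counting as producers before
# ReverseComputeInline is attempted on other blocks. Names and arguments are
# illustrative only.
from tvm import meta_schedule as ms

schedule_rules = [
    ms.schedule_rule.InlineConstantScalars(),  # hypothetical name for the new rule
    ms.schedule_rule.AutoInline(
        into_producer=False,
        into_consumer=True,
        inline_const_tensor=True,
        disallow_if_then_else=False,
        require_injective=False,
        require_ordered=False,
    ),
]
```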
`VerifyGPUCode` only checks the vector width used in `BufferLoad` and `BufferStore`. But quantized models use specialized intrinsics such as `q_multiply_shift_per_axis`, which uses 64-bit arithmetic internally. To accurately account for the data types used in a block, we need to lower those intrinsics before invoking TIR `VerifyGPUCode` and check the dtype of `CastNode`.
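For illustration only, one way to account for those wider types after lowering is to walk the PrimFunc and record the widest dtype introduced by `Cast` nodes; `max_cast_bits` below is a hypothetical helper, not the verifier change made in this PR:

```python
# Hedged sketch: after lowering intrinsics like q_multiply_shift_per_axis,
# scan the lowered PrimFunc for Cast nodes and report the widest bit width,
# so a GPU-code check could account for 64-bit intermediate arithmetic.
import tvm
from tvm import tir


def max_cast_bits(func: tir.PrimFunc) -> int:
    widest = 0

    def visit(node):
        nonlocal widest
        if isinstance(node, tir.Cast):
            widest = max(widest, tvm.DataType(node.dtype).bits)

    tir.stmt_functor.post_order_visit(func.body, visit)
    return widest
```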
@vinx13 @junrushao @zxybazh