[MetaSchedule] Fuse loops around shared to global store block in MultiLevelTilingTensorCore
#13357
Currently, vectorization of the shared to global store in tensor core auto-tensorization is not done properly, since most blocks have a `T.where` predicate that disables vectorization. The predicate is introduced by the `Split` in cooperative fetch: https://github.com/apache/tvm/blob/main/src/meta_schedule/postproc/rewrite_cooperative_fetch.cc#L159-L162

As the code there says, this split is supposed to be applied to a fused loop. That is the case for cache read blocks, where `AddReadReuse` explicitly fuses the loops around them. But `AddWriteReuseTensorCore` doesn't fuse loops after the cache write: https://github.com/apache/tvm/blob/main/src/meta_schedule/schedule_rule/multi_level_tiling_tensor_core.cc#L260-L262. So for cache write blocks, we always try to split a single axis by large factors like `[None, 4, 32, 2]`. Unless the sampled factor for that axis is large, we always end up with `T.where` in the shared to global copy block.

This PR adds the missing fusion. Now all candidate samples have the shared to global copy block properly vectorized. Unfortunately, there was no perf improvement from this change after e2e tuning.
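For illustration, here is a minimal sketch of why splitting a fused loop avoids the predicate. This is not the PR's code: the toy `copy` PrimFunc, its 16x16 shapes, and the block name are made up for the example; only the `[None, 4, 32, 2]` split factors come from the description above.

```python
import tvm
from tvm.script import tir as T


@T.prim_func
def copy(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (16, 16), "float32")
    B = T.match_buffer(b, (16, 16), "float32")
    for i, j in T.grid(16, 16):
        with T.block("copy"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj]


sch = tvm.tir.Schedule(copy)
block = sch.get_block("copy")
i, j = sch.get_loops(block)

# Splitting a single 16-wide axis by [None, 4, 32, 2] cannot divide evenly,
# so the rewrite would have to guard the body with a T.where predicate and
# the inner loop could not be vectorized. Fusing first (what this PR adds
# for the shared -> global store) gives a loop of extent 256, which the
# factors divide exactly, so no predicate is needed.
fused = sch.fuse(i, j)
_, _, _, vec = sch.split(fused, factors=[None, 4, 32, 2])
sch.vectorize(vec)
print(sch.mod.script())
```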
For quantized workloads, vectorization of the shared to global copy is disabled, since we would also end up vectorizing requantization-related math involving 64-bit arithmetic, and the generated code currently fails to compile.
@vinx13 @junrushao