[MetaSchedule][ARM] Enable ARM CPU intrinsic for MetaSchedule #14209
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
I am leaving comments here to think about how to make the code cleaner, but they are not required fixes.
@@ -180,7 +180,7 @@ class ScheduleRule : public runtime::ObjectRef {
  * \return The schedule rule created
  */
 TVM_DLL static ScheduleRule MultiLevelTilingWithIntrin(
-    String intrin_name, String structure, Optional<Array<String>> tile_binds,
+    Array<String> intrin_name, String structure, Optional<Array<String>> tile_binds,
I understand why it was done (String -> Array), but it should be rethought once more: the API change affects other places, not only your own task.
Yes, the new API changes were reverted to the original API, while keeping the new functionality.
@@ -85,21 +101,23 @@ class MultiLevelTilingWithIntrinNode : public MultiLevelTilingNode {

 public:
  /*! \brief The name of a tensor intrinsic. */
- String intrin_name;
+ Array<String> intrin_name;
If the field type is still being changed, I recommend renaming it to intrin_names for the sake of clarity.
Outdated.
@@ -110,6 +155,16 @@ void SpaceGeneratorNode::InitializeWithTuneContext(const TuneContext& context) {
   default_sch_rules = ScheduleRule::DefaultMicro();
   default_postprocs = Postproc::DefaultMicro();
   default_mutator_probs = Mutator::DefaultMicro();
+} else if (kind == "neon") {
It looks like different levels of target types are checked here. Possibly this should be an "arm" type, with the split into "neon"/"dotprod" handled in a separate method.
Done.
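The structure the reviewer asked for can be sketched as follows. This is a hypothetical illustration, not the real TVM code: only the coarse target kind is checked at the top level, and the finer neon/dotprod choice lives in a dedicated helper.

```python
# Illustrative sketch (all names are placeholders, not TVM's actual API):
# dispatch on the coarse "arm" kind first, then pick "neon"/"dotprod"
# defaults in a separate method instead of one mixed if/else chain.

def default_rules_for_arm(features):
    """Pick default schedule rules based on detected ARM CPU features."""
    if "dotprod" in features:
        return "DefaultARMDotprod"   # stands in for ScheduleRule::DefaultARMDotprod()
    if "neon" in features:
        return "DefaultARMNeon"      # stands in for ScheduleRule::DefaultARMNeon()
    return "DefaultCPU"              # generic fallback

def initialize_defaults(kind, features):
    """Top-level dispatch: only the coarse target kind is checked here."""
    if kind == "arm":
        return default_rules_for_arm(features)
    return "DefaultCPU"
```

The benefit is that adding a new ARM feature level later touches only the helper, not the top-level dispatch.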
// return HasFlag_(attr.value(), flag);
// }

static inline bool HasFlag_(Optional<Array<String>> attr, std::string flag) {
Looks like we have the same code in src/target/parsers/aprofile.cc. Instead of duplicating it, can we move it into a common place?
The code duplication has been fixed; we now use a different method to pull specific keys from the target.
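The idea of the fix can be sketched like this (hypothetical helper and data layout, not TVM's actual target representation): rather than re-parsing attribute flags with a duplicated HasFlag_-style helper, the code queries keys/features already attached to the parsed target.

```python
# Illustrative sketch only: instead of duplicating flag-parsing logic,
# look up a pre-parsed feature set carried by the target itself.

def has_feature(target, feature):
    """Check a pre-parsed feature set on the target; no flag re-parsing."""
    return feature in target.get("features", set())

# A toy stand-in for a parsed target (real TVM targets are richer objects).
target = {
    "kind": "llvm",
    "keys": ["arm_cpu", "cpu"],
    "features": {"neon", "dotprod"},
}
```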
ScheduleRule::AddRFactor(
    /*max_jobs_per_core=*/8,
    /*max_innermost_factor=*/Integer(32)),
ScheduleRule::MultiLevelTilingWithIntrin(
As I understand it, the new API in MultiLevelTilingWithIntrin is not required anymore?
Yes, it is not in use right now.
@ibsidorenko could you review my changes, please? :)
vec_c = C.vload([0], dtype="int32x4")

C[T.ramp(T.int32(0), 1, 4)] = T.call_llvm_pure_intrin(
    T.llvm_lookup_intrinsic_id("llvm.aarch64.neon.udot.v4u32.v16u8"),
You use the same intrinsic id "llvm.aarch64.neon.udot.v4u32.v16u8" in both cases. Is that OK?
When experimenting with the tflite_mobilenet_v3_quant model, we encountered a convolution multiplying uint8 × uint8 tensors into an int32 accumulator, which the existing sdot/udot intrinsics could not handle, so we created a new hdot intrinsic that works with this dtype layout. To my knowledge, there is no Neon instruction for the u8u8i32 layout of dtypes, so we call the closest available instruction instead; the intrinsic was successfully applied to that type of convolution and brought a performance benefit.
LGTM. Thanks!
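The hdot semantics discussed above — multiplying two uint8 vectors and accumulating into int32 lanes, analogous to what udot does for u8 × u8 -> u32 — can be written as a scalar reference in plain Python. This is an illustrative model of the instruction's behavior, not TVM code:

```python
# Scalar reference for a udot/hdot-style 4-way dot product:
# each of the 4 output lanes accumulates the products of four 8-bit lanes,
# mirroring llvm.aarch64.neon.udot.v4u32.v16u8, except the accumulator is
# treated as int32 (the "hybrid" u8 x u8 -> i32 case described above).

def hdot_reference(acc, a, b):
    """acc: 4 x int32, a/b: 16 x uint8. Returns the updated accumulator."""
    assert len(acc) == 4 and len(a) == 16 and len(b) == 16
    out = list(acc)
    for lane in range(4):
        for k in range(4):
            out[lane] += a[4 * lane + k] * b[4 * lane + k]
    return out
```

Note that the worst case, 4 * 255 * 255 = 260100 per step, fits comfortably in an int32 lane, which is why reusing the unsigned dot-product instruction is safe here.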
    vec_a,
    vec_b,
    dtype="int32x4",
)
It should be possible to clean up a lot of code duplication between the different dtypes here. See tensor_intrin/cuda.py for examples.
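The deduplication pattern the reviewer points at can be sketched as a parameterized factory: one function generates the per-dtype variants instead of several hand-copied blocks. All names below are illustrative placeholders, not the real TVM tensor-intrin helpers:

```python
# Sketch of deduplicating per-dtype intrinsic definitions: generate each
# variant from one parameterized factory (the approach used in TVM's
# tensor_intrin/cuda.py), rather than copying near-identical blocks.

def make_dot_product_desc(in_dtype, out_dtype, lanes=4):
    """Build a small description dict for a 4-way dot-product intrinsic."""
    return {
        "name": f"dot_{in_dtype}_{in_dtype}_{out_dtype}",
        "in_dtype": in_dtype,
        "out_dtype": out_dtype,
        "vec_out": f"{out_dtype}x{lanes}",
    }

# One loop replaces several hand-copied definitions.
DESCS = {d["name"]: d for d in (
    make_dot_product_desc("uint8", "uint32"),   # udot-style
    make_dot_product_desc("int8", "int32"),     # sdot-style
    make_dot_product_desc("uint8", "int32"),    # hdot-style (hybrid)
)}
```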
};
}

Array<ScheduleRule> ScheduleRule::DefaultARMDotprod() {
Please remove the duplication with DefaultARMDotprod, to make the difference obvious.
}

template <typename... Args>
static void AgregateImpl(Array<T>& dest) {}  // NOLINT(*)
This edit (NOLINT) was made on the consideration that, quote, the "Google C++ Style Guide seems to have allowed using non-const references as parameters". Reference: https://github.innominds.com/cpplint/cpplint/issues/148
@tvm-bot rerun
@tvm-bot rerun
@masahi, could you review my changes, please? :)
};
}

Array<ScheduleRule> GetDotprodSpecificRules() {
This is specific to the ARM dot product only, so the naming is not the best. I'll merge it for now, but please fix this when you get a chance.
Sorry, I forgot to take another look.
Motivation:
The purpose of this PR is to add support for intrinsics that optimize matrix-multiplication operations (e.g. matmul, convolution) during tuning with MetaSchedule.
Information about PR:
The present PR integrates the existing neon and dotprod (namely, sdot and udot) ARM CPU intrinsics into MetaScheduler, introduces a new "hybrid" dotprod intrinsic ("hdot") working with uint8, uint8 -> int32 data types, and changes the intrinsic selection and application processes for the ARM CPU case, since we operate with multiple intrinsics, rather than with a specific one.