-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes to make tensorize work. These changes also fix the previously broken test. #3981
Conversation
broken test. Summary: Tensorize was breaking for a few reasons. 1) Assert at: src/op/tensorize.cc:234 CHECK(is_one(e.region[j]->extent)) In some cases this cannot be proven, e.g.: expected shape=[16, 4], given region=[range(min=((ax1.outer*16)/16), ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)), range(min=((k.outer*4)/4), ext=(((((k.outer*4) + 3)/4) + 1) - k.outer)), range(min=0, ext=16), range(min=0, ext=4)] The unprovable one is: ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)). This can be simplified but it is not because to simplify divide, it must prove ax1.outer > 0 and since it is var it cannot. The fix for this to just find all the vars in expr in relace them with some const value. 2) Equivalence between tensorized expr and one being asked to tensorize. For example, the error would be. TVMError: Check failed: Equal(lhs, rhs): Failed to match the compute with TensorIntrin tensor_intrin's declaration provided= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0), intrin= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0) Difference is mainly in the source part: source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))] source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))] This was not being simpifiled due to compute_intrin_iter_space (map for iter var to range) not containing leaf iter vars. 3) Here it fails with: Check failed: is_one(Simplify(value->shape[i])): Argument b_buffer shape mismatch[16, 4] vs [(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer), (((((k.outer*4) + 3)/4) + 1) - k.outer), 16, 4] This is in buffer binding where it thinks expected and buffer bound shape is different. Although if we could simplify expr, this would not be the case. Test Plan: On skylake avx512 machine: python tests/python/contrib/test_gemm_acc16.py Reviewers: Subscribers: Tasks: Tags:
cc: @tqchen. I have attempted a fix here. Please take a look. I am particularly curious if the way I have attempted to simplify index expression is correct? I might have missed something. Thanks! |
I think a better fix would be to make a better use of Analyzer. Because the simplification depends on the context(of var being nonnegative), we could populate the range information of the related axis before we call analyzer->Simplify, this should resolve the problem. https://github.com/dmlc/tvm/blob/master/include/tvm/arithmetic.h#L126 Can you look into if we can take this better approach? |
Sure. Let me look into it. |
cd9a849
to
517bbe9
Compare
@tqchen, I moved to using domain map for itervar everywhere. This should be better right? |
Thanks for the link. I will go through it and provide any feedback if I have. |
src/codegen/build_module.cc
Outdated
@@ -392,7 +392,7 @@ Stmt BuildStmt(Schedule sch, | |||
|
|||
// Phase 1 | |||
stmt = ir::StorageFlatten(stmt, out_binds, 64, | |||
config->instrument_bound_checkers); | |||
config->instrument_bound_checkers, bounds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have to get bounds from the previous scheduling stage, instead, we could get them from the range of the loops an bind these information with analyzer->Bind related https://github.com/dmlc/tvm/blob/master/src/arithmetic/ir_mutator_with_analyzer.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tqchen, ir mutator would imply you want to mutate exprs whereas all we want is the bound information. Can we not just get new bounds after ScheduleOps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that we have bounds information from the new schedule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, i do not mean we should reuse the IR mutator, but simply means we could populate the bounds in the same way as the IRMutatorWithAnalyzer in the visitor.
The main reason to re-populate these bounds is to make sure that StorageFlatten stand on its own as a pass. As these information are already available in the context. So that we can apply the same pass for code that may not have bound information available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tqchen, to make sure, you are suggesting that we populate bounds inside StorageFlatten and now pass it from outside? Sorry to drag this, but just want to make sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, what i mean is to populate the bounds(from the loops) inside the StorageFlatten pass, so the pass itself can be self-contained without relying out the external information :) . Thanks for the clarification questions. It is great that we can do these kind of discussions in public, as the apache way always encourages public discussions and allow other developers to follow the development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tqchen, updated the PR. Hopefully this is akin to what you had in mind :).
statements binds the bound of the analyzer. Later this is used to simplify expressions. Inspired from ir_mutator_with_analyzer Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
517bbe9
to
cb8614a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments, getting close
include/tvm/bounded_analyzer.h
Outdated
} | ||
|
||
/*! \brief internal analyzer field. */ | ||
arith::Analyzer analyzer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make the field protected, analyzer_ to be consistent with google c format
include/tvm/bounded_analyzer.h
Outdated
namespace tvm { | ||
namespace ir { | ||
|
||
class BoundedAnalyzer final : public IRVisitor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about IRVisitorWithAnalyzer to be consistent with IRMutatorWithAnalyzer, let us move the header into src/arithmetic for now as we want to minimize the amount of public headers(which means we promise stable APIs)
for (size_t i = 0; i < e.start; ++i) { | ||
CHECK(is_one(e.region[i]->extent)) | ||
for (size_t j = 0; j < e.start; ++j) { | ||
auto canonical_extent = Simplify(e.region[j]->extent, *compute_intrin_iter_space); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we have a analyzer, we don't have to call Simplify, but instead we can call analyzer_->Simplify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually here, I think it is going to be tricky. I suppose that is the reason dom_map constructed and passed from outside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an alternative, we could create an Analyzer outside, populate the iteration space(via binding) using dom_map, and analyzer->Bind, then pass the analyzer pointer in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good for consolidating all the simplify stuff, however I think that would be a bit of hefty change and I was wondering if we can split it into a different PR.
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
@tqchen, updated PR and left a comment. |
Please look into the CI lint error(missing ASF header) |
more cpplint errors, you should be able to repro locally via make cpplint |
Aah my bad. Let me fix this once and for all. |
7e69ef1
to
5d193cc
Compare
…R_VISITOR_WITH_ANALYZER_H_ Some lint fixes as well.
5d193cc
to
5e40764
Compare
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
@tqchen, seems like it passed all the checks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some final follow up comments to see if we can consolidate a bit more to the analyzer. Thanks for making improvements to the code :) we are getting very close
@@ -128,7 +128,7 @@ void ArgBinder::BindBuffer(const Buffer& arg, | |||
CHECK(fuzzy_match) << "Argument " << arg_name << " size mismatch"; | |||
size_t diff = value->shape.size() - arg->shape.size(); | |||
for (size_t i = 0; i < diff; ++i) { | |||
CHECK(is_one(value->shape[i])) | |||
CHECK(is_one(Simplify(value->shape[i]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this necessary? as most cases we eagerly simplify the constants, while calling simplify is certainly more powerful, it would be great if we can move such simplification to the place that produces this expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I did hit into this assert, thats why the change else I dont try to eagerly fix things which are not a problem. Do you want me to reproduce the error and paste here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, we can do that as a followup PR if you can dig a bit further.
for (size_t i = 0; i < e.start; ++i) { | ||
CHECK(is_one(e.region[i]->extent)) | ||
for (size_t j = 0; j < e.start; ++j) { | ||
auto canonical_extent = Simplify(e.region[j]->extent, *compute_intrin_iter_space); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an alternative, we could create an Analyzer outside, populate the iteration space(via binding) using dom_map, and analyzer->Bind, then pass the analyzer pointer in.
src/pass/storage_flatten.cc
Outdated
* However there is no copy/move operator on analyzer that safely copies | ||
* or moves data. Perhaps we should disable copy operator and implement | ||
* move operator. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tqchen, btw in case you missed this comment of mine here, I think we would want to file task/issue to fix Analyzer so as to not have to resort to shared_ptr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IRVisitorWithAnalyzer analyzer
StorageFlattener(..., &analyzer,).Mutate()
Can we simply pass in raw pointer of IRVisitorWithAnalyzer* into StorageFlattener? given that we do not need to share the life-cycles between here and the flattener.
disable copy constructor sounds good
@tqchen, I responded to your comments. Let me know what you think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can defer the rest to a followup PR, let us change the shared_ptr
src/pass/storage_flatten.cc
Outdated
* However there is no copy/move operator on analyzer that safely copies | ||
* or moves data. Perhaps we should disable copy operator and implement | ||
* move operator. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IRVisitorWithAnalyzer analyzer
StorageFlattener(..., &analyzer,).Mutate()
Can we simply pass in raw pointer of IRVisitorWithAnalyzer* into StorageFlattener? given that we do not need to share the life-cycles between here and the flattener.
disable copy constructor sounds good
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
49abeb5
to
05ac0fd
Compare
@tqchen, made the requested changes, but note that using raw pointer is little risky. We have to guarantee that it is not forwarded to someone else whose lifetime extend the object to which raw pointer is passed. |
yap, the use of raw pointer is fine as long as we do not use it to retain ownerships and extend the life-cycle(treat it as a weakref), which is our case |
@tqchen, can you merge this please? |
Thanks @kimishpatel ! |
… broken test. (apache#3981) * Changes to make tensorize work. These changes also fix the previously broken test. Summary: Tensorize was breaking for a few reasons. 1) Assert at: src/op/tensorize.cc:234 CHECK(is_one(e.region[j]->extent)) In some cases this cannot be proven, e.g.: expected shape=[16, 4], given region=[range(min=((ax1.outer*16)/16), ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)), range(min=((k.outer*4)/4), ext=(((((k.outer*4) + 3)/4) + 1) - k.outer)), range(min=0, ext=16), range(min=0, ext=4)] The unprovable one is: ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)). This can be simplified but it is not because to simplify divide, it must prove ax1.outer > 0 and since it is var it cannot. The fix for this to just find all the vars in expr in relace them with some const value. 2) Equivalence between tensorized expr and one being asked to tensorize. For example, the error would be. TVMError: Check failed: Equal(lhs, rhs): Failed to match the compute with TensorIntrin tensor_intrin's declaration provided= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0), intrin= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0) Difference is mainly in the source part: source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))] source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))] This was not being simpifiled due to compute_intrin_iter_space (map for iter var to range) not containing leaf iter vars. 3) Here it fails with: Check failed: is_one(Simplify(value->shape[i])): Argument b_buffer shape mismatch[16, 4] vs [(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer), (((((k.outer*4) + 3)/4) + 1) - k.outer), 16, 4] This is in buffer binding where it thinks expected and buffer bound shape is different. Although if we could simplify expr, this would not be the case. Test Plan: On skylake avx512 machine: python tests/python/contrib/test_gemm_acc16.py Reviewers: Subscribers: Tasks: Tags: * Implemented bounded analyzer which traverses tree and for reduce/for statements binds the bound of the analyzer. Later this is used to simplify expressions. Inspired from ir_mutator_with_analyzer Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Addressed comments. Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Added ASF header + define macro for the header file: TVM_ARITHMETIC_IR_VISITOR_WITH_ANALYZER_H_ Some lint fixes as well. * Relax the assumption that dom_map must always contain all leaf itervars. Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Disable copy constructor and move to raw ptr. Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
… broken test. (apache#3981) * Changes to make tensorize work. These changes also fix the previously broken test. Summary: Tensorize was breaking for a few reasons. 1) Assert at: src/op/tensorize.cc:234 CHECK(is_one(e.region[j]->extent)) In some cases this cannot be proven, e.g.: expected shape=[16, 4], given region=[range(min=((ax1.outer*16)/16), ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)), range(min=((k.outer*4)/4), ext=(((((k.outer*4) + 3)/4) + 1) - k.outer)), range(min=0, ext=16), range(min=0, ext=4)] The unprovable one is: ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)). This can be simplified but it is not because to simplify divide, it must prove ax1.outer > 0 and since it is var it cannot. The fix for this to just find all the vars in expr in relace them with some const value. 2) Equivalence between tensorized expr and one being asked to tensorize. For example, the error would be. TVMError: Check failed: Equal(lhs, rhs): Failed to match the compute with TensorIntrin tensor_intrin's declaration provided= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0), intrin= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0) Difference is mainly in the source part: source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))] source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))] This was not being simpifiled due to compute_intrin_iter_space (map for iter var to range) not containing leaf iter vars. 3) Here it fails with: Check failed: is_one(Simplify(value->shape[i])): Argument b_buffer shape mismatch[16, 4] vs [(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer), (((((k.outer*4) + 3)/4) + 1) - k.outer), 16, 4] This is in buffer binding where it thinks expected and buffer bound shape is different. Although if we could simplify expr, this would not be the case. Test Plan: On skylake avx512 machine: python tests/python/contrib/test_gemm_acc16.py Reviewers: Subscribers: Tasks: Tags: * Implemented bounded analyzer which traverses tree and for reduce/for statements binds the bound of the analyzer. Later this is used to simplify expressions. Inspired from ir_mutator_with_analyzer Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Addressed comments. Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Added ASF header + define macro for the header file: TVM_ARITHMETIC_IR_VISITOR_WITH_ANALYZER_H_ Some lint fixes as well. * Relax the assumption that dom_map must always contain all leaf itervars. Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Disable copy constructor and move to raw ptr. Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
… broken test. (apache#3981) * Changes to make tensorize work. These changes also fix the previously broken test. Summary: Tensorize was breaking for a few reasons. 1) Assert at: src/op/tensorize.cc:234 CHECK(is_one(e.region[j]->extent)) In some cases this cannot be proven, e.g.: expected shape=[16, 4], given region=[range(min=((ax1.outer*16)/16), ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)), range(min=((k.outer*4)/4), ext=(((((k.outer*4) + 3)/4) + 1) - k.outer)), range(min=0, ext=16), range(min=0, ext=4)] The unprovable one is: ext=(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer)). This can be simplified but it is not because to simplify divide, it must prove ax1.outer > 0 and since it is var it cannot. The fix for this to just find all the vars in expr in relace them with some const value. 2) Equivalence between tensorized expr and one being asked to tensorize. For example, the error would be. TVMError: Check failed: Equal(lhs, rhs): Failed to match the compute with TensorIntrin tensor_intrin's declaration provided= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0), intrin= reduce(combiner=comm_reducer(result=[(x + y)], lhs=[x], rhs=[y], identity_element=[(int16)0]), source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))], where=(bool)1, value_index=0) Difference is mainly in the source part: source=[(int16(data(k))*int16(kernel(((((((((k.outer.outer*64) + (k.outer.inner*2)) + k)/2)*128) + i) - (k.outer.inner*128)) - (k.outer.outer*4096)), ((((k.outer.outer*64) + (k.outer.inner*2)) + k) % 2))))] source=[(int16(data(k))*int16(kernel(i, k)))], axis=[iter_var(k, range(min=0, ext=2))] This was not being simpifiled due to compute_intrin_iter_space (map for iter var to range) not containing leaf iter vars. 3) Here it fails with: Check failed: is_one(Simplify(value->shape[i])): Argument b_buffer shape mismatch[16, 4] vs [(((((ax1.outer*16) + 15)/16) + 1) - ax1.outer), (((((k.outer*4) + 3)/4) + 1) - k.outer), 16, 4] This is in buffer binding where it thinks expected and buffer bound shape is different. Although if we could simplify expr, this would not be the case. Test Plan: On skylake avx512 machine: python tests/python/contrib/test_gemm_acc16.py Reviewers: Subscribers: Tasks: Tags: * Implemented bounded analyzer which traverses tree and for reduce/for statements binds the bound of the analyzer. Later this is used to simplify expressions. Inspired from ir_mutator_with_analyzer Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Addressed comments. Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Added ASF header + define macro for the header file: TVM_ARITHMETIC_IR_VISITOR_WITH_ANALYZER_H_ Some lint fixes as well. * Relax the assumption that dom_map must always contain all leaf itervars. Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: * Disable copy constructor and move to raw ptr. Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
Summary:
Tensorize was breaking for a few reasons.
1)
Assert at: src/op/tensorize.cc:234 CHECK(is_one(e.region[j]->extent))
In some cases this cannot be proven, e.g.:
This can be simplified but it is not because to simplify divide, it must
prove ax1.outer > 0 and since it is var it cannot. The fix for this to
just find all the vars in expr in relace them with some const value.
the error would be.
Difference is mainly in the source part:
This was not being simpifiled due to compute_intrin_iter_space (map for
iter var to range) not containing leaf iter vars.
This is in buffer binding where it thinks expected and buffer bound
shape is different. Although if we could simplify expr, this would not
be the case.
Test Plan:
On skylake avx512 machine:
python tests/python/contrib/test_gemm_acc16.py
Reviewers:
Subscribers:
Tasks:
Tags:
Thanks for contributing to TVM! Please refer to guideline https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers.