[compute/cker] Optimize BMM for X86 #14238
base: master
Conversation
This commit rewrites the BatchMatMul reference implementation using optimized::Gemm.

ONE-DCO-1.0-Signed-off-by: Tomasz Dolbniak <t.dolbniak@partner.samsung.com>
```cpp
BMMParams(const Shape &lhs_shape, const Shape &rhs_shape)
{
  const Shape extended_lhs_shape = Shape::ExtendedShape(5, lhs_shape);
  const Shape extended_rhs_shape = Shape::ExtendedShape(5, rhs_shape);
```
There are some checks in the code that limit the rank of the incoming tensors to no more than 4D. What is the reasoning behind extending the shape to 5D every time? In all test cases I executed, the first dimension of the 5D extended shape was always equal to 1, but since this is runtime-known information, the compiler could not do anything about the outermost loops below.

In other words, would it be OK to change this code to Shape::ExtendedShape(4, ...), or should 5D be kept because there's a use case that needs it?
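For reference, a minimal standalone sketch of the padding behavior as I understand it, using a plain std::vector instead of cker's Shape (the extended_shape helper below is an illustrative stand-in, not the real API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative stand-in for Shape::ExtendedShape(rank, shape): left-pad the
// dimension list with 1s until it reaches the requested rank.
std::vector<int> extended_shape(std::size_t rank, const std::vector<int> &dims)
{
  assert(dims.size() <= rank);
  std::vector<int> out(rank - dims.size(), 1); // padded leading dims
  out.insert(out.end(), dims.begin(), dims.end());
  return out;
}

int main()
{
  // A 3D [2, 8, 16] input extended to 5D becomes [1, 1, 2, 8, 16]:
  for (int d : extended_shape(5, {2, 8, 16}))
    std::printf("%d ", d);
  std::printf("\n");
}
```

With the rank of inputs limited to 4D, the first dimension of the 5D result is always such a padded 1, which is the loop iteration questioned above.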
I am also curious about the meaning of 5. As you said, we usually work with 4.

I searched the history, and it pointed to this. The code was written by @ragmani, and he answered the following for the same question about 5 years ago:

> BatchMatMul kernel supports rank up to 5.

Why do you want to reduce it to 4? Does your kernel in this PR only support up to 4D? Also, if your code is the more optimized one, why do you want to replace the kernel in the reference code?

In addition, could you please use the full name BatchMatMul instead of BMM? I can hardly imagine that BMM means BatchMatMul.
> Why do you want to reduce it to 4? Does your kernel in this PR only support up to 4D?

It currently works with 5D; I've used the same approach as the reference implementation does. I was just curious about the reason for extending the shapes to 5 dimensions. By reading the code I still could not find a use case where a 5D shape is passed to the kernel itself, so I thought we could remove this outermost loop. Anyway, we can leave it as is and possibly get back to this topic later.

> If your code is the more optimized one, why do you want to replace the kernel in the reference code?

This was just done in the draft to present the overall approach and ask for some guidance. In this case, should I add the branching between the optimized version and the reference version somewhere around here? https://github.com/Samsung/ONE/blob/master/compute/cker/include/cker/operation/BatchMatMul.h#L105
> In this case, should I add the branching between the optimized version and the reference version somewhere around here? https://github.com/Samsung/ONE/blob/master/compute/cker/include/cker/operation/BatchMatMul.h#L105

Yes. In addition, please put the optimized kernel in optimized/BatchMatMul.h, and please use BatchMatMul instead of BMM.
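A toy, self-contained sketch of that dispatch; everything here, including the CKER_X86_PLATFORM guard name and the empty signatures, is an illustrative assumption rather than the actual cker API:

```cpp
#include <cstdio>

namespace reference
{
// Stand-in for the existing loop-nest implementation.
void BatchMatMul() { std::printf("reference kernel\n"); }
} // namespace reference

namespace optimized
{
// Stand-in for the new optimized::Gemm-backed implementation.
void BatchMatMul() { std::printf("Gemm-based kernel\n"); }
} // namespace optimized

// Single entry point that selects the kernel at compile time.
void BatchMatMul()
{
#if defined(CKER_X86_PLATFORM) // assumed platform guard
  optimized::BatchMatMul();    // X86 builds take the Gemm-backed path
#else
  reference::BatchMatMul();    // other targets keep the reference path
#endif
}

int main() { BatchMatMul(); }
```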
> I searched the history, and it pointed to this. The code was written by @ragmani, and he answered the following for the same question about 5 years ago.

@glistening Sorry, I didn't implement the code. I was just adding an assertion for checking the shape.

> By reading the code I still could not find a use case where a 5D shape is passed to the kernel itself, so I thought we could remove this outermost loop. Anyway, we can leave it as is and possibly get back to this topic later.

@tomdol I can't remember the use case. It was too long ago.

@hseok-oh Do you remember the use case where a 5D shape is passed to the kernel?
Maybe TFLite did that. (Please refer to TFLite 2.12.1's tensorflow/lite/kernels/internal/reference/batch_matmul.h.)
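If I read the TFLite reference correctly, the 5D layout is [b0, b1, b2, rows, cols]: three batch dimensions, broadcastable between the two operands, in front of the 2D matrices. A standalone sketch of that loop structure, with broadcasting omitted for brevity and all names illustrative:

```cpp
// Illustrative 5D batch matmul loop nest over shapes [b0, b1, b2, m, k] x
// [b0, b1, b2, k, n] -> [b0, b1, b2, m, n]; broadcasting of the batch dims
// (the usual reason they are kept separate) is left out here.
void batch_mat_mul_5d(const float *lhs, const float *rhs, float *out,
                      int b0, int b1, int b2, int m, int k, int n)
{
  for (int i0 = 0; i0 < b0; ++i0)
    for (int i1 = 0; i1 < b1; ++i1)
      for (int i2 = 0; i2 < b2; ++i2)
      {
        const int batch = (i0 * b1 + i1) * b2 + i2;
        const float *a = lhs + batch * m * k; // [m, k] slice
        const float *b = rhs + batch * k * n; // [k, n] slice
        float *c = out + batch * m * n;       // [m, n] slice
        for (int r = 0; r < m; ++r)
          for (int col = 0; col < n; ++col)
          {
            float acc = 0.f;
            for (int d = 0; d < k; ++d)
              acc += a[r * k + d] * b[d * n + col];
            c[r * n + col] = acc;
          }
      }
}
```

Without broadcasting, the three batch loops collapse into a single loop over b0 * b1 * b2, which is essentially the observation made above.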
@ragmani Could you please have a look and let me know if this is the right way to go for the first step of this op's optimization?
```cpp
lhs_params.cols = bmm_params.lhs_cols;

MatrixParams<float> rhs_params;
rhs_params.order = Order::kRowMajor;
```
```diff
- rhs_params.order = Order::kRowMajor;
+ rhs_params.order = Order::kColMajor;
```

@tomdol Could you please check this? (AFAIU, batch matmul's second input is normally column-major. Let me know if I'm wrong.)
I have some results that I described in #14370, but long story short: this field doesn't matter for Gemm with Eigen, so I believe setting the order field could be removed from this kernel entirely.
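For anyone following along, a standalone illustration of why the order field matters for a generic GEMM in the first place, independent of whether the Eigen-backed path actually reads it: the same buffer interpreted column-major is the transpose of its row-major view.

```cpp
#include <cstdio>

int main()
{
  // Six floats; as a row-major 2x3 matrix A this is [[1, 2, 3], [4, 5, 6]].
  const float buf[6] = {1, 2, 3, 4, 5, 6};

  // Row-major 2x3 view: A(r, c) lives at r * 3 + c.
  std::printf("A(1, 0) = %g\n", buf[1 * 3 + 0]); // 4
  // Column-major 3x2 view B of the same buffer: B(r, c) lives at c * 3 + r.
  std::printf("B(0, 1) = %g\n", buf[1 * 3 + 0]); // 4 -- the same element
  // B(r, c) == A(c, r) for all indices, i.e. the col-major view is A
  // transposed, which is why a GEMM normally needs each operand's order.
}
```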
This draft is a follow-up to #13919 and an attempt to solve #12140, which I took over after Jan.