[compute/cker] Optimize BMM for X86 #14238
base: master
Conversation
This commit rewrites the BatchMatMul reference implementation using optimized::Gemm.

ONE-DCO-1.0-Signed-off-by: Tomasz Dolbniak <t.dolbniak@partner.samsung.com>
```cpp
BMMParams(const Shape &lhs_shape, const Shape &rhs_shape)
{
  const Shape extended_lhs_shape = Shape::ExtendedShape(5, lhs_shape);
  const Shape extended_rhs_shape = Shape::ExtendedShape(5, rhs_shape);
```
There are some checks in the code that limit the rank of the incoming tensors to no more than 4D. What is the reasoning behind extending the shape to 5D every time? In all test cases I executed, the first dimension of the 5D extended shape was always equal to 1, but since this is runtime-known information, the compiler could not do anything about the outermost loops below.

In other words, would it be OK to change this code to Shape::ExtendedShape(4, ...), or should 5D be kept because there's a use case that needs it?
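For reference, a minimal standalone sketch of the padding behavior as I understand it, using a plain std::vector instead of cker's Shape (the extended_shape helper below is an illustrative stand-in, not the real API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative stand-in for Shape::ExtendedShape(rank, shape): left-pad the
// dimension list with 1s until it reaches the requested rank.
std::vector<int> extended_shape(std::size_t rank, const std::vector<int> &dims)
{
  assert(dims.size() <= rank);
  std::vector<int> out(rank - dims.size(), 1); // padded leading dims
  out.insert(out.end(), dims.begin(), dims.end());
  return out;
}

int main()
{
  // A 3D [2, 8, 16] input extended to 5D becomes [1, 1, 2, 8, 16]:
  for (int d : extended_shape(5, {2, 8, 16}))
    std::printf("%d ", d);
  std::printf("\n");
}
```

With the rank of inputs limited to 4D, the first dimension of the 5D result is always such a padded 1, which is the loop iteration questioned above.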
I am also curious about the meaning of 5. As you said, we usually work with 4.

I searched the history, and it pointed to this. The code was written by @ragmani, and he answered the following for the same question about 5 years ago:

> BatchMatMul kernel supports rank up to 5.

Why do you want to reduce it to 4? Does your kernel in this PR only support up to 4D? Also, if your code is the more optimized one, why do you want to replace the kernel in the reference code?

In addition, could you please use the full name BatchMatMul instead of BMM? I can hardly imagine that BMM means BatchMatMul.
> Why do you want to reduce it to 4? Does your kernel in this PR only support up to 4D?

It currently works with 5D; I've used the same approach as the reference implementation does. I was just curious about the reason for extending the shapes to 5 dimensions. By reading the code I still could not find a use case where a 5D shape is passed to the kernel itself, so I thought we could remove this outermost loop. Anyway, we can leave it as is and possibly get back to this topic later.

> If your code is the more optimized one, why do you want to replace the kernel in the reference code?

This was just done in the draft to present the overall approach and ask for some guidance. In this case, should I add the branching between the optimized version and the reference version somewhere around here? https://github.com/Samsung/ONE/blob/master/compute/cker/include/cker/operation/BatchMatMul.h#L105
> In this case, should I add the branching between the optimized version and the reference version somewhere around here? https://github.com/Samsung/ONE/blob/master/compute/cker/include/cker/operation/BatchMatMul.h#L105

Yes. In addition, please put the optimized kernel in optimized/BatchMatMul.h, and please use BatchMatMul instead of BMM.
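A toy, self-contained sketch of that dispatch; everything here, including the CKER_X86_PLATFORM guard name and the empty signatures, is an illustrative assumption rather than the actual cker API:

```cpp
#include <cstdio>

namespace reference
{
// Stand-in for the existing loop-nest implementation.
void BatchMatMul() { std::printf("reference kernel\n"); }
} // namespace reference

namespace optimized
{
// Stand-in for the new optimized::Gemm-backed implementation.
void BatchMatMul() { std::printf("Gemm-based kernel\n"); }
} // namespace optimized

// Single entry point that selects the kernel at compile time.
void BatchMatMul()
{
#if defined(CKER_X86_PLATFORM) // assumed platform guard
  optimized::BatchMatMul();    // X86 builds take the Gemm-backed path
#else
  reference::BatchMatMul();    // other targets keep the reference path
#endif
}

int main() { BatchMatMul(); }
```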
> I searched the history, and it pointed to this. The code was written by @ragmani, and he answered the following for the same question about 5 years ago.

@glistening Sorry, I didn't implement the code. I was just adding an assertion for checking the shape.

> By reading the code I still could not find a use case where a 5D shape is passed to the kernel itself, so I thought we could remove this outermost loop. Anyway, we can leave it as is and possibly get back to this topic later.

@tomdol I can't remember the use case. It was too long ago.

@hseok-oh Do you remember the use case where a 5D shape is passed to the kernel?
Maybe TFLite did that. (Please refer to TFLite 2.12.1's tensorflow/lite/kernels/internal/reference/batch_matmul.h.)
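If I read the TFLite reference correctly, the 5D layout is [b0, b1, b2, rows, cols]: three batch dimensions, broadcastable between the two operands, in front of the 2D matrices. A standalone sketch of that loop structure, with broadcasting omitted for brevity and all names illustrative:

```cpp
// Illustrative 5D batch matmul loop nest over shapes [b0, b1, b2, m, k] x
// [b0, b1, b2, k, n] -> [b0, b1, b2, m, n]; broadcasting of the batch dims
// (the usual reason they are kept separate) is left out here.
void batch_mat_mul_5d(const float *lhs, const float *rhs, float *out,
                      int b0, int b1, int b2, int m, int k, int n)
{
  for (int i0 = 0; i0 < b0; ++i0)
    for (int i1 = 0; i1 < b1; ++i1)
      for (int i2 = 0; i2 < b2; ++i2)
      {
        const int batch = (i0 * b1 + i1) * b2 + i2;
        const float *a = lhs + batch * m * k; // [m, k] slice
        const float *b = rhs + batch * k * n; // [k, n] slice
        float *c = out + batch * m * n;       // [m, n] slice
        for (int r = 0; r < m; ++r)
          for (int col = 0; col < n; ++col)
          {
            float acc = 0.f;
            for (int d = 0; d < k; ++d)
              acc += a[r * k + d] * b[d * n + col];
            c[r * n + col] = acc;
          }
      }
}
```

Without broadcasting, the three batch loops collapse into a single loop over b0 * b1 * b2, which is essentially the observation made above.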
@ragmani Could you please have a look and let me know if this is the right way to go for the first step of this op's optimization?
```cpp
lhs_params.cols = bmm_params.lhs_cols;

MatrixParams<float> rhs_params;
rhs_params.order = Order::kRowMajor;
```
```diff
- rhs_params.order = Order::kRowMajor;
+ rhs_params.order = Order::kColMajor;
```

@tomdol Could you please check this? (AFAIU, batch matmul's second input is normally column-major. Let me know if I'm wrong.)
I have some results that I described in #14370, but long story short: this field doesn't matter for Gemm with Eigen, so I believe setting the order field could be removed from this kernel entirely.
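For anyone following along, a standalone illustration of why the order field matters for a generic GEMM in the first place, independent of whether the Eigen-backed path actually reads it: the same buffer interpreted column-major is the transpose of its row-major view.

```cpp
#include <cstdio>

int main()
{
  // Six floats; as a row-major 2x3 matrix A this is [[1, 2, 3], [4, 5, 6]].
  const float buf[6] = {1, 2, 3, 4, 5, 6};

  // Row-major 2x3 view: A(r, c) lives at r * 3 + c.
  std::printf("A(1, 0) = %g\n", buf[1 * 3 + 0]); // 4
  // Column-major 3x2 view B of the same buffer: B(r, c) lives at c * 3 + r.
  std::printf("B(0, 1) = %g\n", buf[1 * 3 + 0]); // 4 -- the same element
  // B(r, c) == A(c, r) for all indices, i.e. the col-major view is A
  // transposed, which is why a GEMM normally needs each operand's order.
}
```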
This draft is a follow-up to #13919 and an attempt to solve #12140, which I took over after Jan.