
Blocked memory format tag (any vs BA16a64b4a) and kernel selection with INT8 #2730

Open
SriAlavandar opened this issue Feb 21, 2025 · 7 comments

@SriAlavandar

I am trying to run a standalone matmul with a blocked format and the INT8 data type.

Here is the configuration I am running with:

  • M=700, N=1024, K=512
  • dtype=u8:s8:u8
  • scales and zero points.

Case 1: I create the memory descriptor for the B matrix with the memory format tag any (tag::any).

Here is the log that we observe:

onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src_u8::blocked:ab:f0 wei_s8::blocked:BA16a64b4a:f8:zpm2 bia_f32::blocked:ab:f0_mask2 dst_u8::blocked:ab:f0,attr-zero-points:src:0:+dst:0: ,700x1024:1024x512:700x512,2.63306,ms

Case 2: I create the memory descriptor for the B matrix with a static format tag (tag::BA16a64b4a).

onednn_verbose,v1,primitive,exec,cpu,matmul,ref_int8:any,undef,src_u8::blocked:ab:f0 wei_s8::blocked:BA16a64b4a:f0 bia_f32::blocked:ab:f0_mask2 dst_u8::blocked:ab:f0,attr-oscale:2 attr-zero-points:src:0:+dst:0: ,700x1024:1024x512:700x512,5834.56,ms

Here are the differences we can observe from the above experiments:

  1. With tag::any, the blocked memory format tag used for execution is wei_s8::blocked:BA16a64b4a:f8:zpm2, whereas with tag::BA16a64b4a the weights memory tag used for execution is wei_s8::blocked:BA16a64b4a:f0.
  2. tag::any directs the computation to the brg_matmul:avx512_core_vnni kernel, whereas the static tag (tag::BA16a64b4a) directs it to the ref_int8 kernel.

Questions:

  1. We can see differences in the generated tags (blocked:BA16a64b4a:f0 vs blocked:BA16a64b4a:f8:zpm2). What do f0, f8, and zpm2 represent here?
  2. When we explicitly set the weights tag to BA16a64b4a, why does execution fall back to the reference kernel?
  3. Is it possible to specify f8 and zpm2 during memory descriptor creation (along with tag::BA16a64b4a)?
@kminemur

kminemur commented Feb 25, 2025

Hi @SriAlavandar

Thank you for reaching out to us.

Let me answer your questions one by one.

For Case 1, according to the flag definitions (https://github.com/oneapi-src/oneDNN/blob/main/src/common/verbose.cpp#L386-L412), these are memory extra flags.

f# shows the extra flags for the dnnl memory format:

  • f0: no extra flags
  • f8: flag for compensation for convolution with asymmetric src
  • zpm2: 2-dimensional zero-point mask for that compensation
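As an illustration of how these suffixes compose in the verbose output, here is a small hypothetical parser for the extra field of a descriptor string such as "f8:zpm2" (the flag constant and its value are assumptions based on the verbose.cpp definitions linked above; this helper is not part of oneDNN):

```python
# Hypothetical decoder for the verbose extra-flag suffix of a oneDNN memory
# descriptor (e.g. "f8:zpm2"); illustration only, not part of oneDNN.

# Assumed flag value, mirroring compensation_conv_asymmetric_src in verbose.cpp.
COMPENSATION_CONV_ASYMMETRIC_SRC = 0x8

def decode_extra(suffix):
    """Split 'f<flags>[:zpm<mask>]' into an (extra flags, zero-point mask) pair."""
    flags, zp_mask = 0, None
    for part in suffix.split(":"):
        if part.startswith("zpm"):
            zp_mask = int(part[3:])   # e.g. zpm2 -> mask 2 (per-channel)
        elif part.startswith("f"):
            flags = int(part[1:])     # e.g. f8 -> extra-flags value 8

    return flags, zp_mask

flags, zp_mask = decode_extra("f8:zpm2")
# Bit 0x8 set means an asymmetric-src compensation buffer is attached.
has_compensation = bool(flags & COMPENSATION_CONV_ASYMMETRIC_SRC)
```

With "f8:zpm2" this yields flags 8 and mask 2; with "f0" it yields no extra flags and no zero-point mask.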

This memory layout (wei_s8::blocked:BA16a64b4a:f8:zpm2) looks like it comes from the user's definition.

Are you using a user-defined operation for matmul?
Could you share your benchdnn command for matmul?

@kminemur

[Update] I have tested benchdnn matmul without a zero-point option, and no issue comes up.

DNNL_VERBOSE=1 ./tests/benchdnn/benchdnn --matmul --wtag=any --dt=u8:s8:u8 700x512:512x1024

onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src:u8:a:blocked:ab::f0 wei:s8:a:blocked:BA16a64b4a::f0 dst:u8:a:blocked:ab::f0,,,700x512:512x1024,0.60791

DNNL_VERBOSE=1 ./tests/benchdnn/benchdnn --matmul --wtag=BA16a64b4a --dt=u8:s8:u8 700x512:512x1024
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src:u8:a:blocked:ab::f0 wei:s8::blocked:BA16a64b4a::f0 dst:u8:a:blocked:ab::f0,,,700x512:512x1024,0.850098

@kazukiminemura

Hi @SriAlavandar

Could you provide us with steps to reproduce your issue?

@dzarukin
Contributor

dzarukin commented Mar 7, 2025

@SriAlavandar

  1. We can see differences in the generated tags (blocked:BA16a64b4a:f0 vs blocked:BA16a64b4a:f8:zpm2). What do f0, f8, and zpm2 represent here?

f8:zpm2 denotes service information. f8 means the extra-flag value is set to 8. zpm2 means the zero-point mask is set to 2, i.e. per-channel. This indicates that a compensation buffer will be created upon reorder and additional accumulation will be done inside the matmul implementation.

  2. When we explicitly set the weights tag to BA16a64b4a, why does execution fall back to the reference kernel?

The difference is which side controls the amount of memory to allocate. When the user passes any, they let the library decide: the library identifies the need for a special buffer and, through the memory descriptor object, tells the memory constructor how much memory must be allocated. The pre-computed compensation buffer becomes part of that allocation. When the user forces the format but the matmul problem still expects a zero-point compensation buffer, the library simply has no instrument to apply that compensation on its side.
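The size bookkeeping described here can be modeled abstractly. In the following toy sketch (hypothetical names and sizes, not the oneDNN API), a descriptor created with tag any reports extra room for a per-N s32 compensation buffer, while a user-forced plain tag reports only the raw payload:

```python
# Toy model of the allocation-size bookkeeping described above.
# Hypothetical function and sizes for illustration only, not oneDNN code.

DTYPE_SIZE = 1  # s8 weights occupy one byte per element

def md_size(K, N, tag, needs_src_zp_compensation=False):
    """Bytes a user must allocate for a K x N s8 weights buffer."""
    payload = K * N * DTYPE_SIZE
    # With tag "any" the library may append a per-N s32 compensation region
    # (assumed 4 bytes per output channel in this toy model).
    extra = 4 * N if (tag == "any" and needs_src_zp_compensation) else 0
    return payload + extra

# User lets the library decide: the size includes the compensation region.
sz_any = md_size(512, 1024, "any", needs_src_zp_compensation=True)
# User forces the plain blocked tag: only the raw payload is allocated,
# so there is nowhere to store the pre-computed compensation.
sz_fixed = md_size(512, 1024, "BA16a64b4a", needs_src_zp_compensation=True)
```

The point of the sketch is only that querying the size of a library-initialized descriptor transparently reserves the extra space, which a hard-coded allocation cannot anticipate.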

You can read this behavior as "oneDNN can't produce an optimized version of the matmul algorithm with a src zero point other than through this special trick". Technically we could, but performance-wise it would make no sense, as it would require upcasting the int8 buffers to s32, and the whole point of int8 would evaporate.

The only option you have in such a case, if you really want to force that format on B, is to make a per-K reduction of B (i.e. sum over all K values, yielding N values total), scale it by the src zero point, and apply that result through a binary post-op to the original matmul output while dropping the src zero point from the primitive. That gives an identical outcome (subject to some data type compliance).
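As a sanity check of the identity behind this workaround, here is a plain-Python sketch (toy matrices, not oneDNN API code) showing that a plain matmul minus zp_src times the per-K column sums of B equals the matmul with the zero point subtracted from the source:

```python
# Identity behind the workaround: (A - zp_src) @ B == A @ B - zp_src * colsum(B),
# where colsum(B)[n] = sum over k of B[k][n]. Toy integers, no oneDNN involved.

def matmul(A, B):
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[m][k] * B[k][n] for k in range(K)) for n in range(N)]
            for m in range(M)]

A = [[3, 1], [0, 2]]   # u8-like source values
B = [[2, -1], [4, 5]]  # s8-like weights
zp_src = 1             # common source zero point

# Reference: subtract the zero point from the source before the matmul.
A_shifted = [[a - zp_src for a in row] for row in A]
ref = matmul(A_shifted, B)

# Workaround: plain matmul, then subtract zp_src * (per-K column sums of B),
# which is what a binary post-op on the output would apply.
comp = [zp_src * sum(B[k][n] for k in range(len(B))) for n in range(len(B[0]))]
out = [[v - comp[n] for n, v in enumerate(row)] for row in matmul(A, B)]

assert out == ref
```

The per-K reduction of B has N values, so the correction is a per-output-channel subtraction, matching the per-channel (zpm2) compensation the library computes internally.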

  3. Is it possible to specify f8 and zpm2 during memory descriptor creation (along with tag::BA16a64b4a)?

No. Only the library can initialize memory descriptors that way. The recommendation is provided above.

Feel free to follow up. Thank you.

@SriAlavandar
Author

Thanks for the response

Hi @kminemur @kazukiminemura

Yes, we do not observe this behavior with benchdnn as run above because no zero point is passed.
I used a standalone program similar to this example. By changing the weights tag from any to BA16a64b4a, we can observe the behavior described above.

With benchdnn, do we have an option to pass zero points and scales when dealing with INT8?

@dzarukin
Contributor

dzarukin commented Mar 7, 2025

There's a converter from verbose log to benchdnn.

This is the benchdnn line for the main branch for the verbose output you shared:

--dt=u8:s8:u8 --stag=ab --dtag=ab --bia-dt=f32 --attr-zero-points=src:common:1+dst:common:1 700x1024:1024x512

Though I'm not sure which version you are on, or whether you modified the line before posting it, the one posted is definitely not from a fresh build...

@kminemur

kminemur commented Mar 7, 2025

Thanks @dzarukin

Hi @SriAlavandar
I confirmed that the reference kernel is selected with "--wtag=BA16a64b4a".
Here are my full commands with oneDNN v3.7.0 (commit 862289d):

DNNL_VERBOSE=1 ./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --stag=ab --dtag=ab --wtag=any --bia_dt=f32 --attr-scales=src:common:0+wei:common:2+dst:common:0 --attr-zero-points=src:common:1+dst:common:1 700x1024:1024x512 | grep matmul
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src:u8::blocked:ab::f0 wei:s8:a:blocked:BA16a64b4a::f8:zpm2 bia:f32:a:blocked:ab::f0_mask2 dst:u8::blocked:ab::f0,attr-scales:src0:0:f32+dst:0:f32+wei:0:f32 attr-zero-points:src0:0:s32+dst:0:s32,,700x1024:1024x512,4.93604
0:PASSED __REPRO: --matmul --dt=u8:s8:u8 --stag=ab --dtag=ab --bia_dt=f32 --attr-scales=src:common:0+dst:common:0+wei:common:2 --attr-zero-points=src:common:1+dst:common:1 700x1024:1024x512

DNNL_VERBOSE=1 ./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --stag=ab --dtag=ab --wtag=BA16a64b4a --bia_dt=f32 --attr-scales=src:common:0+wei:common:2+dst:common:0 --attr-zero-points=src:common:1+dst:common:1 700x1024:1024x512 | grep matmul
onednn_verbose,v1,primitive,exec,cpu,matmul,ref_int8:any,undef,src:u8::blocked:ab::f0 wei:s8::blocked:BA16a64b4a::f0 bia:f32:a:blocked:ab::f0_mask2 dst:u8::blocked:ab::f0,attr-scales:src0:0:f32+dst:0:f32+wei:0:f32 attr-zero-points:src0:0:s32+dst:0:s32,,700x1024:1024x512,1860.15
0:PASSED __REPRO: --matmul --dt=u8:s8:u8 --stag=ab --wtag=BA16a64b4a --dtag=ab --bia_dt=f32 --attr-scales=src:common:0+dst:common:0+wei:common:2 --attr-zero-points=src:common:1+dst:common:1 700x1024:1024x512
