
Blocked memory format tag (any vs BA16a64b4a) and kernel selection with INT8 #2730

Open
SriAlavandar opened this issue Feb 21, 2025 · 7 comments

@SriAlavandar

I am trying to run a standalone matmul with a blocked format and the INT8 data type.

Here is the configuration I am running with:

  • M=700, N=1024, K=512
  • dtype=u8:s8:u8
  • scales and zero points.

Case 1: I create the memory descriptor for the B matrix with the memory format tag any (tag::any).

Here is the log that we observe:

onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src_u8::blocked:ab:f0 wei_s8::blocked:BA16a64b4a:f8:zpm2 bia_f32::blocked:ab:f0_mask2 dst_u8::blocked:ab:f0,attr-zero-points:src:0:+dst:0: ,700x1024:1024x512:700x512,2.63306,ms

Case 2: I create the memory descriptor for the B matrix with a static format tag (tag::BA16a64b4a).

onednn_verbose,v1,primitive,exec,cpu,matmul,ref_int8:any,undef,src_u8::blocked:ab:f0 wei_s8::blocked:BA16a64b4a:f0 bia_f32::blocked:ab:f0_mask2 dst_u8::blocked:ab:f0,attr-oscale:2 attr-zero-points:src:0:+dst:0: ,700x1024:1024x512:700x512,5834.56,ms

Here are the differences we can observe from the above experiments:

  1. With tag::any, the blocked memory format tag used for execution is wei_s8::blocked:BA16a64b4a:f8:zpm2, whereas with tag::BA16a64b4a the weights memory tag used for execution is wei_s8::blocked:BA16a64b4a:f0.
  2. tag::any directs the computation to the brg_matmul:avx512_core_vnni kernel, whereas the static tag (tag::BA16a64b4a) directs it to the ref_int8 kernel.

Questions:

  1. We can see differences in the generated tags (blocked:BA16a64b4a:f0 vs blocked:BA16a64b4a:f8:zpm2). What do f0, f8, and zpm2 represent here?
  2. When we explicitly set the weights tag to BA16a64b4a, why does execution fall back to the reference kernel?
  3. Is it possible to specify f8 and zpm2 during memory descriptor creation (along with tag::BA16a64b4a)?
@kminemur

kminemur commented Feb 25, 2025

Hi @SriAlavandar

Thank you for reaching out to us.

Let me answer your questions one by one.

For Case 1, according to the flag definitions (https://github.com/oneapi-src/oneDNN/blob/main/src/common/verbose.cpp#L386-L412), these are memory extra flags.

f# shows the extra flags for the dnnl memory format:

  • f0: no extra flags
  • f8: flag for compensation for convolution with asymmetric src
  • zpm2: 2-dimensional zero-point mask for that compensation
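As an illustration of how these suffixes compose in the verbose output, here is a small hypothetical parser for the extra field of a descriptor string such as "f8:zpm2" (the flag constant and its value are assumptions based on the verbose.cpp definitions linked above; this helper is not part of oneDNN):

```python
# Hypothetical decoder for the verbose extra-flag suffix of a oneDNN memory
# descriptor (e.g. "f8:zpm2"); illustration only, not part of oneDNN.

# Assumed flag value, mirroring compensation_conv_asymmetric_src in verbose.cpp.
COMPENSATION_CONV_ASYMMETRIC_SRC = 0x8

def decode_extra(suffix):
    """Split 'f<flags>[:zpm<mask>]' into an (extra flags, zero-point mask) pair."""
    flags, zp_mask = 0, None
    for part in suffix.split(":"):
        if part.startswith("zpm"):
            zp_mask = int(part[3:])   # e.g. zpm2 -> mask 2 (per-channel)
        elif part.startswith("f"):
            flags = int(part[1:])     # e.g. f8 -> extra-flags value 8

    return flags, zp_mask

flags, zp_mask = decode_extra("f8:zpm2")
# Bit 0x8 set means an asymmetric-src compensation buffer is attached.
has_compensation = bool(flags & COMPENSATION_CONV_ASYMMETRIC_SRC)
```

With "f8:zpm2" this yields flags 8 and mask 2; with "f0" it yields no extra flags and no zero-point mask.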

This memory layout (wei_s8::blocked:BA16a64b4a:f8:zpm2) looks like it comes from the user's definition.

Are you using a user-defined operation for matmul?
Could you share your benchdnn command for matmul?

@kminemur

[Update] I have tested benchdnn matmul without a zero-point option, and no issue comes up.

DNNL_VERBOSE=1 ./tests/benchdnn/benchdnn --matmul --wtag=any --dt=u8:s8:u8 700x512:512x1024

onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src:u8:a:blocked:ab::f0 wei:s8:a:blocked:BA16a64b4a::f0 dst:u8:a:blocked:ab::f0,,,700x512:512x1024,0.60791

DNNL_VERBOSE=1 ./tests/benchdnn/benchdnn --matmul --wtag=BA16a64b4a --dt=u8:s8:u8 700x512:512x1024
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src:u8:a:blocked:ab::f0 wei:s8::blocked:BA16a64b4a::f0 dst:u8:a:blocked:ab::f0,,,700x512:512x1024,0.850098

@kazukiminemura

Hi @SriAlavandar

Could you provide us with steps to reproduce your issue?

@dzarukin
Contributor

dzarukin commented Mar 7, 2025

@SriAlavandar

  1. We can see differences in the generated tags (blocked:BA16a64b4a:f0 vs blocked:BA16a64b4a:f8:zpm2). What do f0, f8, and zpm2 represent here?

f8:zpm2 denotes service information. f8 means the extra-flag value is set to 8. zpm2 means the zero-point mask is set to 2, i.e. per-channel. This indicates that a compensation buffer will be created upon reorder and additional accumulation will be done inside the matmul implementation.

  2. When we explicitly set the weights tag to BA16a64b4a, why does execution fall back to the reference kernel?

The difference is which side controls the amount of memory to allocate. When the user passes any, they let the library decide: the library identifies the need for a special buffer and, through the memory descriptor object, tells the memory constructor how much memory must be allocated. The pre-computed compensation buffer becomes part of that allocation. When the user forces the format but the matmul problem still expects a zero-point compensation buffer, the library simply has no instrument to apply that compensation on its side.
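The size bookkeeping described here can be modeled abstractly. In the following toy sketch (hypothetical names and sizes, not the oneDNN API), a descriptor created with tag any reports extra room for a per-N s32 compensation buffer, while a user-forced plain tag reports only the raw payload:

```python
# Toy model of the allocation-size bookkeeping described above.
# Hypothetical function and sizes for illustration only, not oneDNN code.

DTYPE_SIZE = 1  # s8 weights occupy one byte per element

def md_size(K, N, tag, needs_src_zp_compensation=False):
    """Bytes a user must allocate for a K x N s8 weights buffer."""
    payload = K * N * DTYPE_SIZE
    # With tag "any" the library may append a per-N s32 compensation region
    # (assumed 4 bytes per output channel in this toy model).
    extra = 4 * N if (tag == "any" and needs_src_zp_compensation) else 0
    return payload + extra

# User lets the library decide: the size includes the compensation region.
sz_any = md_size(512, 1024, "any", needs_src_zp_compensation=True)
# User forces the plain blocked tag: only the raw payload is allocated,
# so there is nowhere to store the pre-computed compensation.
sz_fixed = md_size(512, 1024, "BA16a64b4a", needs_src_zp_compensation=True)
```

The point of the sketch is only that querying the size of a library-initialized descriptor transparently reserves the extra space, which a hard-coded allocation cannot anticipate.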

You can read this behavior as "oneDNN can't produce an optimized version of the matmul algorithm with a src zero point other than through this special trick". Technically we could, but performance-wise it would make no sense, as it would require upcasting the int8 buffers to s32, and the whole point of int8 would evaporate.

The only option you have in such a case, if you really want to force that format on B, is to make a per-K reduction of B (i.e. sum over all K values, yielding N values total), scale it by the src zero point, and apply that result through a binary post-op to the original matmul output while dropping the src zero point from the primitive. That gives an identical outcome (subject to some data type compliance).
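As a sanity check of the identity behind this workaround, here is a plain-Python sketch (toy matrices, not oneDNN API code) showing that a plain matmul minus zp_src times the per-K column sums of B equals the matmul with the zero point subtracted from the source:

```python
# Identity behind the workaround: (A - zp_src) @ B == A @ B - zp_src * colsum(B),
# where colsum(B)[n] = sum over k of B[k][n]. Toy integers, no oneDNN involved.

def matmul(A, B):
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[m][k] * B[k][n] for k in range(K)) for n in range(N)]
            for m in range(M)]

A = [[3, 1], [0, 2]]   # u8-like source values
B = [[2, -1], [4, 5]]  # s8-like weights
zp_src = 1             # common source zero point

# Reference: subtract the zero point from the source before the matmul.
A_shifted = [[a - zp_src for a in row] for row in A]
ref = matmul(A_shifted, B)

# Workaround: plain matmul, then subtract zp_src * (per-K column sums of B),
# which is what a binary post-op on the output would apply.
comp = [zp_src * sum(B[k][n] for k in range(len(B))) for n in range(len(B[0]))]
out = [[v - comp[n] for n, v in enumerate(row)] for row in matmul(A, B)]

assert out == ref
```

The per-K reduction of B has N values, so the correction is a per-output-channel subtraction, matching the per-channel (zpm2) compensation the library computes internally.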

  3. Is it possible to specify f8 and zpm2 during memory descriptor creation (along with tag::BA16a64b4a)?

No. Only the library can initialize memory descriptors that way. The recommendation is provided above.

Feel free to follow up. Thank you.

@SriAlavandar
Author

Thanks for the response

Hi @kminemur @kazukiminemura

Yes, we do not observe this behavior with benchdnn as run above because no zero point is passed.
I used a standalone program similar to this example. By changing the weights tag from any to BA16a64b4a, we can observe the behavior described above.

With benchdnn, do we have an option to pass zero points and scales when dealing with INT8?

@dzarukin
Contributor

dzarukin commented Mar 7, 2025

There's a converter from verbose log to benchdnn.

This is the benchdnn line for the main branch for the verbose output you shared:

--dt=u8:s8:u8 --stag=ab --dtag=ab --bia-dt=f32 --attr-zero-points=src:common:1+dst:common:1 700x1024:1024x512

Though I'm not sure which version you are on, or whether you modified the line before posting it, the one posted is definitely not from a fresh build...

@kminemur

kminemur commented Mar 7, 2025

Thanks @dzarukin

Hi @SriAlavandar
I confirmed that the reference kernel is selected with "--wtag=BA16a64b4a".
Here are my full commands with oneDNN v3.7.0 (commit 862289d):

DNNL_VERBOSE=1 ./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --stag=ab --dtag=ab --wtag=any --bia_dt=f32 --attr-scales=src:common:0+wei:common:2+dst:common:0 --attr-zero-points=src:common:1+dst:common:1 700x1024:1024x512 | grep matmul
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src:u8::blocked:ab::f0 wei:s8:a:blocked:BA16a64b4a::f8:zpm2 bia:f32:a:blocked:ab::f0_mask2 dst:u8::blocked:ab::f0,attr-scales:src0:0:f32+dst:0:f32+wei:0:f32 attr-zero-points:src0:0:s32+dst:0:s32,,700x1024:1024x512,4.93604
0:PASSED __REPRO: --matmul --dt=u8:s8:u8 --stag=ab --dtag=ab --bia_dt=f32 --attr-scales=src:common:0+dst:common:0+wei:common:2 --attr-zero-points=src:common:1+dst:common:1 700x1024:1024x512

DNNL_VERBOSE=1 ./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --stag=ab --dtag=ab --wtag=BA16a64b4a --bia_dt=f32 --attr-scales=src:common:0+wei:common:2+dst:common:0 --attr-zero-points=src:common:1+dst:common:1 700x1024:1024x512 | grep matmul
onednn_verbose,v1,primitive,exec,cpu,matmul,ref_int8:any,undef,src:u8::blocked:ab::f0 wei:s8::blocked:BA16a64b4a::f0 bia:f32:a:blocked:ab::f0_mask2 dst:u8::blocked:ab::f0,attr-scales:src0:0:f32+dst:0:f32+wei:0:f32 attr-zero-points:src0:0:s32+dst:0:s32,,700x1024:1024x512,1860.15
0:PASSED __REPRO: --matmul --dt=u8:s8:u8 --stag=ab --wtag=BA16a64b4a --dtag=ab --bia_dt=f32 --attr-scales=src:common:0+dst:common:0+wei:common:2 --attr-zero-points=src:common:1+dst:common:1 700x1024:1024x512
