use 4x16s4 gemm kernel for meteor lake #6485

xujuntwt95329 · 2024-05-28T07:29:40Z

No description provided.

google-cla · 2024-05-28T07:29:45Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

xujuntwt95329 · 2024-05-28T07:30:34Z

This PR can re-produce data in #6480

fbarchard · 2024-05-28T08:18:11Z

cmake/DownloadCpuinfo.cmake

-  URL https://github.com/pytorch/cpuinfo/archive/d6860c477c99f1fce9e28eb206891af3c0e1a1d7.zip
-  URL_HASH SHA256=a615cac78fad03952cc3e1fd231ce789a8df6e81a5957b64350cb8200364b385
+  URL https://github.com/xujuntwt95329/cpuinfo/archive/10bf0a03827877a4c6896189f27e291f78a62ce5.zip
+  URL_HASH SHA256=2128b3087bf8c190128fbe8f1003552f84f5f0ed79c79868212d99eee11296eb


FYI we will need to manually patch this meteorlake detect into our internal cpuinfo

Thanks for the suggestion! I've submitted a PR to cpuinfo repo: pytorch/cpuinfo#247

If the review takes too much time, we can go ahead with manual patch to internal cpuinfo

fbarchard · 2024-05-28T20:31:46Z

src/configs/gemm-config.c

@@ -789,6 +789,7 @@ static void init_f32_gemm_config(void) {
      switch (cpuinfo_get_core(0)->uarch) {
        case cpuinfo_uarch_zen:
        case cpuinfo_uarch_dhyana:
+        case cpuinfo_uarch_meteor_lake:
          f32_gemm_config.minmax.gemm[XNN_MR_TO_INDEX(1)] = xnn_init_hmp_gemm_ukernel((xnn_gemm_ukernel_fn) xnn_f32_gemm_minmax_ukernel_1x16s4__fma3_broadcast);
          f32_gemm_config.minmax.gemm[XNN_MR_TO_INDEX(4)] = xnn_init_hmp_gemm_ukernel((xnn_gemm_ukernel_fn) xnn_f32_gemm_minmax_ukernel_4x16s4__fma3_broadcast);
          f32_gemm_config.minmax.igemm[XNN_MR_TO_INDEX(1)] = xnn_init_hmp_igemm_ukernel((xnn_igemm_ukernel_fn) xnn_f32_igemm_minmax_ukernel_1x16s4__fma3_broadcast);


xnn_f32_gemm_minmax_ukernel_4x16s4__fma3_broadcast is faster than xnn_f32_gemm_minmax_ukernel_5x16__fma3_broadcast?

The shuffle it does is 7 cycles for 4 floats, vs the avx_broadcast is 3 cycles for 1 float.
4x16s4 does a shuffer after each fma3, so 3 shuffles of 1 cycle each.

A 5x16 or 7x16 tends to work better than 4x16 but the s4 must be low on registers.

The f32_gemm_e2e_bench benchmark is an easy test against mobilenet
f32_gemm_e2e_bench --benchmark_filter=v2
On a Skylake Xeon
f32_gemm_3x16__fma3_broadcast/mobilenet_v2 12383 us
f32_gemm_4x16__fma3_broadcast/mobilenet_v2 11610 us
f32_gemm_5x16__fma3_broadcast/mobilenet_v2 11414 us <- old
f32_gemm_3x16s4__fma3_broadcast/mobilenet_v2 12668 us
f32_gemm_4x16s4__fma3_broadcast/mobilenet_v2 12444 us <- new
f32_gemm_5x16s4__fma3_broadcast/mobilenet_v2 16098 us

On Sapphire Rapids
f32_gemm_3x16__fma3_broadcast/mobilenet_v2 7888 us
f32_gemm_4x16__fma3_broadcast/mobilenet_v2 6901 us
f32_gemm_5x16__fma3_broadcast/mobilenet_v2 6665 us <- old
f32_gemm_3x16s4__fma3_broadcast/mobilenet_v2 8436 us
f32_gemm_4x16s4__fma3_broadcast/mobilenet_v2 7544 us <- new
f32_gemm_5x16s4__fma3_broadcast/mobilenet_v2 10783 us

But these are xeon's with fast memory/cache and older uarch. Can you run e2e and/or after making the change, run end2end_bench to make sure f32 is faster on mobilenet v1, v2 and v3?

Hi @fbarchard thanks for the explain and suggestion, I tested both end2end_bench and f32_gemm_e2e_bench on MTL:

For end2end_bench, the 4x16s4 kernel is indeed faster than the 5x16 one

benchmark 5x16 (us) 4x16s4 (us) Reduction on inference time (%)

FP32MobileNetV1/T:1/real_time 16193 10775 33.46

FP32MobileNetV2/T:1/real_time 8809 6626 24.78

FP32MobileNetV3Large/T:1/real_time 7756 6052 21.97

FP32MobileNetV3Small/T:1/real_time 2180 1970 9.63

And for f32_gemm_e2e_bench, the 4x16s4 kernel is better than 5x16 kernel, and the 4x16 kernel is better than 4x16s4 kernel.

f32_gemm_4x8__fma3_broadcast/mobilenet_v1/real_time 17698 us f32_gemm_5x8__fma3_broadcast/mobilenet_v1/real_time 14472 us f32_gemm_6x8__fma3_broadcast/mobilenet_v1/real_time 12606 us f32_gemm_7x8__fma3_broadcast/mobilenet_v1/real_time 12385 us f32_gemm_8x8__fma3_broadcast/mobilenet_v1/real_time 19982 us f32_gemm_3x16__fma3_broadcast/mobilenet_v1/real_time 13255 us f32_gemm_4x16__fma3_broadcast/mobilenet_v1/real_time 11103 us <==== best f32_gemm_5x16__fma3_broadcast/mobilenet_v1/real_time 16798 us <---- old f32_gemm_3x16s4__fma3_broadcast/mobilenet_v1/real_time 13898 us f32_gemm_4x16s4__fma3_broadcast/mobilenet_v1/real_time 11519 us <---- new f32_gemm_5x16s4__fma3_broadcast/mobilenet_v1/real_time 16710 us

I also tested this on my Coffee Lake CPU, and the result is similar to yours: the 5x16 kernel works better than 4x16s4.
I guess the difference on MTL may be due to different execution port scheduling strategy?

My guess is the 4x16S4 does a large load and 3 shuffles, vs 5x16 does 5 small loads and that meteorlake is has slow read and/or broadcast.
But the same microarchitecture is in Granite Rapids so it will be interesting to see if mtl and spr are the same.
If not, it may be the DDR/HBM and cache.
It will be good to have uarch so we can choose for each.

xujuntwt95329 · 2024-06-20T03:46:11Z

Thanks for all the review suggestions, we finally find out the performance issue was caused by compiler rather than the 5x16 kernel:

When building with MSVC, for 5x16 kernel there is one m256 vector spilled to stack, resulting in more store uops thus influence the performance. But when building with clang, it will re-use the ymm register for loading data from weight, no need to spill to stack.

The clang built binary shows that 5x16 kernel is better than 4x16 and 4x16s4 on MTL, so I will close this PR.

Thanks again for all the comments!

use 4x16s4 gemm kernel for meteor lake

0143aab

fbarchard reviewed May 28, 2024

View reviewed changes

xujuntwt95329 closed this Jun 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

use 4x16s4 gemm kernel for meteor lake #6485

use 4x16s4 gemm kernel for meteor lake #6485

xujuntwt95329 commented May 28, 2024

google-cla bot commented May 28, 2024

xujuntwt95329 commented May 28, 2024

fbarchard May 28, 2024

xujuntwt95329 May 28, 2024

fbarchard May 28, 2024

xujuntwt95329 May 30, 2024

fbarchard Jun 5, 2024

xujuntwt95329 commented Jun 20, 2024

benchmark	5x16 (us)	4x16s4 (us)	Reduction on inference time (%)
FP32MobileNetV1/T:1/real_time	16193	10775	33.46
FP32MobileNetV2/T:1/real_time	8809	6626	24.78
FP32MobileNetV3Large/T:1/real_time	7756	6052	21.97
FP32MobileNetV3Small/T:1/real_time	2180	1970	9.63

use 4x16s4 gemm kernel for meteor lake #6485

use 4x16s4 gemm kernel for meteor lake #6485

Conversation

xujuntwt95329 commented May 28, 2024

google-cla bot commented May 28, 2024

xujuntwt95329 commented May 28, 2024

fbarchard May 28, 2024

Choose a reason for hiding this comment

xujuntwt95329 May 28, 2024

Choose a reason for hiding this comment

fbarchard May 28, 2024

Choose a reason for hiding this comment

xujuntwt95329 May 30, 2024

Choose a reason for hiding this comment

fbarchard Jun 5, 2024

Choose a reason for hiding this comment

xujuntwt95329 commented Jun 20, 2024