Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use 4x16s4 gemm kernel for meteor lake #6485

Closed
wants to merge 1 commit into from

Conversation

xujuntwt95329
Copy link
Contributor

No description provided.

Copy link

google-cla bot commented May 28, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@xujuntwt95329
Copy link
Contributor Author

This PR can re-produce data in #6480

URL https://github.com/pytorch/cpuinfo/archive/d6860c477c99f1fce9e28eb206891af3c0e1a1d7.zip
URL_HASH SHA256=a615cac78fad03952cc3e1fd231ce789a8df6e81a5957b64350cb8200364b385
URL https://github.com/xujuntwt95329/cpuinfo/archive/10bf0a03827877a4c6896189f27e291f78a62ce5.zip
URL_HASH SHA256=2128b3087bf8c190128fbe8f1003552f84f5f0ed79c79868212d99eee11296eb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI we will need to manually patch this meteorlake detect into our internal cpuinfo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I've submitted a PR to cpuinfo repo: pytorch/cpuinfo#247

If the review takes too much time, we can go ahead with manual patch to internal cpuinfo

@@ -789,6 +789,7 @@ static void init_f32_gemm_config(void) {
switch (cpuinfo_get_core(0)->uarch) {
case cpuinfo_uarch_zen:
case cpuinfo_uarch_dhyana:
case cpuinfo_uarch_meteor_lake:
f32_gemm_config.minmax.gemm[XNN_MR_TO_INDEX(1)] = xnn_init_hmp_gemm_ukernel((xnn_gemm_ukernel_fn) xnn_f32_gemm_minmax_ukernel_1x16s4__fma3_broadcast);
f32_gemm_config.minmax.gemm[XNN_MR_TO_INDEX(4)] = xnn_init_hmp_gemm_ukernel((xnn_gemm_ukernel_fn) xnn_f32_gemm_minmax_ukernel_4x16s4__fma3_broadcast);
f32_gemm_config.minmax.igemm[XNN_MR_TO_INDEX(1)] = xnn_init_hmp_igemm_ukernel((xnn_igemm_ukernel_fn) xnn_f32_igemm_minmax_ukernel_1x16s4__fma3_broadcast);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xnn_f32_gemm_minmax_ukernel_4x16s4__fma3_broadcast is faster than xnn_f32_gemm_minmax_ukernel_5x16__fma3_broadcast?

The shuffle it does is 7 cycles for 4 floats, vs the avx_broadcast is 3 cycles for 1 float.
4x16s4 does a shuffer after each fma3, so 3 shuffles of 1 cycle each.

A 5x16 or 7x16 tends to work better than 4x16 but the s4 must be low on registers.

The f32_gemm_e2e_bench benchmark is an easy test against mobilenet
f32_gemm_e2e_bench --benchmark_filter=v2
On a Skylake Xeon
f32_gemm_3x16__fma3_broadcast/mobilenet_v2 12383 us
f32_gemm_4x16__fma3_broadcast/mobilenet_v2 11610 us
f32_gemm_5x16__fma3_broadcast/mobilenet_v2 11414 us <- old
f32_gemm_3x16s4__fma3_broadcast/mobilenet_v2 12668 us
f32_gemm_4x16s4__fma3_broadcast/mobilenet_v2 12444 us <- new
f32_gemm_5x16s4__fma3_broadcast/mobilenet_v2 16098 us

On Sapphire Rapids
f32_gemm_3x16__fma3_broadcast/mobilenet_v2 7888 us
f32_gemm_4x16__fma3_broadcast/mobilenet_v2 6901 us
f32_gemm_5x16__fma3_broadcast/mobilenet_v2 6665 us <- old
f32_gemm_3x16s4__fma3_broadcast/mobilenet_v2 8436 us
f32_gemm_4x16s4__fma3_broadcast/mobilenet_v2 7544 us <- new
f32_gemm_5x16s4__fma3_broadcast/mobilenet_v2 10783 us

But these are xeon's with fast memory/cache and older uarch. Can you run e2e and/or after making the change, run end2end_bench to make sure f32 is faster on mobilenet v1, v2 and v3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @fbarchard thanks for the explain and suggestion, I tested both end2end_bench and f32_gemm_e2e_bench on MTL:

For end2end_bench, the 4x16s4 kernel is indeed faster than the 5x16 one

benchmark 5x16 (us) 4x16s4 (us) Reduction on inference time (%)
FP32MobileNetV1/T:1/real_time 16193 10775 33.46
FP32MobileNetV2/T:1/real_time 8809 6626 24.78
FP32MobileNetV3Large/T:1/real_time 7756 6052 21.97
FP32MobileNetV3Small/T:1/real_time 2180 1970 9.63

And for f32_gemm_e2e_bench, the 4x16s4 kernel is better than 5x16 kernel, and the 4x16 kernel is better than 4x16s4 kernel.

f32_gemm_4x8__fma3_broadcast/mobilenet_v1/real_time         17698 us
f32_gemm_5x8__fma3_broadcast/mobilenet_v1/real_time         14472 us
f32_gemm_6x8__fma3_broadcast/mobilenet_v1/real_time         12606 us
f32_gemm_7x8__fma3_broadcast/mobilenet_v1/real_time         12385 us
f32_gemm_8x8__fma3_broadcast/mobilenet_v1/real_time         19982 us
f32_gemm_3x16__fma3_broadcast/mobilenet_v1/real_time        13255 us
f32_gemm_4x16__fma3_broadcast/mobilenet_v1/real_time        11103 us    <==== best
f32_gemm_5x16__fma3_broadcast/mobilenet_v1/real_time        16798 us    <---- old
f32_gemm_3x16s4__fma3_broadcast/mobilenet_v1/real_time      13898 us
f32_gemm_4x16s4__fma3_broadcast/mobilenet_v1/real_time      11519 us    <---- new
f32_gemm_5x16s4__fma3_broadcast/mobilenet_v1/real_time      16710 us

I also tested this on my Coffee Lake CPU, and the result is similar to yours: the 5x16 kernel works better than 4x16s4.
I guess the difference on MTL may be due to different execution port scheduling strategy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is the 4x16S4 does a large load and 3 shuffles, vs 5x16 does 5 small loads and that meteorlake is has slow read and/or broadcast.
But the same microarchitecture is in Granite Rapids so it will be interesting to see if mtl and spr are the same.
If not, it may be the DDR/HBM and cache.
It will be good to have uarch so we can choose for each.

@xujuntwt95329
Copy link
Contributor Author

Thanks for all the review suggestions, we finally find out the performance issue was caused by compiler rather than the 5x16 kernel:

When building with MSVC, for 5x16 kernel there is one m256 vector spilled to stack, resulting in more store uops thus influence the performance. But when building with clang, it will re-use the ymm register for loading data from weight, no need to spill to stack.

The clang built binary shows that 5x16 kernel is better than 4x16 and 4x16s4 on MTL, so I will close this PR.

Thanks again for all the comments!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants