use 4x16s4 gemm kernel for meteor lake #6485
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
This PR can reproduce the data in #6480.
-URL https://github.com/pytorch/cpuinfo/archive/d6860c477c99f1fce9e28eb206891af3c0e1a1d7.zip
-URL_HASH SHA256=a615cac78fad03952cc3e1fd231ce789a8df6e81a5957b64350cb8200364b385
+URL https://github.com/xujuntwt95329/cpuinfo/archive/10bf0a03827877a4c6896189f27e291f78a62ce5.zip
+URL_HASH SHA256=2128b3087bf8c190128fbe8f1003552f84f5f0ed79c79868212d99eee11296eb
FYI, we will need to manually patch this Meteor Lake detection into our internal cpuinfo.
Thanks for the suggestion! I've submitted a PR to the cpuinfo repo: pytorch/cpuinfo#247
If the review takes too much time, we can go ahead with a manual patch to the internal cpuinfo.
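For context, here is a minimal sketch of what the uarch query looks like on the cpuinfo side; the `cpuinfo_uarch_meteor_lake` enum value assumes pytorch/cpuinfo#247 (or an equivalent internal patch) is in place:

```c
#include <inttypes.h>
#include <stdio.h>
#include <cpuinfo.h>

int main(void) {
  /* cpuinfo must be initialized before any query. */
  if (!cpuinfo_initialize()) {
    fprintf(stderr, "failed to initialize cpuinfo\n");
    return 1;
  }
  /* XNNPACK keys kernel selection off the uarch of core 0,
     as in init_f32_gemm_config below. */
  const struct cpuinfo_core* core = cpuinfo_get_core(0);
  switch (core->uarch) {
    case cpuinfo_uarch_meteor_lake:  /* requires the patched cpuinfo */
      printf("Meteor Lake detected\n");
      break;
    default:
      printf("uarch id: 0x%08" PRIx32 "\n", (uint32_t) core->uarch);
      break;
  }
  cpuinfo_deinitialize();
  return 0;
}
```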
@@ -789,6 +789,7 @@ static void init_f32_gemm_config(void) {
       switch (cpuinfo_get_core(0)->uarch) {
         case cpuinfo_uarch_zen:
         case cpuinfo_uarch_dhyana:
+        case cpuinfo_uarch_meteor_lake:
           f32_gemm_config.minmax.gemm[XNN_MR_TO_INDEX(1)] = xnn_init_hmp_gemm_ukernel((xnn_gemm_ukernel_fn) xnn_f32_gemm_minmax_ukernel_1x16s4__fma3_broadcast);
           f32_gemm_config.minmax.gemm[XNN_MR_TO_INDEX(4)] = xnn_init_hmp_gemm_ukernel((xnn_gemm_ukernel_fn) xnn_f32_gemm_minmax_ukernel_4x16s4__fma3_broadcast);
           f32_gemm_config.minmax.igemm[XNN_MR_TO_INDEX(1)] = xnn_init_hmp_igemm_ukernel((xnn_igemm_ukernel_fn) xnn_f32_igemm_minmax_ukernel_1x16s4__fma3_broadcast);
Is xnn_f32_gemm_minmax_ukernel_4x16s4__fma3_broadcast faster than xnn_f32_gemm_minmax_ukernel_5x16__fma3_broadcast?
The shuffle it does is 7 cycles for 4 floats, vs 3 cycles for 1 float for the avx broadcast.
4x16s4 does a shuffle after each fma3, so 3 shuffles of 1 cycle each.
A 5x16 or 7x16 tends to work better than 4x16, but the s4 variants must be running low on registers.
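For readers following along, here is a minimal single-row, 8-wide sketch of the two inner-loop shapes being compared. It is illustrative only: the real kernels use 4-5 rows and 16 columns, and the `dot_s4` packing below is a simplification of the lane rotation the s4 kernels rely on.

```c
/* Contrast of the two inner-loop styles, reduced to 1 row x 8 columns.
 * broadcast: one scalar broadcast-load from A per FMA.
 * s4: load 4 floats of A once, then rotate lanes between FMAs,
 *     with B pre-packed to match the rotation.
 * Compile with: gcc -O2 -mavx2 -mfma sketch.c
 */
#include <immintrin.h>
#include <stddef.h>
#include <stdio.h>

static __m256 dot_broadcast(const float* a, const float* b, size_t k) {
  __m256 vacc = _mm256_setzero_ps();
  for (size_t i = 0; i < k; i++) {
    const __m256 va = _mm256_broadcast_ss(a + i);  /* 1 float -> 8 lanes */
    vacc = _mm256_fmadd_ps(va, _mm256_loadu_ps(b + i * 8), vacc);
  }
  return vacc;
}

static __m256 dot_s4(const float* a, const float* b_packed, size_t k) {
  __m256 vacc = _mm256_setzero_ps();
  for (size_t i = 0; i < k; i += 4) {
    /* 4 floats of A, duplicated into both 128-bit halves. */
    __m256 va = _mm256_broadcast_ps((const __m128*) (a + i));
    for (size_t s = 0; s < 4; s++) {
      vacc = _mm256_fmadd_ps(va, _mm256_loadu_ps(b_packed + (i + s) * 8), vacc);
      /* rotate lanes so the next FMA consumes the next A value */
      va = _mm256_permute_ps(va, _MM_SHUFFLE(0, 3, 2, 1));
    }
  }
  return vacc;
}

int main(void) {
  enum { K = 8, N = 8 };
  float a[K], b[K * N], bp[K * N];
  for (size_t i = 0; i < K; i++) a[i] = (float) (i + 1);
  for (size_t i = 0; i < K * N; i++) b[i] = (float) (i % 7) - 3.0f;

  /* Pack B so that at sub-step s, lane j reads B[i + (j + s) % 4][j];
     this makes the rotated-lane products line up with the plain dot. */
  for (size_t i = 0; i < K; i += 4)
    for (size_t s = 0; s < 4; s++)
      for (size_t j = 0; j < N; j++)
        bp[(i + s) * N + j] = b[(i + (j + s) % 4) * N + j];

  float r0[8], r1[8];
  _mm256_storeu_ps(r0, dot_broadcast(a, b, K));
  _mm256_storeu_ps(r1, dot_s4(a, bp, K));
  for (int j = 0; j < 8; j++)
    printf("col %d: broadcast %6.1f  s4 %6.1f\n", j, r0[j], r1[j]);
  return 0;
}
```

The two functions produce identical results; the difference is purely the instruction mix, which is what the benchmarks below are measuring.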
The f32_gemm_e2e_bench benchmark is an easy test against MobileNet:
f32_gemm_e2e_bench --benchmark_filter=v2
On a Skylake Xeon
f32_gemm_3x16__fma3_broadcast/mobilenet_v2 12383 us
f32_gemm_4x16__fma3_broadcast/mobilenet_v2 11610 us
f32_gemm_5x16__fma3_broadcast/mobilenet_v2 11414 us <- old
f32_gemm_3x16s4__fma3_broadcast/mobilenet_v2 12668 us
f32_gemm_4x16s4__fma3_broadcast/mobilenet_v2 12444 us <- new
f32_gemm_5x16s4__fma3_broadcast/mobilenet_v2 16098 us
On Sapphire Rapids
f32_gemm_3x16__fma3_broadcast/mobilenet_v2 7888 us
f32_gemm_4x16__fma3_broadcast/mobilenet_v2 6901 us
f32_gemm_5x16__fma3_broadcast/mobilenet_v2 6665 us <- old
f32_gemm_3x16s4__fma3_broadcast/mobilenet_v2 8436 us
f32_gemm_4x16s4__fma3_broadcast/mobilenet_v2 7544 us <- new
f32_gemm_5x16s4__fma3_broadcast/mobilenet_v2 10783 us
But these are Xeons with fast memory/cache and an older uarch. Can you run the e2e benchmark, and/or, after making the change, run end2end_bench to make sure f32 is faster on MobileNet v1, v2, and v3?
Hi @fbarchard, thanks for the explanation and suggestions. I tested both end2end_bench and f32_gemm_e2e_bench on MTL.
For end2end_bench, the 4x16s4 kernel is indeed faster than the 5x16 one:
| benchmark | 5x16 (us) | 4x16s4 (us) | Inference time reduction (%) |
|---|---|---|---|
| FP32MobileNetV1/T:1/real_time | 16193 | 10775 | 33.46 |
| FP32MobileNetV2/T:1/real_time | 8809 | 6626 | 24.78 |
| FP32MobileNetV3Large/T:1/real_time | 7756 | 6052 | 21.97 |
| FP32MobileNetV3Small/T:1/real_time | 2180 | 1970 | 9.63 |
And for f32_gemm_e2e_bench, the 4x16s4 kernel is better than the 5x16 kernel, and the 4x16 kernel is better than the 4x16s4 kernel:
f32_gemm_4x8__fma3_broadcast/mobilenet_v1/real_time 17698 us
f32_gemm_5x8__fma3_broadcast/mobilenet_v1/real_time 14472 us
f32_gemm_6x8__fma3_broadcast/mobilenet_v1/real_time 12606 us
f32_gemm_7x8__fma3_broadcast/mobilenet_v1/real_time 12385 us
f32_gemm_8x8__fma3_broadcast/mobilenet_v1/real_time 19982 us
f32_gemm_3x16__fma3_broadcast/mobilenet_v1/real_time 13255 us
f32_gemm_4x16__fma3_broadcast/mobilenet_v1/real_time 11103 us <==== best
f32_gemm_5x16__fma3_broadcast/mobilenet_v1/real_time 16798 us <---- old
f32_gemm_3x16s4__fma3_broadcast/mobilenet_v1/real_time 13898 us
f32_gemm_4x16s4__fma3_broadcast/mobilenet_v1/real_time 11519 us <---- new
f32_gemm_5x16s4__fma3_broadcast/mobilenet_v1/real_time 16710 us
I also tested this on my Coffee Lake CPU, and the result is similar to yours: the 5x16 kernel works better than 4x16s4.
I guess the difference on MTL may be due to a different execution-port scheduling strategy?
My guess is the 4x16s4 does one large load and 3 shuffles, vs 5x16 which does 5 small loads, and that Meteor Lake has slow reads and/or broadcasts.
But the same microarchitecture is in Granite Rapids, so it will be interesting to see if MTL and SPR are the same.
If not, it may be the DDR/HBM and cache.
It will be good to have the uarch detection so we can choose for each.
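To put rough numbers on that guess (my own back-of-envelope count, using the 4x16 shape for both styles): per 4 rows x 16 columns x 4 k-steps, the broadcast kernel issues 16 scalar broadcast-loads from A, 8 ymm loads from B, and 32 FMAs, while the s4 kernel issues 4 wide vbroadcastf128 loads from A plus the same 8 B loads and 32 FMAs, trading the other 12 A-side memory ops for 12 in-register permutes. So if Meteor Lake's broadcast-load throughput is the bottleneck, s4 wins; if its shuffle port is the bottleneck, the plain broadcast kernel wins.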
Thanks for all the review suggestions! We finally found out that the performance issue was caused by the compiler rather than the 5x16 kernel: a clang-built binary shows that the 5x16 kernel is better than 4x16 and 4x16s4 on MTL, so I will close this PR. Thanks again for all the comments!