use 4x16s4 gemm kernel for meteor lake #6485
@@ -789,6 +789,7 @@ static void init_f32_gemm_config(void) {
     switch (cpuinfo_get_core(0)->uarch) {
       case cpuinfo_uarch_zen:
       case cpuinfo_uarch_dhyana:
+      case cpuinfo_uarch_meteor_lake:
        f32_gemm_config.minmax.gemm[XNN_MR_TO_INDEX(1)] = xnn_init_hmp_gemm_ukernel((xnn_gemm_ukernel_fn) xnn_f32_gemm_minmax_ukernel_1x16s4__fma3_broadcast);
        f32_gemm_config.minmax.gemm[XNN_MR_TO_INDEX(4)] = xnn_init_hmp_gemm_ukernel((xnn_gemm_ukernel_fn) xnn_f32_gemm_minmax_ukernel_4x16s4__fma3_broadcast);
        f32_gemm_config.minmax.igemm[XNN_MR_TO_INDEX(1)] = xnn_init_hmp_igemm_ukernel((xnn_igemm_ukernel_fn) xnn_f32_igemm_minmax_ukernel_1x16s4__fma3_broadcast);
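The kernel choice above only takes effect when cpuinfo actually classifies core 0 as Meteor Lake, which is what the internal-cpuinfo discussion further down is about. Below is a minimal sketch for checking that on a target machine; it assumes the public cpuinfo API (cpuinfo_initialize, cpuinfo_get_core, cpuinfo_deinitialize) and a cpuinfo revision whose cpuinfo_uarch enum already defines cpuinfo_uarch_meteor_lake, and the file name is made up.

// check_meteor_lake.c (hypothetical file name) -- illustrative sketch only.
// Prints whether cpuinfo classifies core 0 as Meteor Lake, i.e. whether the
// switch in init_f32_gemm_config() above would take the new 4x16s4 path.
#include <stdio.h>
#include <cpuinfo.h>

int main(void) {
  if (!cpuinfo_initialize()) {
    fprintf(stderr, "cpuinfo_initialize() failed\n");
    return 1;
  }
  const struct cpuinfo_core* core = cpuinfo_get_core(0);
  if (core == NULL) {
    fprintf(stderr, "no core information available\n");
    cpuinfo_deinitialize();
    return 1;
  }
  printf("core 0 uarch: 0x%08x (%s)\n",
         (unsigned) core->uarch,
         core->uarch == cpuinfo_uarch_meteor_lake ? "meteor_lake" : "not meteor_lake");
  cpuinfo_deinitialize();
  return 0;
}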
Review comment from @fbarchard:

xnn_f32_gemm_minmax_ukernel_4x16s4__fma3_broadcast is faster than xnn_f32_gemm_minmax_ukernel_5x16__fma3_broadcast? The shuffle it does is 7 cycles for 4 floats, vs the AVX broadcast's 3 cycles for 1 float. A 5x16 or 7x16 tends to work better than 4x16, but the s4 must be low on registers. The f32_gemm_e2e_bench benchmark is an easy test against mobilenet on Sapphire Rapids, but those are Xeons with fast memory/cache and an older uarch. Can you run e2e and/or, after making the change, run end2end_bench to make sure f32 is faster on mobilenet v1, v2 and v3?

Reply from the PR author:

Hi @fbarchard, thanks for the explanation and the suggestion. I tested both benchmarks. For end2end_bench, the 4x16s4 kernel is indeed faster than the 5x16 one. And for f32_gemm_e2e_bench on mobilenet v1:

f32_gemm_4x8__fma3_broadcast/mobilenet_v1/real_time     17698 us
f32_gemm_5x8__fma3_broadcast/mobilenet_v1/real_time     14472 us
f32_gemm_6x8__fma3_broadcast/mobilenet_v1/real_time     12606 us
f32_gemm_7x8__fma3_broadcast/mobilenet_v1/real_time     12385 us
f32_gemm_8x8__fma3_broadcast/mobilenet_v1/real_time     19982 us
f32_gemm_3x16__fma3_broadcast/mobilenet_v1/real_time    13255 us
f32_gemm_4x16__fma3_broadcast/mobilenet_v1/real_time    11103 us  <==== best
f32_gemm_5x16__fma3_broadcast/mobilenet_v1/real_time    16798 us  <---- old
f32_gemm_3x16s4__fma3_broadcast/mobilenet_v1/real_time  13898 us
f32_gemm_4x16s4__fma3_broadcast/mobilenet_v1/real_time  11519 us  <---- new
f32_gemm_5x16s4__fma3_broadcast/mobilenet_v1/real_time  16710 us

I also tested this on my Coffee Lake CPU, and there the result is similar to yours: the 5x16 kernel works better than 4x16s4.

Follow-up review comment:

My guess is that the 4x16s4 kernel does one large load and 3 shuffles, vs the 5x16 kernel's 5 small loads, and that Meteor Lake has a slow read and/or broadcast.
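To make that load/shuffle trade-off concrete, here is a small self-contained sketch (not the actual XNNPACK microkernels) of the two inner-loop patterns for one row of A against 8 output columns with K = 4. The broadcast variant issues one scalar broadcast per A element; the s4 variant loads 4 A elements once and rotates them in-register between FMAs, which requires the B panel to be packed with a matching rotation (XNNPACK does this at weight-packing time). The file name and build line are illustrative assumptions.

// s4_vs_broadcast.c -- illustrative sketch, not the actual XNNPACK kernels.
// Computes c[0..7] = sum_k a[k] * b[k][0..7] for K = 4 in two ways.
// Build (assumed): cc -O2 -mavx2 -mfma s4_vs_broadcast.c
#include <immintrin.h>
#include <stdio.h>

int main(void) {
  float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
  float b[4][8];                       // K x N panel, row-major
  for (int k = 0; k < 4; k++)
    for (int n = 0; n < 8; n++)
      b[k][n] = (float) (k * 8 + n);

  // Pattern 1: "broadcast" -- one scalar broadcast per A element (per k).
  __m256 acc_bcast = _mm256_setzero_ps();
  for (int k = 0; k < 4; k++) {
    const __m256 va = _mm256_broadcast_ss(&a[k]);       // 1 float -> 8 lanes
    const __m256 vb = _mm256_loadu_ps(b[k]);
    acc_bcast = _mm256_fmadd_ps(va, vb, acc_bcast);
  }

  // Pattern 2: "s4" -- load 4 A elements once, rotate lanes between FMAs.
  // The B panel must be pre-shuffled so that at step s, lane n holds
  // b[((n % 4) + s) % 4][n]; the real kernels rely on packed weights for this.
  float b_s4[4][8];
  for (int s = 0; s < 4; s++)
    for (int n = 0; n < 8; n++)
      b_s4[s][n] = b[((n % 4) + s) % 4][n];

  __m256 acc_s4 = _mm256_setzero_ps();
  __m256 va = _mm256_broadcast_ps((const __m128*) a);   // a0 a1 a2 a3 | a0 a1 a2 a3
  for (int s = 0; s < 4; s++) {
    acc_s4 = _mm256_fmadd_ps(va, _mm256_loadu_ps(b_s4[s]), acc_s4);
    va = _mm256_permute_ps(va, _MM_SHUFFLE(0, 3, 2, 1)); // rotate each 128-bit lane
  }

  float out_bcast[8], out_s4[8];
  _mm256_storeu_ps(out_bcast, acc_bcast);
  _mm256_storeu_ps(out_s4, acc_s4);
  for (int n = 0; n < 8; n++)
    printf("col %d: broadcast=%g s4=%g\n", n, out_bcast[n], out_s4[n]);
  return 0;
}

Which pattern wins then comes down to the uarch's load, broadcast and shuffle throughput, which is consistent with 4x16s4 winning on Meteor Lake while 5x16 stays ahead on Coffee Lake and Sapphire Rapids.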
Review comment:

FYI, we will need to manually patch this Meteor Lake detection into our internal cpuinfo.
Reply from the PR author:

Thanks for the suggestion! I've submitted a PR to the cpuinfo repo: pytorch/cpuinfo#247. If the review takes too much time, we can go ahead with a manual patch to our internal cpuinfo.