4x16s4 fp32-gemm kernel has better performance than the default (5x16) kernel on Meteor Lake #6480

xujuntwt95329 · 2024-05-27T15:40:46Z

XNNPACK uses the 5x16 fp32-gemm kernel by default for x86_fma3, but we found that the 4x16s4 kernel performs better on a Meteor Lake CPU (Intel(R) Core(TM) Ultra 7 155H).

| Benchmark | 5x16 (us) | 4x16s4 (us) | Reduction in inference time (%) |
| --- | --- | --- | --- |
| FP32MobileNetV1/T:1/real_time | 16193 | 10775 | 33.46 |
| FP32MobileNetV2/T:1/real_time | 8809 | 6626 | 24.78 |
| FP32MobileNetV3Large/T:1/real_time | 7756 | 6052 | 21.97 |
| FP32MobileNetV3Small/T:1/real_time | 2180 | 1970 | 9.63 |
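For anyone checking the last column, the reduction appears to be computed as (5x16 time - 4x16s4 time) / 5x16 time. The sketch below recomputes it from the timings in the table; the function name `reduction_percent` is only illustrative and is not part of XNNPACK or its benchmark tooling.

```python
def reduction_percent(baseline_us, candidate_us):
    """Relative reduction in inference time, as a percentage of the baseline."""
    return (baseline_us - candidate_us) / baseline_us * 100.0


# Timings (in microseconds) copied from the table above: (5x16, 4x16s4).
timings = {
    "FP32MobileNetV1/T:1/real_time": (16193, 10775),
    "FP32MobileNetV2/T:1/real_time": (8809, 6626),
    "FP32MobileNetV3Large/T:1/real_time": (7756, 6052),
    "FP32MobileNetV3Small/T:1/real_time": (2180, 1970),
}

for name, (baseline, candidate) in timings.items():
    print(f"{name}: {reduction_percent(baseline, candidate):.2f}%")
```

The recomputed values match the reported percentages to two decimal places.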

Here is the code to reproduce the above data: https://github.com/xujuntwt95329/XNNPACK/tree/0143aab98634c866b319decca52590e1eb54b9dd

We can submit a PR if this is welcome.


fbarchard · 2024-06-25T06:18:58Z

Note that this is due to register spilling in the code generated by Visual C. Clang produces better code with the 5x16 kernel.
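A rough way to see why register spilling can hit the 5x16 tile harder is to count vector registers. The sketch below is back-of-the-envelope arithmetic, not an analysis of the actual kernels: it assumes 256-bit registers holding 8 fp32 lanes, the 16 YMM registers available on x86-64 with AVX2/FMA3, and a worst case in which the compiler keeps one A operand per row live alongside all accumulators and B vectors. The function `naive_live_registers` is hypothetical.

```python
YMM_REGISTERS = 16  # vector registers available on x86-64 with AVX2/FMA3
LANES = 8           # fp32 lanes per 256-bit register


def naive_live_registers(mr, nr):
    """Worst-case register count for an mr x nr tile (assumed model).

    Assumes every accumulator, every B vector, and one A operand per row
    are live at the same time. A compiler that schedules well can reuse a
    single A register and needs fewer.
    """
    b_vectors = nr // LANES
    accumulators = mr * b_vectors
    a_vectors = mr
    return accumulators + b_vectors + a_vectors


for mr in (5, 4):
    needed = naive_live_registers(mr, 16)
    status = "exceeds" if needed > YMM_REGISTERS else "fits in"
    print(f"{mr}x16: {needed} registers, {status} the {YMM_REGISTERS} available")
```

Under these assumptions the 5x16 tile needs 17 registers in the worst case, while a 4-row tile needs 14. If only one A register is live at a time, 5x16 needs 13, which would be consistent with a compiler generating spill-free code for it.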

xujuntwt95329 mentioned this issue May 28, 2024

use 4x16s4 gemm kernel for meteor lake #6485

Closed
