Symmetric QGEMM kernel for ARMv8 A55 chip (#10754)
The ARM A55 micro-architecture (with dot product instructions), like the A53, is widely used for the little cores in big.LITTLE configurations. The A55 has narrower memory load/store hardware: a 128b load instruction blocks the pipeline for 2 whole cycles, during which no other instruction can be executed. A 64b load instruction, on the other hand, can be dual-issued with many other instructions.

This change adds a Symmetric QGEMM kernel for the A55 micro-architecture, in which we replace

ldr q4,[x1],#16

with

ldr d4,[x1],#8
ldr x11,[x1],#8
ins v4.d[1],x11

so that we can try to hide the memory load cycles behind computing cycles in the kernel.
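
The equivalence of the two load sequences can be sketched in portable C++. This is only an illustrative model, not code from the change: Vec128, LoadQ, LoadSplit and CheckSplitLoad are hypothetical names, and the register is modeled as two 64-bit lanes. It shows that the three-instruction sequence leaves the same 128 bits in v4 and advances x1 by the same 16 bytes; it says nothing about timing, which is the actual motivation for the change.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical model of a 128-bit NEON register as two 64-bit lanes.
struct Vec128 {
    uint64_t lane[2];
};

// Models `ldr q4,[x1],#16`: one 128-bit load, post-incrementing the pointer by 16.
inline Vec128 LoadQ(const uint8_t*& p) {
    Vec128 v;
    std::memcpy(v.lane, p, 16);
    p += 16;
    return v;
}

// Models the three-instruction replacement from the commit message.
inline Vec128 LoadSplit(const uint8_t*& p) {
    Vec128 v{};                     // a scalar `ldr d4` zeroes the upper half of v4
    std::memcpy(&v.lane[0], p, 8);  // ldr d4,[x1],#8   (low 64 bits)
    p += 8;
    uint64_t x11;
    std::memcpy(&x11, p, 8);        // ldr x11,[x1],#8  (high 64 bits into a GPR)
    p += 8;
    v.lane[1] = x11;                // ins v4.d[1],x11  (insert into the upper lane)
    return v;
}

// Asserts that both sequences produce the same register contents and pointer.
inline void CheckSplitLoad(const uint8_t* data) {
    const uint8_t* a = data;
    const uint8_t* b = data;
    Vec128 q = LoadQ(a);
    Vec128 s = LoadSplit(b);
    assert(q.lane[0] == s.lane[0]);
    assert(q.lane[1] == s.lane[1]);
    assert(a == b);
}
```

Calling CheckSplitLoad on any buffer of at least 16 bytes exercises both paths. In the real kernel the benefit comes from placing the two 64b loads and the ins between dot product instructions so that they can be dual-issued, which this model does not capture.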

Co-authored-by: Chen Fu <fuchen@microsoft.com>
chenfucn and Chen Fu authored Mar 7, 2022
1 parent 55af7a9 commit 50a6f09
Showing 5 changed files with 955 additions and 3 deletions.
2 changes: 2 additions & 0 deletions cmake/onnxruntime_mlas.cmake
@@ -62,6 +62,7 @@ function(setup_mlas_source_for_windows)
${MLAS_SRC_DIR}/arm64/SgemvKernelNeon.asm
${MLAS_SRC_DIR}/arm64/SymQgemmS8KernelNeon.asm
${MLAS_SRC_DIR}/arm64/SymQgemmS8KernelSDot.asm
+ ${MLAS_SRC_DIR}/arm64/SymQgemmS8KernelSDotLd64.asm
)
else()
target_sources(onnxruntime_mlas PRIVATE
@@ -290,6 +291,7 @@ else()
${MLAS_SRC_DIR}/aarch64/SgemvKernelNeon.S
${MLAS_SRC_DIR}/aarch64/SymQgemmS8KernelNeon.S
${MLAS_SRC_DIR}/aarch64/SymQgemmS8KernelSdot.S
+ ${MLAS_SRC_DIR}/aarch64/SymQgemmS8KernelSdotLd64.S
${MLAS_SRC_DIR}/qgemm_kernel_neon.cpp
${MLAS_SRC_DIR}/qgemm_kernel_udot.cpp
${MLAS_SRC_DIR}/qgemm_kernel_sdot.cpp
5 changes: 3 additions & 2 deletions onnxruntime/core/mlas/lib/aarch64/SymQgemmS8KernelSdot.S
@@ -18,14 +18,15 @@ Abstract:
constant. When the packed right hand side is cached, we achieves higher performance
by avoid packing all together.

+ This version utilizes dot product instructions, and uses 128b loads

--*/

#include "asmmacro.h"
#include "AssembleDotProduct.h"

//
- // Stack frame layout for the symmetric convolution kernel.
- // d8-d15, x19-x30 need to be preserved if used
+ // Stack frame layout d8-d15, x19-x30 need to be preserved if used
//
.equ .LGemmS8S8KernelFrame_SavedRegisters, (4 * 8)
.equ .LGemmS8S8KernelFrame_ColumnSumBuffer, (0 + .LGemmS8S8KernelFrame_SavedRegisters)