How to modify oneDNN to enable GEMM operation acceleration on your own hardware #2114

nanzh-19 opened this issue Sep 24, 2024 · 7 comments
Labels: platform:cpu-aarch64, question

Comments

@nanzh-19

My use case is inference acceleration on a CPU using TensorFlow Serving, and my hardware architecture is AArch64 (ARMv8). Currently, I've noticed that with oneDNN enabled, the performance bottleneck is in GEMM. I want to create a fused operator for GEMM and ReLU. Which parts of the code should I modify to improve performance? Thank you for your assistance!

@mgouicem added the platform:cpu-aarch64 label Sep 24, 2024
@mgouicem
Contributor

Hi @nanzh-19, you can fuse ReLU with the matmul operator at the oneDNN API level by using post-ops. You can find a full example of Matmul + ReLU here.
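
For reference, a minimal sketch of that pattern against the oneDNN v3.x C++ API (the shapes and plain row-major layouts below are illustrative, not from your workload):

```cpp
#include <unordered_map>
#include <vector>

#include "oneapi/dnnl/dnnl.hpp"

int main() {
    using namespace dnnl;

    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Illustrative problem size: dst[M, N] = ReLU(src[M, K] x weights[K, N]).
    const memory::dim M = 128, K = 256, N = 64;
    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // The fusion itself: attach a ReLU eltwise post-op to the matmul
    // through the primitive attributes.
    post_ops po;
    po.append_eltwise(algorithm::eltwise_relu, /*alpha=*/0.f, /*beta=*/0.f);
    primitive_attr attr;
    attr.set_post_ops(po);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    matmul fused_matmul(pd);

    // User-provided buffers; a negative input makes the ReLU observable.
    std::vector<float> src(M * K, -1.f), wei(K * N, 1.f), dst(M * N, 0.f);
    memory src_m(src_md, eng, src.data());
    memory wei_m(wei_md, eng, wei.data());
    memory dst_m(dst_md, eng, dst.data());

    fused_matmul.execute(strm, {{DNNL_ARG_SRC, src_m},
                                {DNNL_ARG_WEIGHTS, wei_m},
                                {DNNL_ARG_DST, dst_m}});
    strm.wait();  // dst is all zeros here: ReLU clamped the negative products
    return 0;
}
```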

Tagging @milpuz01 @cfRod @jondea for guidance on Tensorflow integration.

@nanzh-19
Author

Hi @mgouicem. Thank you for your response! We are aiming to optimize inference on a machine with an unreleased ARM architecture. It seems that oneDNN might not have specific information about our hardware, which could explain why we aren't achieving optimal performance. My understanding is that oneDNN is primarily optimized for existing ARM architectures. Is this correct? I appreciate your insights!

@theComputeKid
Contributor

@nanzh-19 : as @mgouicem mentioned, oneDNN can pass down a fused GEMM and ReLU to ACL, where it can execute an optimised operation.

In addition to his example, you can pass activation info in GEMMInfo here: https://github.com/ARM-software/ComputeLibrary/blob/de7288cb71e6b9190f52e50a44ed68c309e4a041/arm_compute/function_info/GEMMInfo.h#L86

And then specify ReLU as here: https://github.com/ARM-software/ComputeLibrary/blob/de7288cb71e6b9190f52e50a44ed68c309e4a041/arm_compute/function_info/ActivationLayerInfo.h#L49
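
A rough, untested sketch of what that looks like at the ACL level, assuming the set_activation_info setter on GEMMInfo and NEON GEMM (tensor shapes are illustrative):

```cpp
#include "arm_compute/core/TensorInfo.h"
#include "arm_compute/core/TensorShape.h"
#include "arm_compute/core/Types.h"
#include "arm_compute/function_info/ActivationLayerInfo.h"
#include "arm_compute/function_info/GEMMInfo.h"
#include "arm_compute/runtime/NEON/functions/NEGEMM.h"
#include "arm_compute/runtime/Tensor.h"

int main() {
    using namespace arm_compute;

    // d[M, N] = ReLU(a[M, K] * b[K, N]); note TensorShape takes (cols, rows).
    const unsigned int M = 128, K = 256, N = 64;
    Tensor a, b, d;
    a.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
    b.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
    d.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::F32));

    // Attach the ReLU to the GEMM so ACL can run a single fused kernel.
    GEMMInfo gemm_info;
    gemm_info.set_activation_info(
        ActivationLayerInfo(ActivationLayerInfo::ActivationFunction::RELU));

    NEGEMM gemm;
    gemm.configure(&a, &b, /*c=*/nullptr, &d, /*alpha=*/1.f, /*beta=*/0.f,
                   gemm_info);

    a.allocator()->allocate();
    b.allocator()->allocate();
    d.allocator()->allocate();

    gemm.run();
    return 0;
}
```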

Even if you have unreleased hardware, it is important to note that we optimise for architectural features such as SVE/NEON/SME, rather than for the machine description or vendor. So I believe you should get good performance.

Hope that helps, feel free to ask more questions.

@nanzh-19
Author

nanzh-19 commented Sep 24, 2024

Thank you for your comment. I have the following questions:

We believe that oneDNN has not been thoroughly optimized for our machine. For example, oneDNN is not aware of our machine's memory hierarchy, which may lead to suboptimal matrix blocking for GEMM. Therefore, we would like to make optimizations in oneDNN's code itself.

We concluded that oneDNN is not well optimized for our machine by comparing x86 servers with our own: when observing inference performance on a single NUMA node, we found that our machine's performance degraded significantly after enabling oneDNN.

[Figure omitted: the data in the figure represents the inference throughput.]

@mgouicem
Contributor

@nanzh-19 there could be multiple things at play here, and log files might be helpful (if you can share any).
In general, here are a few things to check:

  • did you build oneDNN with ACL, or did you use oneDNN as part of a framework (and if the latter, which one)? A typical ACL-enabled build is sketched below.
  • your OS might have to include support for your custom hardware. In particular, aarch64 jitted implementations get the system topology using a mix of hwcap and system files; see the Linux code here.
  • which threading runtime did you use?
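
For the first and third points, an ACL-enabled build looks roughly like this (the paths and application name are placeholders; the flags are oneDNN's documented aarch64 build options):

```sh
# Point oneDNN at an existing Compute Library build.
export ACL_ROOT_DIR=/path/to/ComputeLibrary

# Enable ACL-backed aarch64 primitives and pick a threading runtime
# (OMP shown here; TBB, THREADPOOL, and SEQ are the other CPU options).
cmake -B build -S oneDNN -DDNNL_AARCH64_USE_ACL=ON -DDNNL_CPU_RUNTIME=OMP
cmake --build build -j

# ONEDNN_VERBOSE=1 logs which implementation each primitive dispatches to,
# which is the easiest way to produce the log files mentioned above.
ONEDNN_VERBOSE=1 ./your_app
```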

@nanzh-19
Author

nanzh-19 commented Sep 27, 2024

Thank you for @mgouicem's comment. I've identified the reason for the performance discrepancy: on the ARM architecture, ACL's arm_gemm is being called, while on the x86 architecture, brgemm is used. My current issue is how to modify TensorFlow's calls so that brgemm can be used on the ARM architecture as well.

@theComputeKid
Contributor

Depending on how you got/built TensorFlow, I believe aarch64 ships with a much older version of oneDNN than x86. If you want to investigate further, you could also try running benchdnn directly against the latest oneDNN from main and see the difference for your use case.
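
For example, an isolated matmul+ReLU measurement might look like this (the shape is illustrative; check benchdnn's matmul driver docs for the exact option set in your checkout):

```sh
# Performance mode (--mode=P), f32 matmul with a fused ReLU post-op;
# the problem descriptor is MxK:KxN.
ONEDNN_VERBOSE=1 ./build/tests/benchdnn/benchdnn \
    --matmul --mode=P --dt=f32 --attr-post-ops=relu 128x256:256x64
```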
