v3.1

@harrymao2022 released this 31 Mar 19:32
· 60 commits to rls-v3.1 since this release

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable processor (formerly Sapphire Rapids).
    • Introduced initial optimizations for a future Intel Xeon Scalable processor (code name Sierra Forest). This functionality is disabled by default and should be enabled via CPU dispatcher control.
  • Intel Graphics Products:

    • Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
    • Improved concat primitive performance with per-argument scales on Intel GPUs.
  • AArch64-based Processors:

    • Improved layer normalization primitive performance with Compute Library for the Arm Architecture (ACL).
  • AMD GPUs:

    • Introduced optimized matmul implementation.
  • RISC-V-based Processors:

    • Improved pooling primitive performance for processors with RISC-V vector extension (RVV) support.
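
The Sierra Forest note above says the new optimizations are opt-in through CPU dispatcher control. As a minimal sketch of what that might look like from application code, the helper below sets the `ONEDNN_MAX_CPU_ISA` environment variable before any oneDNN call is made. The helper names `request_max_cpu_isa` and `current_max_cpu_isa` are hypothetical, the code assumes a POSIX system (`setenv`), and the value `AVX2_VNNI_2` is an assumption about which ISA level covers these optimizations; check the CPU Dispatcher Control page of the Developer Guide for the values your build accepts.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: sets ONEDNN_MAX_CPU_ISA for the current process
// using POSIX setenv. It is intended to be called before the first oneDNN
// call, on the assumption that the library reads the variable early and
// does not re-read it afterwards.
// Returns true if the variable was set, false on empty input or failure.
inline bool request_max_cpu_isa(const std::string &isa) {
    if (isa.empty()) return false;
    return setenv("ONEDNN_MAX_CPU_ISA", isa.c_str(), /*overwrite=*/1) == 0;
}

// Hypothetical helper: reads the setting back, or returns an empty
// string if the variable is not set.
inline std::string current_max_cpu_isa() {
    const char *value = std::getenv("ONEDNN_MAX_CPU_ISA");
    return value ? std::string(value) : std::string();
}
```

An application would call `request_max_cpu_isa("AVX2_VNNI_2")` at the very start of `main`, before creating any oneDNN engine or primitive. Exporting the same variable in the shell that launches the process should have the same effect without any code change.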

Functionality

  • Enabled Graph API as a production feature. Graph API is intended to simplify oneDNN integration into frameworks.
  • Added an option to zero out the weight gradient in the RNN primitive. See the corresponding RFC for details.
  • [experimental] Added support for sparse memory and for dense-by-sparse matrix-matrix multiplication in the matmul primitive. The functionality is supported on processors with Intel AVX2 and Intel AVX-512 instruction support.
  • Introduced support for out-of-order queues in the OpenCL runtime. See the OpenCL Interoperability section in the Developer Guide for more details.
  • Added support for the non-zero alpha parameter in the batch normalization ReLU post-op on Intel GPUs.
  • Enabled the layer normalization primitive with f64 datatype support on Intel GPUs.
  • Added support for per-argument scales in matmul, convolution, inner product, and reorder primitives on NVIDIA GPUs.

Validation

  • Extended benchdnn with functional and performance validation for Graph API.

Breaking Changes

  • Builds with the OpenCL runtime will fail unless Graph API is disabled with ONEDNN_BUILD_GRAPH=OFF.

Known Issues and Limitations

  • The Graph API constant cache feature is disabled with the SYCL CPU runtime due to an issue with the oneAPI DPC++ Compiler runtime. This will result in lower performance in some scenarios.

Thanks to the Contributors

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, Annop Wongwathanarat @annop-w, @arlesniak, @bdmoore1, Crefeda Rodrigues @cfRod, David Svantesson @davsva01, Fadi Arafeh @fadara01, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Pavel Zamelin @pazamelin, Pawel Piotrowicz @pawelpiotrowicz, Peter Caday @petercad, @ranzhejiang, and Sanchit Grover @sanchit-grover-intel. We would also like to thank everyone who asked questions and reported issues.