clBLAS-2.8.0 Release for ACL 1.0 Beta 2
This clBLAS release, tagged v2.8, is part of AMD Compute Libraries (ACL) 1.0 beta 2. It is based on a merge from the develop branch into the master branch.
The highlights of the release:
- Introduced AutoGemm, the new high-performance GEneral Matrix-Matrix multiplication (GEMM) backend for clBLAS. AutoGemm is a suite of Python scripts which:
  - generates thousands of optimized GEMM OpenCL kernels
  - benchmarks these kernels for a particular GPU and different matrix sizes to determine which are the fastest
  - automatically chooses the optimal kernel within clBLAS for peak performance
  - allows applications with unique GEMM requirements (such as very small or very skinny matrices) to generate customized, application-specific GEMM kernels for additional performance
- Incorporated a new, faster DTRSM algorithm that:
  - enables the use of a more hardware-friendly algorithm for both online and offline compilation
  - leverages the DGEMM performance improvement from AutoGemm
- MISC
  - fixes an SGEMM performance drop at large multiples of 1024
  - fixes a DGEMM performance drop at large sizes (ranging from 18000 by 18000 to 36000 by 36000)
  - supports Visual Studio 2015
  - adds CI support for Windows and Mac OS
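
The benchmark-then-select idea behind AutoGemm can be illustrated with a minimal Python sketch. This is not AutoGemm's actual code or data format: the kernel names, the timings, the single-integer "size" key, and the helper functions `fastest_kernel_per_size` and `choose_kernel` are all hypothetical, chosen only to show how benchmark results might be reduced to a lookup table and then queried.

```python
from typing import Dict, List, Tuple

# Hypothetical benchmark record: (kernel_name, matrix_size, seconds).
# Names and timings are invented for illustration, not real AutoGemm output.
Benchmark = Tuple[str, int, float]


def fastest_kernel_per_size(results: List[Benchmark]) -> Dict[int, str]:
    """For each benchmarked size, keep the kernel with the lowest time."""
    best: Dict[int, Tuple[str, float]] = {}
    for kernel, size, seconds in results:
        if size not in best or seconds < best[size][1]:
            best[size] = (kernel, seconds)
    return {size: kernel for size, (kernel, _) in best.items()}


def choose_kernel(table: Dict[int, str], size: int) -> str:
    """Pick the kernel tuned for the largest benchmarked size <= size.

    Falls back to the smallest benchmarked size when the request is
    smaller than anything that was benchmarked.
    """
    candidates = [s for s in table if s <= size]
    key = max(candidates) if candidates else min(table)
    return table[key]


results = [
    ("tile_16x16", 64, 0.9e-4),
    ("tile_32x32", 64, 1.4e-4),
    ("tile_16x16", 1024, 5.1e-3),
    ("tile_32x32", 1024, 3.2e-3),
]
table = fastest_kernel_per_size(results)
print(table)
print(choose_kernel(table, 2000))
```

In this sketch the selection rule is a simple threshold on one dimension; a real selector would presumably key on M, N, K, precision, and transpose options as well.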