Production release of COSMA
This is the first production release of COSMA. It brings many bug fixes and performance improvements. The most important updates are:
- Faster GPU backend:
  - pinning/unpinning of host memory amortized
  - better stream synchronization
  - tiling mechanism improved
- Faster memory access: using 2 MB huge pages
- Highly optimized pxgemm (ScaLAPACK) wrapper:
  - layout transformation optimized using maximum-weight perfect matching
  - COSMA can use the initial layout directly when the layout transformation would be too expensive.
- Portability:
  - Hybrid version: ported to both NVIDIA and AMD GPUs.
  - CPU-only version: supports MKL, OpenBLAS, Cray-libsci and custom gemm backends.
- Usability:
  - Trivial integration: it is enough to link to the library; no changes to user code are needed.
  - Spack-installable
- Bug fixes:
  - correctness tested on up to 1024 nodes of the Piz Daint supercomputer (Cray XC50).
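
To illustrate the idea behind the matching-based layout transformation mentioned above, here is a minimal sketch. It is not COSMA's implementation. It assumes a hypothetical `overlap` matrix in which `overlap[i][j]` counts the elements that rank `i` already holds and that block `j` of the target layout needs. Choosing which rank owns which target block so that the most data stays local is a maximum-weight perfect matching in a complete bipartite graph. For brevity the sketch tries every permutation, which is only feasible for a handful of ranks; a real solver would use something like the Hungarian algorithm.

```python
from itertools import permutations


def best_relabeling(overlap):
    """Return (perm, local), where perm[j] is the rank that owns target block j.

    overlap[i][j] is a hypothetical count of elements that rank i already
    holds and that target block j needs. Brute force, for illustration only.
    """
    n = len(overlap)
    best_perm, best_local = None, -1
    for perm in permutations(range(n)):
        # Data that does not have to move if block j is assigned to rank perm[j].
        local = sum(overlap[perm[j]][j] for j in range(n))
        if local > best_local:
            best_perm, best_local = perm, local
    return best_perm, best_local


# Made-up overlap counts for three ranks and three target blocks.
overlap = [
    [1, 8, 1],
    [7, 1, 2],
    [2, 1, 7],
]

total = sum(map(sum, overlap))
perm, local = best_relabeling(overlap)
print(perm, local, total - local)  # prints "(1, 0, 2) 22 8"
```

With the identity assignment (block `j` on rank `j`) only 9 of the 30 elements would stay in place, so 21 would have to be sent. The matched assignment keeps 22 local and moves 8.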