This repository has been archived by the owner on Aug 11, 2020. It is now read-only.

Improve batch gemm performance using MKL #342

Merged: 6 commits merged into dmlc:master on Jun 23, 2018

Conversation

xinyu-intel
Member

This PR improves the performance of small-matrix batch GEMM by around 5-10x by using MKL. The optimization will be useful for the attention layers in Sockeye.

Performance comparison:

1000 loops:

| size | mshadow | MKL |
| --- | --- | --- |
| [1120, 10, 256] * [1120, 256, 10] | 1.4739921093 | 0.180208921432 |
| [1120, 40, 512] * [1120, 512, 1] | 3.45011711121 | 0.670109033585 |
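For context, a batch GEMM performs one independent matrix multiply per entry along the leading batch dimension. A minimal NumPy sketch of the first benchmark row's shapes (illustration of the semantics only, not the mshadow/MKL code path):

```python
import numpy as np

# Shapes follow the first benchmark row: [1120, 10, 256] * [1120, 256, 10].
batch, m, k, n = 1120, 10, 256, 10
a = np.random.rand(batch, m, k).astype(np.float32)
b = np.random.rand(batch, k, n).astype(np.float32)

# np.matmul broadcasts over the batch dimension, i.e.
# c[i] = a[i] @ b[i] for every i in range(batch).
c = np.matmul(a, b)
print(c.shape)  # (1120, 10, 10)
```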

@pengzhao-intel

@pengzhao-intel

FYI, @fhieber @tdomhan @mjpost

@pengzhao-intel

@piiswrong @sxjscience please help review :) thanks in advance.

@@ -291,11 +292,48 @@ struct BLASEngine<cpu, float> {
const float *A, int lda, const float *B, int ldb,
float beta, float *C, int ldc, int batch_count,
float **workspace) {
#if MSHADOW_USE_MKL
Member


Are cblas_sgemm_batch and cblas_dgemm_batch generally supported in MKL? Do we need to check the version?

Member Author


According to this page, Intel MKL 11.3 Beta (part of Intel® Parallel Studio XE 2016 Beta) includes a new flavor of GEMM feature called "Batch GEMM".

@piiswrong piiswrong merged commit 757a91c into dmlc:master Jun 23, 2018