
[QST] Slower than cublas kernel with same tile size on A100 #430

Closed
minminsun opened this issue Mar 15, 2022 · 16 comments
@minminsun
Contributor

I compared cutlass with cublas on a GEMM with M=3072, N=2048, K=768 on an A100. It turned out the cutlass kernel is more than 10% slower than the cublas kernel, even with the same tile size. Cublas picks the kernel ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nn, which takes 62.4 us, while the kernel cutlass_tensorop_f16_s16816gemm_f16_128x128_32x5_nn_align8 from the cutlass library takes 69.2 us.

So my question is: what makes the cutlass kernel slower than the cublas kernel at the same tile size?

Thanks!
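For scale, the quoted timings translate into achieved throughput using the standard 2*M*N*K FLOP count for GEMM (a minimal sketch; the timings are the ones reported above):

```python
# Achieved throughput for the GEMM sizes and timings quoted above.
M, N, K = 3072, 2048, 768
flops = 2 * M * N * K  # each multiply-add counted as 2 FLOPs

for name, seconds in [("cublas", 62.4e-6), ("cutlass", 69.2e-6)]:
    tflops = flops / seconds / 1e12
    print(f"{name}: {tflops:.1f} TFLOP/s")
# -> cublas: 154.9 TFLOP/s
# -> cutlass: 139.6 TFLOP/s
```

Both are well below the A100's 312 TFLOP/s FP16 tensor-core peak, which is expected for a K this small.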

@hwu36
Collaborator

hwu36 commented Mar 15, 2022

What is your compiler version?

@minminsun
Contributor Author

What is your compiler version?

It's "release 11.2, V11.2.152"

@hwu36
Collaborator

hwu36 commented Mar 15, 2022

11.2 is not enough to get the best performance. Below are the minimum CUDA versions for different kinds of kernels.

CUDA 11.3: GEMM
CUDA 11.4: Conv
CUDA 11.5: Sparse
CUDA 11.6: TF32x3

You will see significant performance improvement. Enjoy!
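A minimal rebuild sketch, assuming CUDA 11.6 is installed under /usr/local/cuda-11.6 (the install path and build options here are illustrative):

```shell
# Make sure CMake picks up the newer nvcc, and rebuild from a clean tree.
export CUDACXX=/usr/local/cuda-11.6/bin/nvcc
rm -rf build && mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=80   # SM80 targets A100
make cutlass_profiler -j
```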

@minminsun
Contributor Author

I recompiled the library with CUDA 11.6, but got the same result.

@hwu36
Collaborator

hwu36 commented Mar 15, 2022

If you open CMakeCache.txt, does CMAKE_CUDA_COMPILER point to the 11.6 compiler?

@minminsun
Contributor Author

I cleaned the build directory and then rebuilt. Now the cutlass library is really compiled with CUDA 11.6, and the performance of the cutlass kernel is better, although it is still slightly slower than the cublas kernel.

The current comparison of GEMM(M=3072, N=2048, K=768) on A100:
nsight gpu-time: cutlass 64.8us, cublas 62.4us
measured time: cutlass 55.3us, cublas 48.3us

@hwu36
Collaborator

hwu36 commented Mar 16, 2022

It's known that cutlass kernels can be slightly slower than cublas. See the perf chart in the README.

To compare performance apples to apples, we need to lock the frequency. Here are the steps needed on a 400 W A100:

sudo nvidia-smi -i 0 -pm 1              # persistence mode
sudo nvidia-smi -lgc 1005 -i 0          # lock to 1005 MHz; the max on A100 is 1410 MHz
sudo nvidia-smi --power-limit=400 -i 0  # lock to 400 W

BTW, the way to reset the clocks is

sudo nvidia-smi -rgc

cutlass also has a profiler to measure performance, so you don't need to use nsight or anything else.
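For example, something along these lines (the problem-size and filter flags follow the profiler's usage text; the kernel filter is the one named above):

```shell
# Profile the given GEMM with the CUTLASS profiler; it prints
# runtime and GFLOP/s for each matching kernel.
./tools/profiler/cutlass_profiler --operation=Gemm \
  --m=3072 --n=2048 --k=768 \
  --kernels=cutlass_tensorop_f16_s16816gemm_f16_128x128_32x5_nn_align8
```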

@minminsun
Contributor Author

With the frequency locked, cutlass gets the same performance as cublas. Cool!

It seems that cublas runs at the max frequency when it's not locked, but cutlass doesn't.

@mnicely mnicely closed this as completed Mar 22, 2022
@yzhaiustc
Contributor

yzhaiustc commented Mar 23, 2022

May I know why you benchmark the performance with the frequency locked? In other words, why does locking the frequency give us an "apples-to-apples" perf comparison?

Thank you! @hwu36

@hwu36
Collaborator

hwu36 commented Mar 23, 2022

Performance is linear in the frequency, and the frequency changes dynamically. To compare performance fairly, we want every run to use the same frequency.
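A minimal sketch of that linearity (assuming a compute-bound kernel whose runtime scales inversely with the SM clock; the pairing of this particular timing with this clock is illustrative):

```python
# If performance is linear in clock frequency, runtime scales as 1/f.
f_locked, f_max = 1005.0, 1410.0  # MHz: locked clock vs. A100 max boost
t_at_locked = 64.8e-6             # s: a kernel runtime measured at f_locked

# Predicted runtime for the same kernel at the max clock:
t_at_max = t_at_locked * f_locked / f_max
print(f"{t_at_max * 1e6:.1f} us")  # -> 46.2 us
```

This is why comparing one run at a boosted clock against another at a throttled clock can show a double-digit percentage gap that has nothing to do with the kernels themselves.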

@yzhaiustc
Contributor

Thanks for the prompt response! I fully agree that the frequency changes dynamically at runtime.
My understanding is that the frequency typically starts to decrease to save energy when the compute units become idle due to memory latency. cuBLAS seems better at taking advantage of the dynamic voltage/frequency scaling strategy, and it can better keep the compute units busy.

If that is the case, a perf benchmark with no frequency/power locks may also be of interest to some users :)

@hwu36
Collaborator

hwu36 commented Mar 23, 2022

My understanding is that the frequency typically starts to decrease to save energy when the compute units become idle due to memory latency. cuBLAS seems better at taking advantage of the dynamic voltage/frequency scaling strategy, and it can better keep the compute units busy.

No, this is not true.

@yzhaiustc
Contributor

Oh, gosh.
Just curious: how then should I understand the gap between cublas and cutlass diminishing under a locked frequency? Thanks a lot for your time.

@hwu36
Collaborator

hwu36 commented Mar 23, 2022

Mostly just the noise.

@yzhaiustc
Contributor

Thanks for the comment!

@yzhaiustc
Contributor

Mostly just the noise.

Oh, I just got time to update. Days ago I got the results: when correctly linking to an updated nvcc and with the optimal cutlass parameter setup, CUTLASS obtains fairly similar performance to cuBLAS, even without the frequency locked.

What an amazing project. Thank you.
