~5% performance regression for grouped GEMMs on H100 #136

@danthe3rd

Hi,
I ran some benchmarks on H100 and observed a performance regression of roughly 5% for grouped GEMMs after #112. Below are the benchmark results I get before and after:

Before #112 (03d0be3)

$ python tests/test_core.py
NVRTC version: (12, 4)
Library path:
 > ['/home/dhaziza/envs/amaia_ds3_20250709_vanilla/conda/lib/python3.10/site-packages/deep_gemm']

Testing GEMM:
 > Perf (m=   64, n= 7168, k=  576):    6 us | throughput:   93 TFLOPS,  895 GB/s
 > Perf (m=   64, n= 2112, k= 7168):   13 us | throughput:  155 TFLOPS, 1266 GB/s
 > Perf (m=   64, n=24576, k= 1536):   21 us | throughput:  230 TFLOPS, 1952 GB/s
 > Perf (m=   64, n=32768, k=  512):   12 us | throughput:  184 TFLOPS, 1796 GB/s
 > Perf (m=   64, n= 7168, k=16384):   51 us | throughput:  292 TFLOPS, 2320 GB/s
 > Perf (m=   64, n= 4096, k= 7168):   18 us | throughput:  206 TFLOPS, 1666 GB/s
 > Perf (m=   64, n= 7168, k= 2048):   10 us | throughput:  189 TFLOPS, 1582 GB/s
 > Perf (m=  128, n= 7168, k=  576):    6 us | throughput:  173 TFLOPS,  986 GB/s
 > Perf (m=  128, n= 2112, k= 7168):   14 us | throughput:  282 TFLOPS, 1209 GB/s
 > Perf (m=  128, n=24576, k= 1536):   22 us | throughput:  434 TFLOPS, 1985 GB/s
 > Perf (m=  128, n=32768, k=  512):   13 us | throughput:  336 TFLOPS, 1976 GB/s
 > Perf (m=  128, n= 7168, k=16384):   56 us | throughput:  535 TFLOPS, 2159 GB/s
 > Perf (m=  128, n= 4096, k= 7168):   20 us | throughput:  367 TFLOPS, 1531 GB/s
 > Perf (m=  128, n= 7168, k= 2048):   11 us | throughput:  336 TFLOPS, 1499 GB/s
 > Perf (m= 4096, n= 7168, k=  576):   43 us | throughput:  784 TFLOPS, 1512 GB/s
 > Perf (m= 4096, n= 2112, k= 7168):  111 us | throughput: 1121 TFLOPS,  558 GB/s
 > Perf (m= 4096, n=24576, k= 1536):  245 us | throughput: 1264 TFLOPS, 1003 GB/s
 > Perf (m= 4096, n=32768, k=  512):  147 us | throughput:  933 TFLOPS, 1949 GB/s
 > Perf (m= 4096, n= 7168, k=16384):  660 us | throughput: 1459 TFLOPS,  369 GB/s
 > Perf (m= 4096, n= 4096, k= 7168):  170 us | throughput: 1417 TFLOPS,  544 GB/s
 > Perf (m= 4096, n= 7168, k= 2048):   98 us | throughput: 1227 TFLOPS,  834 GB/s

Testing grouped contiguous GEMM:
 > Perf (num_groups= 4, expected_m_per_group=8192, n=4096, k=7168): 1510 us | throughput: 1328 TFLOPS,  425 GB/s
 > Perf (num_groups= 4, expected_m_per_group=8192, n=7168, k=2048):  771 us | throughput: 1248 TFLOPS,  773 GB/s
 > Perf (num_groups= 8, expected_m_per_group=4096, n=4096, k=7168): 1520 us | throughput: 1302 TFLOPS,  495 GB/s
 > Perf (num_groups= 8, expected_m_per_group=4096, n=7168, k=2048):  909 us | throughput: 1235 TFLOPS,  818 GB/s
 > Perf (num_groups=32, expected_m_per_group= 256, n=4096, k=7168):  561 us | throughput:  871 TFLOPS, 1903 GB/s
 > Perf (num_groups=32, expected_m_per_group= 256, n=7168, k=2048):  299 us | throughput:  843 TFLOPS, 2040 GB/s

Testing grouped masked GEMM:
 > Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168):   46 us | throughput:  988 TFLOPS,  903 GB/s
 > Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048):   34 us | throughput:  804 TFLOPS,  885 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168):   91 us | throughput:  687 TFLOPS,  823 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048):   48 us | throughput:  736 TFLOPS, 1025 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168):   95 us | throughput:  560 TFLOPS, 1379 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048):   59 us | throughput:  515 TFLOPS, 1283 GB/s

Testing weight gradient GEMM:
 > Performance (m= 7168, n= 2112, k= 4096):  138 us | throughput:  896 TFLOPS,  493 GB/s
 > Performance (m= 1536, n=24576, k= 4096):  325 us | throughput:  953 TFLOPS,  562 GB/s
 > Performance (m=  512, n=32768, k= 4096):  161 us | throughput:  854 TFLOPS, 1056 GB/s
 > Performance (m=16384, n= 7168, k= 4096): 1013 us | throughput:  950 TFLOPS,  327 GB/s
 > Performance (m= 7168, n= 4096, k= 4096):  261 us | throughput:  921 TFLOPS,  402 GB/s
 > Performance (m= 2048, n= 7168, k= 4096):  134 us | throughput:  900 TFLOPS,  502 GB/s
 > Performance (m= 7168, n= 2112, k= 8192):  246 us | throughput: 1009 TFLOPS,  432 GB/s
 > Performance (m= 1536, n=24576, k= 8192):  597 us | throughput: 1036 TFLOPS,  485 GB/s
 > Performance (m=  512, n=32768, k= 8192):  280 us | throughput:  982 TFLOPS, 1093 GB/s
 > Performance (m=16384, n= 7168, k= 8192): 1895 us | throughput: 1015 TFLOPS,  226 GB/s
 > Performance (m= 7168, n= 4096, k= 8192):  473 us | throughput: 1017 TFLOPS,  319 GB/s
 > Performance (m= 2048, n= 7168, k= 8192):  239 us | throughput: 1008 TFLOPS,  439 GB/s

Testing grouped weight gradient GEMM:
 > Performance (num_groups=4, m= 7168, n= 4096, avg_k= 4096): 1059 us | throughput:  908 TFLOPS,  396 GB/s
 > Performance (num_groups=4, m= 2048, n= 7168, avg_k= 4096):  547 us | throughput:  880 TFLOPS,  491 GB/s
 > Performance (num_groups=4, m= 7168, n= 4096, avg_k= 8192): 1935 us | throughput:  994 TFLOPS,  312 GB/s
 > Performance (num_groups=4, m= 2048, n= 7168, avg_k= 8192):  974 us | throughput:  987 TFLOPS,  430 GB/s
 > Performance (num_groups=8, m= 7168, n= 4096, avg_k= 4096): 2121 us | throughput:  907 TFLOPS,  395 GB/s
 > Performance (num_groups=8, m= 2048, n= 7168, avg_k= 4096): 1089 us | throughput:  883 TFLOPS,  493 GB/s

After update (main) - PT 2.8 nightly cu129

$ python tests/test_core.py
Testing GEMM:
 > Perf (m=  128, n= 2112, k= 7168, 1D2D, layout=NT, BF16, acc=0): launch   50 us |   13 us |  293 TFLOPS | 1256 GB/s
 > Perf (m=  128, n=24576, k= 1536, 1D2D, layout=NT, BF16, acc=0): launch   51 us |   22 us |  441 TFLOPS | 2018 GB/s
 > Perf (m=  128, n=32768, k=  512, 1D2D, layout=NT, BF16, acc=0): launch   56 us |   13 us |  338 TFLOPS | 1988 GB/s
 > Perf (m=  128, n= 7168, k=16384, 1D2D, layout=NT, BF16, acc=0): launch   50 us |   57 us |  530 TFLOPS | 2142 GB/s
 > Perf (m=  128, n= 4096, k= 7168, 1D2D, layout=NT, BF16, acc=0): launch   53 us |   20 us |  370 TFLOPS | 1544 GB/s
 > Perf (m=  128, n= 7168, k= 2048, 1D2D, layout=NT, BF16, acc=0): launch   58 us |   11 us |  344 TFLOPS | 1538 GB/s
 > Perf (m= 4096, n= 2112, k= 7168, 1D2D, layout=NT, BF16, acc=0): launch   56 us |  108 us | 1148 TFLOPS |  580 GB/s
 > Perf (m= 4096, n=24576, k= 1536, 1D2D, layout=NT, BF16, acc=0): launch   51 us |  245 us | 1261 TFLOPS | 1002 GB/s
 > Perf (m= 4096, n=32768, k=  512, 1D2D, layout=NT, BF16, acc=0): launch   53 us |  147 us |  938 TFLOPS | 1961 GB/s
 > Perf (m= 4096, n= 7168, k=16384, 1D2D, layout=NT, BF16, acc=0): launch   48 us |  659 us | 1461 TFLOPS |  373 GB/s
 > Perf (m= 4096, n= 4096, k= 7168, 1D2D, layout=NT, BF16, acc=0): launch   50 us |  169 us | 1424 TFLOPS |  552 GB/s
 > Perf (m= 4096, n= 7168, k= 2048, 1D2D, layout=NT, BF16, acc=0): launch   50 us |   97 us | 1238 TFLOPS |  845 GB/s

Testing m-grouped contiguous GEMM:
 > Perf (num_groups=4, m=35456, n= 4096, k= 7168, 1D2D, layout=NT): 1658 us | 1256 TFLOPS |  404 GB/s
 > Perf (num_groups=4, m=36096, n= 7168, k= 2048, 1D2D, layout=NT):  892 us | 1189 TFLOPS |  732 GB/s
 > Perf (num_groups=8, m=32384, n= 4096, k= 7168, 1D2D, layout=NT): 1554 us | 1224 TFLOPS |  476 GB/s
 > Perf (num_groups=8, m=31232, n= 7168, k= 2048, 1D2D, layout=NT):  779 us | 1178 TFLOPS |  811 GB/s

Testing m-grouped masked GEMM:
 > Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 1D2D):   79 us |  778 TFLOPS |  578 GB/s
 > Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048, 1D2D):   45 us |  729 TFLOPS |  733 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168, 1D2D):   93 us |  735 TFLOPS |  830 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048, 1D2D):   44 us |  655 TFLOPS | 1034 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168, 1D2D):   95 us |  682 TFLOPS | 1413 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048, 1D2D):   47 us |  570 TFLOPS | 1578 GB/s

Testing k-grouped contiguous GEMM:
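
For reference, the ~5% figure can be checked against the grouped contiguous GEMM rows in the two logs above. The snippet below is only a rough sketch: the TFLOPS values are hand-copied from those rows, the helper name `relative_change` is made up for illustration, and it compares throughput rather than time because the total `m` differs between the two runs.

```python
# TFLOPS hand-copied from the grouped contiguous GEMM rows of the two logs,
# keyed by (num_groups, n, k). Only shapes present in both runs are listed.
before_tflops = {
    (4, 4096, 7168): 1328,
    (4, 7168, 2048): 1248,
    (8, 4096, 7168): 1302,
    (8, 7168, 2048): 1235,
}
after_tflops = {
    (4, 4096, 7168): 1256,
    (4, 7168, 2048): 1189,
    (8, 4096, 7168): 1224,
    (8, 7168, 2048): 1178,
}


def relative_change(before, after):
    """Return the percent change in throughput for every shape in both dicts.

    Negative values mean the "after" run is slower.
    """
    return {
        shape: (after[shape] - before[shape]) / before[shape] * 100.0
        for shape in before
        if shape in after
    }


changes = relative_change(before_tflops, after_tflops)
mean_change = sum(changes.values()) / len(changes)

for (num_groups, n, k), pct in changes.items():
    print(f"num_groups={num_groups}, n={n}, k={k}: {pct:+.1f}%")
print(f"mean change: {mean_change:+.1f}%")
```

Every shape comes out slower, by somewhere between 4% and 6%. The masked GEMM rows are more mixed, so they are left out of this comparison.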
