~5% performance regression for grouped GEMMs on H100

Hi,
I ran some benchmarks and observed a performance regression after #112 on H100. Below are the benchmark results I get before and after:

## Before #112 (03d0be3d2d03b6eed3c99d683c0620949a13a826)
```
$ python tests/test_core.py
NVRTC version: (12, 4)
Library path:
 > ['/home/dhaziza/envs/amaia_ds3_20250709_vanilla/conda/lib/python3.10/site-packages/deep_gemm']

Testing GEMM:
 > Perf (m=   64, n= 7168, k=  576):    6 us | throughput:   93 TFLOPS,  895 GB/s
 > Perf (m=   64, n= 2112, k= 7168):   13 us | throughput:  155 TFLOPS, 1266 GB/s
 > Perf (m=   64, n=24576, k= 1536):   21 us | throughput:  230 TFLOPS, 1952 GB/s
 > Perf (m=   64, n=32768, k=  512):   12 us | throughput:  184 TFLOPS, 1796 GB/s
 > Perf (m=   64, n= 7168, k=16384):   51 us | throughput:  292 TFLOPS, 2320 GB/s
 > Perf (m=   64, n= 4096, k= 7168):   18 us | throughput:  206 TFLOPS, 1666 GB/s
 > Perf (m=   64, n= 7168, k= 2048):   10 us | throughput:  189 TFLOPS, 1582 GB/s
 > Perf (m=  128, n= 7168, k=  576):    6 us | throughput:  173 TFLOPS,  986 GB/s
 > Perf (m=  128, n= 2112, k= 7168):   14 us | throughput:  282 TFLOPS, 1209 GB/s
 > Perf (m=  128, n=24576, k= 1536):   22 us | throughput:  434 TFLOPS, 1985 GB/s
 > Perf (m=  128, n=32768, k=  512):   13 us | throughput:  336 TFLOPS, 1976 GB/s
 > Perf (m=  128, n= 7168, k=16384):   56 us | throughput:  535 TFLOPS, 2159 GB/s
 > Perf (m=  128, n= 4096, k= 7168):   20 us | throughput:  367 TFLOPS, 1531 GB/s
 > Perf (m=  128, n= 7168, k= 2048):   11 us | throughput:  336 TFLOPS, 1499 GB/s
 > Perf (m= 4096, n= 7168, k=  576):   43 us | throughput:  784 TFLOPS, 1512 GB/s
 > Perf (m= 4096, n= 2112, k= 7168):  111 us | throughput: 1121 TFLOPS,  558 GB/s
 > Perf (m= 4096, n=24576, k= 1536):  245 us | throughput: 1264 TFLOPS, 1003 GB/s
 > Perf (m= 4096, n=32768, k=  512):  147 us | throughput:  933 TFLOPS, 1949 GB/s
 > Perf (m= 4096, n= 7168, k=16384):  660 us | throughput: 1459 TFLOPS,  369 GB/s
 > Perf (m= 4096, n= 4096, k= 7168):  170 us | throughput: 1417 TFLOPS,  544 GB/s
 > Perf (m= 4096, n= 7168, k= 2048):   98 us | throughput: 1227 TFLOPS,  834 GB/s

Testing grouped contiguous GEMM:
 > Perf (num_groups= 4, expected_m_per_group=8192, n=4096, k=7168): 1510 us | throughput: 1328 TFLOPS,  425 GB/s
 > Perf (num_groups= 4, expected_m_per_group=8192, n=7168, k=2048):  771 us | throughput: 1248 TFLOPS,  773 GB/s
 > Perf (num_groups= 8, expected_m_per_group=4096, n=4096, k=7168): 1520 us | throughput: 1302 TFLOPS,  495 GB/s
 > Perf (num_groups= 8, expected_m_per_group=4096, n=7168, k=2048):  909 us | throughput: 1235 TFLOPS,  818 GB/s
 > Perf (num_groups=32, expected_m_per_group= 256, n=4096, k=7168):  561 us | throughput:  871 TFLOPS, 1903 GB/s
 > Perf (num_groups=32, expected_m_per_group= 256, n=7168, k=2048):  299 us | throughput:  843 TFLOPS, 2040 GB/s

Testing grouped masked GEMM:
 > Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168):   46 us | throughput:  988 TFLOPS,  903 GB/s
 > Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048):   34 us | throughput:  804 TFLOPS,  885 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168):   91 us | throughput:  687 TFLOPS,  823 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048):   48 us | throughput:  736 TFLOPS, 1025 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168):   95 us | throughput:  560 TFLOPS, 1379 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048):   59 us | throughput:  515 TFLOPS, 1283 GB/s

Testing weight gradient GEMM:
 > Performance (m= 7168, n= 2112, k= 4096):  138 us | throughput:  896 TFLOPS,  493 GB/s
 > Performance (m= 1536, n=24576, k= 4096):  325 us | throughput:  953 TFLOPS,  562 GB/s
 > Performance (m=  512, n=32768, k= 4096):  161 us | throughput:  854 TFLOPS, 1056 GB/s
 > Performance (m=16384, n= 7168, k= 4096): 1013 us | throughput:  950 TFLOPS,  327 GB/s
 > Performance (m= 7168, n= 4096, k= 4096):  261 us | throughput:  921 TFLOPS,  402 GB/s
 > Performance (m= 2048, n= 7168, k= 4096):  134 us | throughput:  900 TFLOPS,  502 GB/s
 > Performance (m= 7168, n= 2112, k= 8192):  246 us | throughput: 1009 TFLOPS,  432 GB/s
 > Performance (m= 1536, n=24576, k= 8192):  597 us | throughput: 1036 TFLOPS,  485 GB/s
 > Performance (m=  512, n=32768, k= 8192):  280 us | throughput:  982 TFLOPS, 1093 GB/s
 > Performance (m=16384, n= 7168, k= 8192): 1895 us | throughput: 1015 TFLOPS,  226 GB/s
 > Performance (m= 7168, n= 4096, k= 8192):  473 us | throughput: 1017 TFLOPS,  319 GB/s
 > Performance (m= 2048, n= 7168, k= 8192):  239 us | throughput: 1008 TFLOPS,  439 GB/s

Testing grouped weight gradient GEMM:
 > Performance (num_groups=4, m= 7168, n= 4096, avg_k= 4096): 1059 us | throughput:  908 TFLOPS,  396 GB/s
 > Performance (num_groups=4, m= 2048, n= 7168, avg_k= 4096):  547 us | throughput:  880 TFLOPS,  491 GB/s
 > Performance (num_groups=4, m= 7168, n= 4096, avg_k= 8192): 1935 us | throughput:  994 TFLOPS,  312 GB/s
 > Performance (num_groups=4, m= 2048, n= 7168, avg_k= 8192):  974 us | throughput:  987 TFLOPS,  430 GB/s
 > Performance (num_groups=8, m= 7168, n= 4096, avg_k= 4096): 2121 us | throughput:  907 TFLOPS,  395 GB/s
 > Performance (num_groups=8, m= 2048, n= 7168, avg_k= 4096): 1089 us | throughput:  883 TFLOPS,  493 GB/s
```

## After update (main), PyTorch 2.8 nightly cu129
```
$ python tests/test_core.py
Testing GEMM:
 > Perf (m=  128, n= 2112, k= 7168, 1D2D, layout=NT, BF16, acc=0): launch   50 us |   13 us |  293 TFLOPS | 1256 GB/s
 > Perf (m=  128, n=24576, k= 1536, 1D2D, layout=NT, BF16, acc=0): launch   51 us |   22 us |  441 TFLOPS | 2018 GB/s
 > Perf (m=  128, n=32768, k=  512, 1D2D, layout=NT, BF16, acc=0): launch   56 us |   13 us |  338 TFLOPS | 1988 GB/s
 > Perf (m=  128, n= 7168, k=16384, 1D2D, layout=NT, BF16, acc=0): launch   50 us |   57 us |  530 TFLOPS | 2142 GB/s
 > Perf (m=  128, n= 4096, k= 7168, 1D2D, layout=NT, BF16, acc=0): launch   53 us |   20 us |  370 TFLOPS | 1544 GB/s
 > Perf (m=  128, n= 7168, k= 2048, 1D2D, layout=NT, BF16, acc=0): launch   58 us |   11 us |  344 TFLOPS | 1538 GB/s
 > Perf (m= 4096, n= 2112, k= 7168, 1D2D, layout=NT, BF16, acc=0): launch   56 us |  108 us | 1148 TFLOPS |  580 GB/s
 > Perf (m= 4096, n=24576, k= 1536, 1D2D, layout=NT, BF16, acc=0): launch   51 us |  245 us | 1261 TFLOPS | 1002 GB/s
 > Perf (m= 4096, n=32768, k=  512, 1D2D, layout=NT, BF16, acc=0): launch   53 us |  147 us |  938 TFLOPS | 1961 GB/s
 > Perf (m= 4096, n= 7168, k=16384, 1D2D, layout=NT, BF16, acc=0): launch   48 us |  659 us | 1461 TFLOPS |  373 GB/s
 > Perf (m= 4096, n= 4096, k= 7168, 1D2D, layout=NT, BF16, acc=0): launch   50 us |  169 us | 1424 TFLOPS |  552 GB/s
 > Perf (m= 4096, n= 7168, k= 2048, 1D2D, layout=NT, BF16, acc=0): launch   50 us |   97 us | 1238 TFLOPS |  845 GB/s

Testing m-grouped contiguous GEMM:
 > Perf (num_groups=4, m=35456, n= 4096, k= 7168, 1D2D, layout=NT): 1658 us | 1256 TFLOPS |  404 GB/s
 > Perf (num_groups=4, m=36096, n= 7168, k= 2048, 1D2D, layout=NT):  892 us | 1189 TFLOPS |  732 GB/s
 > Perf (num_groups=8, m=32384, n= 4096, k= 7168, 1D2D, layout=NT): 1554 us | 1224 TFLOPS |  476 GB/s
 > Perf (num_groups=8, m=31232, n= 7168, k= 2048, 1D2D, layout=NT):  779 us | 1178 TFLOPS |  811 GB/s

Testing m-grouped masked GEMM:
 > Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 1D2D):   79 us |  778 TFLOPS |  578 GB/s
 > Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048, 1D2D):   45 us |  729 TFLOPS |  733 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168, 1D2D):   93 us |  735 TFLOPS |  830 GB/s
 > Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048, 1D2D):   44 us |  655 TFLOPS | 1034 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168, 1D2D):   95 us |  682 TFLOPS | 1413 GB/s
 > Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048, 1D2D):   47 us |  570 TFLOPS | 1578 GB/s

Testing k-grouped contiguous GEMM:
```
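
For reference, here is a minimal sketch of how the ~5% figure can be read off the two logs above. It assumes the grouped contiguous rows can be paired by `(num_groups, n, k)`. The `m` values differ between the two runs (the old test prints `expected_m_per_group`, the new one prints a total `m`), so it compares only the reported TFLOPS, not the raw timings. The helper name `relative_change` is mine, not part of DeepGEMM.

```python
# Reported throughput (TFLOPS) for grouped contiguous GEMM, copied from the logs above.
# Keys are (num_groups, n, k); pairing rows this way is an assumption, since the
# actual m differs between the two test runs.
before = {
    (4, 4096, 7168): 1328,
    (4, 7168, 2048): 1248,
    (8, 4096, 7168): 1302,
    (8, 7168, 2048): 1235,
}
after = {
    (4, 4096, 7168): 1256,
    (4, 7168, 2048): 1189,
    (8, 4096, 7168): 1224,
    (8, 7168, 2048): 1178,
}


def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


# Per-shape change in throughput; negative means slower after the update.
changes = {key: relative_change(before[key], after[key]) for key in before}

for (num_groups, n, k), pct in changes.items():
    print(f"num_groups={num_groups}, n={n}, k={k}: {pct:+.1f}%")

mean_change = sum(changes.values()) / len(changes)
print(f"mean change: {mean_change:+.1f}%")
```

All four shapes come out between roughly -4.6% and -6.0%, which is where the ~5% in the title comes from. The masked GEMM rows are more mixed (some faster, some slower), so they are left out of this estimate.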