Hi,
I ran some benchmarks and observed a performance regression after #112 on H100. Below are the benchmark results I get before and after the change:
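For reference when reading the logs, the TFLOPS and GB/s columns can be reconstructed from the problem shape and the kernel time. A minimal sketch, assuming FP8 (1-byte) inputs and a BF16 (2-byte) output and ignoring the per-block scaling factors, so the figures are approximate:

def throughput(m: int, n: int, k: int, time_us: float) -> tuple[float, float]:
    # One multiply-add per (m, n, k) triple; A and B are FP8 (1 byte), D is BF16 (2 bytes).
    flops = 2 * m * n * k
    num_bytes = m * k + k * n + 2 * m * n
    tflops = flops / (time_us * 1e-6) / 1e12
    gb_per_s = num_bytes / (time_us * 1e-6) / 1e9
    return tflops, gb_per_s

# Example: the (m=4096, n=7168, k=16384) case at 660 us works out to
# roughly 1458 TFLOPS and 369 GB/s, matching the log below.
print(throughput(4096, 7168, 16384, 660))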
Before #112 (03d0be3)
$ python tests/test_core.py
NVRTC version: (12, 4)
Library path:
> ['/home/dhaziza/envs/amaia_ds3_20250709_vanilla/conda/lib/python3.10/site-packages/deep_gemm']
Testing GEMM:
> Perf (m= 64, n= 7168, k= 576): 6 us | throughput: 93 TFLOPS, 895 GB/s
> Perf (m= 64, n= 2112, k= 7168): 13 us | throughput: 155 TFLOPS, 1266 GB/s
> Perf (m= 64, n=24576, k= 1536): 21 us | throughput: 230 TFLOPS, 1952 GB/s
> Perf (m= 64, n=32768, k= 512): 12 us | throughput: 184 TFLOPS, 1796 GB/s
> Perf (m= 64, n= 7168, k=16384): 51 us | throughput: 292 TFLOPS, 2320 GB/s
> Perf (m= 64, n= 4096, k= 7168): 18 us | throughput: 206 TFLOPS, 1666 GB/s
> Perf (m= 64, n= 7168, k= 2048): 10 us | throughput: 189 TFLOPS, 1582 GB/s
> Perf (m= 128, n= 7168, k= 576): 6 us | throughput: 173 TFLOPS, 986 GB/s
> Perf (m= 128, n= 2112, k= 7168): 14 us | throughput: 282 TFLOPS, 1209 GB/s
> Perf (m= 128, n=24576, k= 1536): 22 us | throughput: 434 TFLOPS, 1985 GB/s
> Perf (m= 128, n=32768, k= 512): 13 us | throughput: 336 TFLOPS, 1976 GB/s
> Perf (m= 128, n= 7168, k=16384): 56 us | throughput: 535 TFLOPS, 2159 GB/s
> Perf (m= 128, n= 4096, k= 7168): 20 us | throughput: 367 TFLOPS, 1531 GB/s
> Perf (m= 128, n= 7168, k= 2048): 11 us | throughput: 336 TFLOPS, 1499 GB/s
> Perf (m= 4096, n= 7168, k= 576): 43 us | throughput: 784 TFLOPS, 1512 GB/s
> Perf (m= 4096, n= 2112, k= 7168): 111 us | throughput: 1121 TFLOPS, 558 GB/s
> Perf (m= 4096, n=24576, k= 1536): 245 us | throughput: 1264 TFLOPS, 1003 GB/s
> Perf (m= 4096, n=32768, k= 512): 147 us | throughput: 933 TFLOPS, 1949 GB/s
> Perf (m= 4096, n= 7168, k=16384): 660 us | throughput: 1459 TFLOPS, 369 GB/s
> Perf (m= 4096, n= 4096, k= 7168): 170 us | throughput: 1417 TFLOPS, 544 GB/s
> Perf (m= 4096, n= 7168, k= 2048): 98 us | throughput: 1227 TFLOPS, 834 GB/s
Testing grouped contiguous GEMM:
> Perf (num_groups= 4, expected_m_per_group=8192, n=4096, k=7168): 1510 us | throughput: 1328 TFLOPS, 425 GB/s
> Perf (num_groups= 4, expected_m_per_group=8192, n=7168, k=2048): 771 us | throughput: 1248 TFLOPS, 773 GB/s
> Perf (num_groups= 8, expected_m_per_group=4096, n=4096, k=7168): 1520 us | throughput: 1302 TFLOPS, 495 GB/s
> Perf (num_groups= 8, expected_m_per_group=4096, n=7168, k=2048): 909 us | throughput: 1235 TFLOPS, 818 GB/s
> Perf (num_groups=32, expected_m_per_group= 256, n=4096, k=7168): 561 us | throughput: 871 TFLOPS, 1903 GB/s
> Perf (num_groups=32, expected_m_per_group= 256, n=7168, k=2048): 299 us | throughput: 843 TFLOPS, 2040 GB/s
Testing grouped masked GEMM:
> Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168): 46 us | throughput: 988 TFLOPS, 903 GB/s
> Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048): 34 us | throughput: 804 TFLOPS, 885 GB/s
> Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168): 91 us | throughput: 687 TFLOPS, 823 GB/s
> Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048): 48 us | throughput: 736 TFLOPS, 1025 GB/s
> Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168): 95 us | throughput: 560 TFLOPS, 1379 GB/s
> Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048): 59 us | throughput: 515 TFLOPS, 1283 GB/s
Testing weight gradient GEMM:
> Performance (m= 7168, n= 2112, k= 4096): 138 us | throughput: 896 TFLOPS, 493 GB/s
> Performance (m= 1536, n=24576, k= 4096): 325 us | throughput: 953 TFLOPS, 562 GB/s
> Performance (m= 512, n=32768, k= 4096): 161 us | throughput: 854 TFLOPS, 1056 GB/s
> Performance (m=16384, n= 7168, k= 4096): 1013 us | throughput: 950 TFLOPS, 327 GB/s
> Performance (m= 7168, n= 4096, k= 4096): 261 us | throughput: 921 TFLOPS, 402 GB/s
> Performance (m= 2048, n= 7168, k= 4096): 134 us | throughput: 900 TFLOPS, 502 GB/s
> Performance (m= 7168, n= 2112, k= 8192): 246 us | throughput: 1009 TFLOPS, 432 GB/s
> Performance (m= 1536, n=24576, k= 8192): 597 us | throughput: 1036 TFLOPS, 485 GB/s
> Performance (m= 512, n=32768, k= 8192): 280 us | throughput: 982 TFLOPS, 1093 GB/s
> Performance (m=16384, n= 7168, k= 8192): 1895 us | throughput: 1015 TFLOPS, 226 GB/s
> Performance (m= 7168, n= 4096, k= 8192): 473 us | throughput: 1017 TFLOPS, 319 GB/s
> Performance (m= 2048, n= 7168, k= 8192): 239 us | throughput: 1008 TFLOPS, 439 GB/s
Testing grouped weight gradient GEMM:
> Performance (num_groups=4, m= 7168, n= 4096, avg_k= 4096): 1059 us | throughput: 908 TFLOPS, 396 GB/s
> Performance (num_groups=4, m= 2048, n= 7168, avg_k= 4096): 547 us | throughput: 880 TFLOPS, 491 GB/s
> Performance (num_groups=4, m= 7168, n= 4096, avg_k= 8192): 1935 us | throughput: 994 TFLOPS, 312 GB/s
> Performance (num_groups=4, m= 2048, n= 7168, avg_k= 8192): 974 us | throughput: 987 TFLOPS, 430 GB/s
> Performance (num_groups=8, m= 7168, n= 4096, avg_k= 4096): 2121 us | throughput: 907 TFLOPS, 395 GB/s
> Performance (num_groups=8, m= 2048, n= 7168, avg_k= 4096): 1089 us | throughput: 883 TFLOPS, 493 GB/s
After #112 (main) - PyTorch 2.8 nightly, cu129
$ python tests/test_core.py
Testing GEMM:
> Perf (m= 128, n= 2112, k= 7168, 1D2D, layout=NT, BF16, acc=0): launch 50 us | 13 us | 293 TFLOPS | 1256 GB/s
> Perf (m= 128, n=24576, k= 1536, 1D2D, layout=NT, BF16, acc=0): launch 51 us | 22 us | 441 TFLOPS | 2018 GB/s
> Perf (m= 128, n=32768, k= 512, 1D2D, layout=NT, BF16, acc=0): launch 56 us | 13 us | 338 TFLOPS | 1988 GB/s
> Perf (m= 128, n= 7168, k=16384, 1D2D, layout=NT, BF16, acc=0): launch 50 us | 57 us | 530 TFLOPS | 2142 GB/s
> Perf (m= 128, n= 4096, k= 7168, 1D2D, layout=NT, BF16, acc=0): launch 53 us | 20 us | 370 TFLOPS | 1544 GB/s
> Perf (m= 128, n= 7168, k= 2048, 1D2D, layout=NT, BF16, acc=0): launch 58 us | 11 us | 344 TFLOPS | 1538 GB/s
> Perf (m= 4096, n= 2112, k= 7168, 1D2D, layout=NT, BF16, acc=0): launch 56 us | 108 us | 1148 TFLOPS | 580 GB/s
> Perf (m= 4096, n=24576, k= 1536, 1D2D, layout=NT, BF16, acc=0): launch 51 us | 245 us | 1261 TFLOPS | 1002 GB/s
> Perf (m= 4096, n=32768, k= 512, 1D2D, layout=NT, BF16, acc=0): launch 53 us | 147 us | 938 TFLOPS | 1961 GB/s
> Perf (m= 4096, n= 7168, k=16384, 1D2D, layout=NT, BF16, acc=0): launch 48 us | 659 us | 1461 TFLOPS | 373 GB/s
> Perf (m= 4096, n= 4096, k= 7168, 1D2D, layout=NT, BF16, acc=0): launch 50 us | 169 us | 1424 TFLOPS | 552 GB/s
> Perf (m= 4096, n= 7168, k= 2048, 1D2D, layout=NT, BF16, acc=0): launch 50 us | 97 us | 1238 TFLOPS | 845 GB/s
Testing m-grouped contiguous GEMM:
> Perf (num_groups=4, m=35456, n= 4096, k= 7168, 1D2D, layout=NT): 1658 us | 1256 TFLOPS | 404 GB/s
> Perf (num_groups=4, m=36096, n= 7168, k= 2048, 1D2D, layout=NT): 892 us | 1189 TFLOPS | 732 GB/s
> Perf (num_groups=8, m=32384, n= 4096, k= 7168, 1D2D, layout=NT): 1554 us | 1224 TFLOPS | 476 GB/s
> Perf (num_groups=8, m=31232, n= 7168, k= 2048, 1D2D, layout=NT): 779 us | 1178 TFLOPS | 811 GB/s
Testing m-grouped masked GEMM:
> Perf (num_groups=1, expected_m_per_group=1024, n=4096, k=7168, 1D2D): 79 us | 778 TFLOPS | 578 GB/s
> Perf (num_groups=1, expected_m_per_group=1024, n=7168, k=2048, 1D2D): 45 us | 729 TFLOPS | 733 GB/s
> Perf (num_groups=2, expected_m_per_group= 512, n=4096, k=7168, 1D2D): 93 us | 735 TFLOPS | 830 GB/s
> Perf (num_groups=2, expected_m_per_group= 512, n=7168, k=2048, 1D2D): 44 us | 655 TFLOPS | 1034 GB/s
> Perf (num_groups=4, expected_m_per_group= 256, n=4096, k=7168, 1D2D): 95 us | 682 TFLOPS | 1413 GB/s
> Perf (num_groups=4, expected_m_per_group= 256, n=7168, k=2048, 1D2D): 47 us | 570 TFLOPS | 1578 GB/s
Testing k-grouped contiguous GEMM:
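For anyone who wants to diff the two runs shape by shape, a small parser along these lines works on the log format above (an illustrative sketch: the file names and regular expressions are mine and only cover the lines that spell out m=, n=, k=):

import re

# Map (m, n, k) -> kernel time in us from a tests/test_core.py log.
# Grouped/masked variants that use other shape fields are skipped.
SHAPE_RE = re.compile(r"m=\s*(\d+), n=\s*(\d+), k=\s*(\d+)")
TIME_RE = re.compile(r"(\d+)\s*us\s*\|\s*(?:throughput:)?\s*\d+\s*TFLOPS")

def parse_log(text: str) -> dict:
    times = {}
    for line in text.splitlines():
        shape, time = SHAPE_RE.search(line), TIME_RE.search(line)
        if shape and time:
            times[tuple(map(int, shape.groups()))] = int(time.group(1))
    return times

# before.log / after.log are hypothetical dumps of the two runs above.
before = parse_log(open("before.log").read())
after = parse_log(open("after.log").read())
for shape in sorted(set(before) & set(after)):
    b, a = before[shape], after[shape]
    print(f"{shape}: {b} us -> {a} us ({100.0 * (a - b) / b:+.0f}%)")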