
benchmark decoding attention kernel with cudnn #2467

Merged
merged 3 commits into sgl-project:main on Dec 17, 2024

Conversation

bjmsong
Copy link
Contributor

@bjmsong bjmsong commented Dec 12, 2024

Motivation

Following this PR, add cuDNN to the decoding attention kernel benchmark.
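
For context, the decoding (single-query-token) attention step being benchmarked can be sketched as a NumPy reference; the function name and tensor layouts below are illustrative, not the benchmark's actual API:

```python
import numpy as np

def decode_attention_ref(q, k_cache, v_cache):
    """Reference decoding attention for one decode step.

    q:        [batch, head_num, head_dim]      (one query token per sequence)
    k_cache:  [batch, kv_len, head_num, head_dim]
    v_cache:  [batch, kv_len, head_num, head_dim]
    returns:  [batch, head_num, head_dim]
    """
    head_dim = q.shape[-1]
    # attention scores over the KV cache: [batch, head_num, kv_len]
    scores = np.einsum("bhd,bkhd->bhk", q, k_cache) / np.sqrt(head_dim)
    # numerically stable softmax along the kv_len axis
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    # weighted sum of values: [batch, head_num, head_dim]
    return np.einsum("bhk,bkhd->bhd", probs, v_cache)

out = decode_attention_ref(
    np.random.randn(4, 32, 128),
    np.random.randn(4, 64, 32, 128),
    np.random.randn(4, 64, 32, 128),
)
print(out.shape)  # (4, 32, 128)
```

The cuDNN, Triton, and FlashInfer kernels all compute this same operation; only the implementation differs.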

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@merrymercy
Copy link
Contributor

Can you share any results?

@bjmsong
Copy link
Contributor Author

bjmsong commented Dec 13, 2024

| head_num | batch_size | kv_len | SGLang[triton] (µs) | FlashInfer (µs) | cuDNN (µs) |
|---|---|---|---|---|---|
| 32 | 1 | 64 | 265.22 | 270.34 | 430.08 |
| 32 | 1 | 128 | 265.73 | 271.36 | 429.57 |
| 32 | 1 | 256 | 275.46 | 285.70 | 440.32 |
| 32 | 1 | 512 | 297.98 | 299.01 | 449.54 |
| 32 | 1 | 1024 | 316.42 | 289.79 | 496.64 |
| 32 | 1 | 2048 | 346.11 | 293.89 | 522.24 |
| 32 | 1 | 4096 | 435.20 | 326.66 | 676.86 |
| 32 | 4 | 64 | 258.05 | 270.34 | 429.06 |
| 32 | 4 | 128 | 271.87 | 270.34 | 432.13 |
| 32 | 4 | 256 | 287.74 | 288.77 | 439.30 |
| 32 | 4 | 512 | 359.42 | 295.42 | 452.61 |
| 32 | 4 | 1024 | 428.03 | 330.75 | 547.84 |
| 32 | 4 | 2048 | 608.26 | 387.07 | 943.10 |
| 32 | 4 | 4096 | 1005.57 | 505.34 | 1775.62 |
| 32 | 16 | 64 | 288.77 | 270.34 | 442.37 |
| 32 | 16 | 128 | 331.78 | 279.04 | 447.49 |
| 32 | 16 | 256 | 420.86 | 314.37 | 565.25 |
| 32 | 16 | 512 | 610.30 | 374.78 | 949.25 |
| 32 | 16 | 1024 | 1003.52 | 492.54 | 1730.56 |
| 32 | 16 | 2048 | 1823.23 | 726.02 | 3349.50 |
| 32 | 16 | 4096 | 3307.52 | 1199.10 | 6626.30 |
| 32 | 64 | 64 | 579.07 | 311.30 | 563.20 |
| 32 | 64 | 128 | 730.11 | 367.62 | 1056.77 |
| 32 | 64 | 256 | 1064.96 | 493.57 | 1803.78 |
| 32 | 64 | 512 | 1806.34 | 734.21 | 3406.85 |
| 32 | 64 | 1024 | 3356.67 | 1218.56 | 6606.85 |
| 32 | 64 | 2048 | 6454.27 | 2183.17 | 13052.93 |
| 32 | 64 | 4096 | 12635.14 | 3829.76 | 26053.63 |
| 64 | 1 | 64 | 253.95 | 264.19 | 423.94 |
| 64 | 1 | 128 | 253.95 | 262.66 | 417.79 |
| 64 | 1 | 256 | 284.67 | 281.60 | 429.06 |
| 64 | 1 | 512 | 302.08 | 282.62 | 453.63 |
| 64 | 1 | 1024 | 343.55 | 288.77 | 475.14 |
| 64 | 1 | 2048 | 404.48 | 317.44 | 568.32 |
| 64 | 1 | 4096 | 592.90 | 366.59 | 979.97 |
| 64 | 4 | 64 | 263.68 | 264.19 | 430.08 |
| 64 | 4 | 128 | 285.70 | 265.22 | 429.06 |
| 64 | 4 | 256 | 321.54 | 276.48 | 459.78 |
| 64 | 4 | 512 | 477.18 | 303.10 | 555.01 |
| 64 | 4 | 1024 | 602.11 | 351.23 | 931.84 |
| 64 | 4 | 2048 | 992.26 | 455.68 | 1730.56 |
| 64 | 4 | 4096 | 1806.34 | 666.11 | 3369.47 |
| 64 | 16 | 64 | 380.93 | 277.50 | 464.90 |
| 64 | 16 | 128 | 459.78 | 305.15 | 592.90 |
| 64 | 16 | 256 | 624.64 | 351.23 | 980.99 |
| 64 | 16 | 512 | 989.18 | 460.80 | 1762.82 |
| 64 | 16 | 1024 | 1756.16 | 670.72 | 3357.70 |
| 64 | 16 | 2048 | 3312.64 | 1099.78 | 6560.77 |
| 64 | 16 | 4096 | 6398.98 | 1992.70 | 12994.56 |
| 64 | 64 | 64 | 960.51 | 369.66 | 973.82 |
| 64 | 64 | 128 | 1247.23 | 489.47 | 1938.43 |
| 64 | 64 | 256 | 1910.78 | 679.94 | 3522.05 |
| 64 | 64 | 512 | 3371.01 | 1101.31 | 6738.43 |
| 64 | 64 | 1024 | 6470.66 | 2007.04 | 13148.16 |
| 64 | 64 | 2048 | 12626.94 | 3844.10 | 25998.34 |
| 64 | 64 | 4096 | 24993.79 | 6835.71 | 47107.07 |

@zhyncs
Copy link
Member

zhyncs commented Dec 13, 2024

The hopper optimization will be released soon on FlashInfer's latest main branch. cc @yzh119

@zhyncs
Copy link
Member

zhyncs commented Dec 13, 2024

@bjmsong Nice work!!!

@zhyncs zhyncs self-assigned this Dec 13, 2024
@zhyncs zhyncs requested review from yzh119 and zhyncs December 13, 2024 08:40
@zhyncs
Copy link
Member

zhyncs commented Dec 13, 2024

@bjmsong Is this data execution time, meaning the lower, the better?

@bjmsong
Copy link
Contributor Author

bjmsong commented Dec 13, 2024

> @bjmsong Is this data execution time, meaning the lower, the better?

Yeah, it's in microseconds; the lower, the better.

@zhyncs
Copy link
Member

zhyncs commented Dec 13, 2024

I'm just a bit confused about why cuDNN is even worse than Triton.

@bjmsong
Copy link
Contributor Author

bjmsong commented Dec 15, 2024

I will optimize the performance later.

@bjmsong
Copy link
Contributor Author

bjmsong commented Dec 16, 2024

  • Update: measure only the kernel execution time, excluding the data-processing part.
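
A minimal host-side sketch of that measurement split (pure Python, with `time.perf_counter` standing in for CUDA events plus a device synchronize; `make_inputs` and `kernel` are placeholders, not the benchmark's actual functions):

```python
import time

def bench_kernel_only(make_inputs, kernel, iters=100):
    """Time only `kernel(*args)`, excluding input preparation."""
    args = make_inputs()   # data processing: done once, outside the timed region
    kernel(*args)          # one untimed warmup call so lazy setup is not measured
    start = time.perf_counter()
    for _ in range(iters):
        kernel(*args)
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e6   # mean latency in microseconds

us = bench_kernel_only(lambda: ([float(i) for i in range(1024)],),
                       lambda xs: sum(xs))
print(f"{us:.2f} us per call")
```

On a GPU, the timed region would be bracketed by CUDA events (or a `torch.cuda.synchronize()` before reading the clock) so that only device execution is counted.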

| head_num | batch_size | kv_len | cuDNN (µs) | SGLang[triton] (µs) | FlashInfer (µs) |
|---|---|---|---|---|---|
| 32 | 1 | 64 | 37.88 | 84.84 | 28.75 |
| 32 | 1 | 128 | 35.45 | 89.70 | 28.25 |
| 32 | 1 | 256 | 39.78 | 79.64 | 33.11 |
| 32 | 1 | 512 | 45.10 | 80.74 | 33.07 |
| 32 | 1 | 1024 | 55.44 | 82.06 | 34.19 |
| 32 | 1 | 2048 | 71.03 | 84.60 | 43.19 |
| 32 | 1 | 4096 | 103.06 | 134.33 | 77.41 |
| 32 | 4 | 64 | 34.28 | 84.74 | 28.31 |
| 32 | 4 | 128 | 38.45 | 88.59 | 29.74 |
| 32 | 4 | 256 | 52.45 | 90.46 | 33.21 |
| 32 | 4 | 512 | 70.65 | 83.49 | 43.79 |
| 32 | 4 | 1024 | 104.42 | 105.05 | 76.36 |
| 32 | 4 | 2048 | 166.11 | 172.79 | 139.27 |
| 32 | 4 | 4096 | 287.11 | 311.79 | 256.56 |
| 32 | 16 | 64 | 56.99 | 127.50 | 29.26 |
| 32 | 16 | 128 | 78.53 | 131.28 | 39.81 |
| 32 | 16 | 256 | 116.72 | 140.26 | 73.93 |
| 32 | 16 | 512 | 178.32 | 173.57 | 131.03 |
| 32 | 16 | 1024 | 305.11 | 294.56 | 237.77 |
| 32 | 16 | 2048 | 556.01 | 533.53 | 483.58 |
| 32 | 16 | 4096 | 1061.43 | 1020.02 | 954.83 |
| 32 | 64 | 64 | 165.24 | 445.24 | 75.97 |
| 32 | 64 | 128 | 229.19 | 452.60 | 119.45 |
| 32 | 64 | 256 | 366.27 | 478.52 | 223.89 |
| 32 | 64 | 512 | 618.63 | 567.01 | 461.61 |
| 32 | 64 | 1024 | 1132.89 | 1023.42 | 934.39 |
| 32 | 64 | 2048 | 2143.48 | 1957.95 | 1914.50 |
| 32 | 64 | 4096 | 3379.35 | 3144.74 | 2898.98 |
| 64 | 1 | 64 | 35.13 | 85.39 | 29.26 |
| 64 | 1 | 128 | 36.26 | 78.99 | 28.35 |
| 64 | 1 | 256 | 40.90 | 80.19 | 34.23 |
| 64 | 1 | 512 | 53.57 | 79.05 | 33.35 |
| 64 | 1 | 1024 | 68.45 | 80.47 | 44.00 |
| 64 | 1 | 2048 | 101.84 | 99.62 | 76.40 |
| 64 | 1 | 4096 | 160.52 | 167.03 | 130.98 |
| 64 | 4 | 64 | 43.80 | 81.62 | 28.39 |
| 64 | 4 | 128 | 52.91 | 82.30 | 29.14 |
| 64 | 4 | 256 | 69.67 | 86.65 | 35.52 |
| 64 | 4 | 512 | 102.96 | 105.07 | 69.38 |
| 64 | 4 | 1024 | 170.00 | 169.51 | 125.83 |
| 64 | 4 | 2048 | 294.75 | 300.52 | 234.91 |
| 64 | 4 | 4096 | 533.40 | 565.29 | 460.78 |
| 64 | 16 | 64 | 86.15 | 232.47 | 45.80 |
| 64 | 16 | 128 | 124.32 | 237.48 | 73.89 |
| 64 | 16 | 256 | 193.64 | 251.59 | 127.38 |
| 64 | 16 | 512 | 320.78 | 305.25 | 236.49 |
| 64 | 16 | 1024 | 575.09 | 531.26 | 484.83 |
| 64 | 16 | 2048 | 1077.27 | 994.54 | 960.10 |
| 64 | 16 | 4096 | 2073.99 | 1915.31 | 1941.34 |
| 64 | 64 | 64 | 281.46 | 867.22 | 133.79 |
| 64 | 64 | 128 | 413.01 | 877.68 | 224.82 |
| 64 | 64 | 256 | 679.76 | 919.92 | 453.97 |
| 64 | 64 | 512 | 1188.34 | 1082.00 | 930.13 |
| 64 | 64 | 1024 | 2208.71 | 1989.55 | 1891.40 |
| 64 | 64 | 2048 | 3383.27 | 3234.63 | 2957.94 |
| 64 | 64 | 4096 | 6510.72 | 6140.89 | 5833.15 |

@ispobock
Copy link
Collaborator

ispobock commented Dec 16, 2024

  1. Did you consider warming up these kernels before the benchmark? Since the Triton kernel is JIT-compiled, compilation overhead may be included.
  2. It seems you only tested MHA here, while many popular models use GQA. Could you also include GQA in the benchmark?

@bjmsong
Copy link
Contributor Author

bjmsong commented Dec 17, 2024

Update: added warmup & GQA.

Here are some results:
| batch_size | kv_len | cuDNN (µs) | Triton (µs) | FlashInfer (µs) |
|---|---|---|---|---|
| 1 | 64 | 31.67 | 78.47 | 27.47 |
| 1 | 128 | 33.89 | 76.19 | 27.52 |
| 1 | 256 | 39.62 | 76.14 | 31.54 |
| 1 | 512 | 44.29 | 78.13 | 31.89 |
| 1 | 1024 | 54.87 | 82.72 | 34.28 |
| 1 | 2048 | 70.26 | 84.93 | 42.92 |
| 1 | 4096 | 102.61 | 133.10 | 76.31 |
| 4 | 64 | 32.80 | 79.01 | 32.76 |
| 4 | 128 | 37.38 | 80.16 | 28.25 |
| 4 | 256 | 53.32 | 77.67 | 32.96 |
| 4 | 512 | 70.30 | 81.66 | 43.75 |
| 4 | 1024 | 103.56 | 104.41 | 79.21 |
| 4 | 2048 | 164.40 | 172.06 | 144.71 |
| 4 | 4096 | 285.75 | 314.03 | 251.61 |
| 16 | 64 | 55.95 | 125.68 | 29.11 |
| 16 | 128 | 78.14 | 131.68 | 39.18 |
| 16 | 256 | 113.02 | 140.26 | 76.06 |
| 16 | 512 | 177.33 | 170.94 | 131.64 |
| 16 | 1024 | 305.33 | 294.01 | 243.98 |
| 16 | 2048 | 557.51 | 534.84 | 471.28 |
| 16 | 4096 | 1055.98 | 1017.17 | 976.81 |
| 64 | 64 | 165.15 | 444.49 | 75.20 |
| 64 | 128 | 231.59 | 451.85 | 122.77 |
| 64 | 256 | 362.76 | 478.56 | 225.86 |
| 64 | 512 | 620.90 | 564.61 | 462.78 |
| 64 | 1024 | 1126.41 | 1025.39 | 944.30 |
| 64 | 2048 | 2146.21 | 1953.23 | 1918.42 |
| 64 | 4096 | 3307.69 | 3091.87 | 2898.67 |

| cuDNN (µs) | Triton (µs) | FlashInfer (µs) |
|---|---|---|
| 31.86 | 84.42 | 31.54 |
| 34.55 | 79.58 | 31.16 |
| 36.42 | 82.85 | 37.53 |
| 40.71 | 81.03 | 36.46 |
| 45.70 | 87.36 | 36.70 |
| 51.72 | 80.57 | 39.14 |
| 61.44 | 82.22 | 43.16 |
| 34.12 | 87.26 | 32.73 |
| 34.91 | 81.41 | 34.45 |
| 40.57 | 84.48 | 38.14 |
| 45.47 | 78.70 | 38.05 |
| 56.83 | 82.04 | 47.40 |
| 73.38 | 83.88 | 56.81 |
| 107.25 | 117.13 | 89.00 |
| 37.31 | 84.59 | 32.47 |
| 39.06 | 81.56 | 32.97 |
| 55.57 | 80.72 | 33.84 |
| 73.59 | 87.54 | 49.32 |
| 108.06 | 108.73 | 82.31 |
| 171.88 | 179.98 | 136.74 |
| 296.34 | 320.65 | 250.79 |
| 70.56 | 136.65 | 54.47 |
| 91.39 | 141.97 | 62.13 |
| 128.15 | 150.03 | 89.96 |
| 195.93 | 216.71 | 140.90 |
| 322.75 | 344.60 | 247.24 |
| 581.26 | 594.42 | 472.30 |
| 1103.89 | 1083.44 | 884.71 |
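
For reference, GQA differs from MHA only in that a group of query heads shares one KV head; a NumPy sketch with illustrative head counts (not the benchmark's actual configuration):

```python
import numpy as np

def gqa_decode_ref(q, k_cache, v_cache):
    """GQA decoding step: q has more heads than the KV cache.

    q:        [batch, q_heads, head_dim]
    k_cache:  [batch, kv_len, kv_heads, head_dim]   (q_heads % kv_heads == 0)
    v_cache:  [batch, kv_len, kv_heads, head_dim]
    """
    q_heads = q.shape[1]
    kv_heads = k_cache.shape[2]
    group = q_heads // kv_heads
    # expand each KV head to serve its group of query heads
    k = np.repeat(k_cache, group, axis=2)
    v = np.repeat(v_cache, group, axis=2)
    scores = np.einsum("bhd,bkhd->bhk", q, k) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.einsum("bhk,bkhd->bhd", probs, v)

# e.g. 32 query heads sharing 8 KV heads (group size 4)
out = gqa_decode_ref(np.random.randn(2, 32, 128),
                     np.random.randn(2, 64, 8, 128),
                     np.random.randn(2, 64, 8, 128))
print(out.shape)  # (2, 32, 128)
```

Optimized kernels avoid materializing the repeated KV heads; the `np.repeat` here is only for clarity of the reference semantics.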

@merrymercy merrymercy merged commit e210266 into sgl-project:main Dec 17, 2024
1 check passed
5 participants