
[Core] Support disaggregated prefill with Mooncake Transfer Engine #10884

Merged: 25 commits into vllm-project:main on Dec 15, 2024

Conversation

@ShangmingCai (Contributor) commented Dec 4, 2024

We really appreciate @KuntaiDu's remarkable work in supporting the disaggregated prefill feature in vLLM. Now that PR #10502 has been merged, we have rebased and moved the Mooncake integration from PR #10728 to this PR.

This PR is related to #10727 and is a continuation of PR #10502; it uses Mooncake's Transfer Engine for KVCache transfer instead of NCCL.

Mooncake is a KVCache-centric disaggregated architecture for LLM serving. Transfer Engine is the core component of Mooncake; see the documentation for its design and API list.

Compared with NCCL, Mooncake Transfer Engine has the following features:

  • a unified programming interface for DRAM-to-DRAM (local and remote), DRAM-to-GPU-VRAM (local and remote), and DRAM-to-remote-NVMe data transfers
  • support for TCP, RDMA, and NVMe-oF protocols
  • topology-aware path selection (see our English doc, transfer_engine.md), aggregating bandwidth from multiple NICs

Like the current implementation in PR #10502, there are two roles: KV provider (e.g., the prefill vLLM instance) and KV consumer (e.g., the decode vLLM instance).

  • The provider side implements insert: insert a KV cache into a buffer so that it can be transferred upon request.
  • The consumer side implements drop_select: select a KV cache based on tokens, transfer the selected KV, and drop it from the buffer.

The two roles run on different machines; a minimal sketch of this interface is shown below.
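For illustration only, here is a minimal sketch (not the actual vLLM code) of that two-sided interface; the tensor arguments and matching logic are simplified, and the class name is hypothetical:

```python
from typing import List, Optional
import torch

class KVLookupBuffer:
    """Hypothetical sketch of the provider/consumer buffer interface."""

    def __init__(self) -> None:
        self.buffer: List[dict] = []

    def insert(self, tokens: torch.Tensor, kv: torch.Tensor) -> None:
        # Provider side (prefill instance): stage a KV cache so it can be
        # transferred when a consumer asks for it.
        self.buffer.append({"tokens": tokens, "kv": kv})

    def drop_select(self, tokens: torch.Tensor) -> Optional[torch.Tensor]:
        # Consumer side (decode instance): find the KV cache matching
        # `tokens`, hand it back (i.e., transfer it), and drop it from
        # the buffer.
        for i, entry in enumerate(self.buffer):
            if torch.equal(entry["tokens"], tokens):
                return self.buffer.pop(i)["kv"]
        return None
```

In the real implementation the two sides live in different processes on different machines, and the transfer inside drop_select goes over the configured pipe (Mooncake or PyNccl) rather than a local pop.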

Integration guide: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm-integration-v0.2.md

Benchmark results: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm-benchmark-results-v0.2.md
More benchmark results will be added in the future.

Test files will be added to align with the future test CI pipeline for PR #10502.

CC list:
@KuntaiDu @youkaichao @alogfans @stmatengss @james0zan

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
github-actions bot commented Dec 4, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they run only the fastcheck CI, which covers a small, essential subset of tests to catch errors quickly. You can run further CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can do one of the following:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@KuntaiDu (Collaborator) commented Dec 6, 2024

I'm currently working on an OSDI submission; I will review after Dec 10.

@Jeffwan (Contributor) commented Dec 8, 2024

This is a great demonstration of adopting Mooncake in the current disaggregation implementation. Could you share some benchmark data and best practices here? Transfer Engine's primary features, such as broader protocol support and topology-aware path selection, would be beneficial in larger-scale clusters. I am just curious how Mooncake performs in the simple 1P1D case or in homogeneous environments.

@ShangmingCai (Contributor, Author) commented Dec 9, 2024

> This is a great demonstration of adopting Mooncake in the current disaggregation implementation. Could you share some benchmark data and best practices here? Transfer Engine's primary features, such as broader protocol support and topology-aware path selection, would be beneficial in larger-scale clusters. I am just curious how Mooncake performs in the simple 1P1D case or in homogeneous environments.

Here are some preview Mooncake benchmark results on A10 GPUs with up to 2 RDMA NICs. I am currently having trouble benchmarking PyNcclConnector: for reasons still unknown, it crashes frequently in inter-node disaggregated scenarios. I am digging into the lookup_buffer and connector to identify the root cause, but I haven't found it yet, so the benchmark results do not include PyNcclConnector.

Varying tp (input length = 1024, qps = 2, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tp = 1 | 2 | 200 | 99.47 | 201995 | 1200 | 2.01 | 12.06 | 2042.74 | 1056.76 | 635.00 | 4006.59 | 97.08 | 26.94 | 781.91 | 97.01 | 14.05 | 2205.51 |
| tp = 2 | 2 | 200 | 98.98 | 201995 | 1200 | 2.02 | 12.12 | 2052.95 | 314.87 | 231.20 | 949.40 | 25.65 | 15.56 | 129.60 | 25.62 | 15.48 | 288.06 |
| tp = 4 | 2 | 200 | 98.76 | 201995 | 1200 | 2.03 | 12.15 | 2057.44 | 198.10 | 160.03 | 461.61 | 23.52 | 18.93 | 94.38 | 23.50 | 18.01 | 187.79 |
| tp = 1 | 1 | 200 | 99.44 | 201995 | 1200 | 2.01 | 12.07 | 2043.39 | 1071.12 | 631.56 | 4361.02 | 83.93 | 26.93 | 794.75 | 83.86 | 14.13 | 1932.66 |
| tp = 2 | 1 | 200 | 98.96 | 201995 | 1200 | 2.02 | 12.13 | 2053.35 | 335.26 | 258.30 | 997.93 | 28.84 | 15.56 | 144.82 | 28.80 | 15.42 | 397.56 |
| tp = 4 | 1 | 200 | 98.78 | 201995 | 1200 | 2.02 | 12.15 | 2057.03 | 201.68 | 162.85 | 456.33 | 22.31 | 16.74 | 94.76 | 22.29 | 16.73 | 189.13 |
| tp = 1 | TCP | 200 | 99.55 | 201995 | 1200 | 2.01 | 12.05 | 2041.13 | 1414.05 | 766.23 | 6035.36 | 155.01 | 35.28 | 1191.24 | 154.91 | 14.32 | 3148.99 |
| tp = 2 | TCP | 200 | 98.97 | 201995 | 1200 | 2.02 | 12.12 | 2053.03 | 333.74 | 251.32 | 954.63 | 28.74 | 15.49 | 161.24 | 28.70 | 15.35 | 393.52 |
| tp = 4 | TCP | 200 | 98.78 | 201995 | 1200 | 2.02 | 12.15 | 2056.94 | 205.37 | 162.92 | 463.70 | 21.54 | 16.51 | 94.04 | 21.51 | 16.56 | 170.54 |

Varying qps (input length = 1024, tp = 4, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| qps = 2 | 2 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.33 | 200.64 | 156.62 | 478.22 | 22.63 | 17.35 | 99.61 | 22.60 | 17.08 | 186.25 |
| qps = 4 | 2 | 200 | 49.75 | 201995 | 1200 | 4.02 | 24.12 | 4084.03 | 341.88 | 240.68 | 1430.54 | 38.36 | 18.39 | 313.45 | 38.31 | 17.17 | 588.80 |
| qps = 6 | 2 | 200 | 33.44 | 201995 | 1200 | 5.98 | 35.88 | 6075.54 | 851.15 | 501.59 | 3239.89 | 102.51 | 47.67 | 606.77 | 102.34 | 18.35 | 1704.79 |
| qps = 8 | 2 | 200 | 27.16 | 201995 | 1200 | 7.36 | 44.19 | 7482.52 | 4835.08 | 5733.45 | 8846.27 | 1276.59 | 1150.11 | 4401.23 | 1274.43 | 48.34 | 20682.35 |
| qps = 2 | 1 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.31 | 201.77 | 161.53 | 473.44 | 22.13 | 16.52 | 96.18 | 22.11 | 16.51 | 190.40 |
| qps = 4 | 1 | 200 | 49.76 | 201995 | 1200 | 4.02 | 24.12 | 4083.83 | 337.31 | 243.38 | 1395.85 | 39.95 | 17.61 | 325.39 | 39.88 | 17.06 | 838.68 |
| qps = 6 | 1 | 200 | 33.44 | 201995 | 1200 | 5.98 | 35.88 | 6075.99 | 820.53 | 458.84 | 3169.52 | 83.92 | 30.50 | 663.07 | 83.78 | 17.85 | 1306.32 |
| qps = 8 | 1 | 200 | 27.19 | 201995 | 1200 | 7.36 | 44.14 | 7473.44 | 5291.91 | 6160.55 | 9596.56 | 1190.36 | 1040.63 | 4418.66 | 1188.33 | 47.61 | 20815.23 |
| qps = 2 | TCP | 200 | 98.76 | 201995 | 1200 | 2.03 | 12.15 | 2057.42 | 207.22 | 160.81 | 511.01 | 22.17 | 16.59 | 94.96 | 22.15 | 16.59 | 181.82 |
| qps = 4 | TCP | 200 | 49.79 | 201995 | 1200 | 4.02 | 24.10 | 4081.06 | 355.43 | 252.63 | 1554.91 | 40.15 | 16.92 | 314.28 | 40.09 | 16.66 | 708.50 |
| qps = 6 | TCP | 200 | 33.49 | 201995 | 1200 | 5.97 | 35.83 | 6067.71 | 907.74 | 514.85 | 3253.93 | 122.75 | 45.51 | 648.40 | 122.56 | 18.09 | 2282.92 |
| qps = 8 | TCP | 200 | 28.39 | 201995 | 1200 | 7.04 | 42.26 | 7156.09 | 6714.57 | 7885.09 | 11787.51 | 1116.06 | 408.32 | 4645.25 | 1114.29 | 46.87 | 21898.03 |

Varying input length (tp = 4, qps = 2, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1024 | 2 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.32 | 195.47 | 151.55 | 482.84 | 22.83 | 19.27 | 96.55 | 22.81 | 18.12 | 158.16 |
| 2048 | 2 | 200 | 99.22 | 406707 | 1200 | 2.02 | 12.09 | 4110.95 | 723.76 | 488.67 | 2941.96 | 67.25 | 18.93 | 632.73 | 67.20 | 17.49 | 1209.54 |
| 4096 | 2 | 200 | 117.42 | 818415 | 1200 | 1.70 | 10.22 | 6979.90 | 14616.48 | 18323.82 | 23191.04 | 8042.84 | 7593.16 | 19851.11 | 8040.02 | 65.43 | 93511.26 |
| 8192 | 2 | 200 | 247.77 | 1636065 | 1200 | 0.81 | 4.84 | 6608.10 | 75783.36 | 79331.60 | 147544.42 | 16961.27 | 15140.11 | 39278.98 | 16958.32 | 90.01 | 186151.61 |
| 1024 | 1 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.31 | 201.77 | 161.53 | 473.44 | 22.13 | 16.52 | 96.18 | 22.11 | 16.51 | 190.40 |
| 2048 | 1 | 200 | 99.25 | 406707 | 1200 | 2.02 | 12.09 | 4109.96 | 719.43 | 482.02 | 3208.13 | 61.92 | 17.64 | 681.26 | 61.86 | 16.83 | 978.90 |
| 4096 | 1 | 200 | 111.88 | 818415 | 1200 | 1.79 | 10.73 | 7326.16 | 20362.10 | 22807.05 | 31853.55 | 5915.16 | 4521.51 | 18739.12 | 5913.18 | 67.03 | 81600.29 |
| 8192 | 1 | 200 | 270.01 | 1636065 | 1200 | 0.74 | 4.44 | 6063.79 | 103355.40 | 106546.65 | 172025.11 | 12894.35 | 11027.66 | 35110.13 | 12892.85 | 64.84 | 151774.68 |
| 1024 | TCP | 200 | 98.81 | 201995 | 1200 | 2.02 | 12.14 | 2056.44 | 203.32 | 160.83 | 460.90 | 21.81 | 16.96 | 95.27 | 21.78 | 16.91 | 171.80 |
| 2048 | TCP | 200 | 99.27 | 406707 | 1200 | 2.01 | 12.09 | 4108.98 | 731.60 | 484.78 | 3213.69 | 68.55 | 17.88 | 639.93 | 68.49 | 17.33 | 1257.45 |
| 4096 | TCP | 200 | 118.37 | 818415 | 1200 | 1.69 | 10.14 | 6923.89 | 23735.69 | 27101.97 | 36573.47 | 6386.62 | 5102.00 | 20032.26 | 6384.71 | 69.57 | 92811.27 |
| 8192 | TCP | 200 | 278.12 | 1636065 | 1200 | 0.72 | 4.31 | 5886.95 | 106873.23 | 109941.33 | 179781.64 | 13360.87 | 12155.24 | 36022.96 | 13359.20 | 68.01 | 156716.38 |

As for best practices, I believe there is no established best practice before XpYd support is ready. But if you want to test the Mooncake Transfer Engine, you can follow the guidance doc to reproduce the results.

In addition, we are coordinating resources to provision machines with more RDMA NICs and more advanced GPUs. Official benchmark results will be released in due time.

@junna2016 commented:
Using the latest Mooncake code, I tested with tp=1, num_rdma_nic=2, qps=2, input_len=200, output_len=100 on a single machine, with one prefill instance and one decode instance.
My mooncake_config.json is shown below:

```json
{
    "prefill_url": "127.0.0.1:8144",
    "decode_url": "127.0.0.1:8149",
    "metadata_server": "127.0.0.1:2333",
    "metadata_backend": "etcd",
    "protocol": "rdma",
    "device_name": "mlx5_0,mlx5_1"
}
```

The following error occurs in the transfer engine:

```
E1213 02:57:10.528410 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:14.286239 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
E1213 02:57:18.044381 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:21.802461 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
```

With a single RDMA device (mlx5_0 or mlx5_1), it works fine.

@ShangmingCai (Contributor, Author)

@junna2016 I think these errors are reported by the underlying Transfer Engine. Please open an issue in the Mooncake repo and we will get someone to help identify the root cause.

@junna2016

> @junna2016 I think these errors are reported by the underlying Transfer Engine. Please open an issue in the Mooncake repo and we will get someone to help identify the root cause.

kvcache-ai/Mooncake#35

@KuntaiDu (Collaborator) left a comment

Clean implementation! I left some comments, PTAL.

```python
if self.transport_thread is None:
    self.transport_thread = ThreadPoolExecutor(max_workers=1)
tensor = self.transport_thread.submit(self._recv_impl).result()
if tensor.numel() == 1 and tensor.item() == NONE_INT:
```
@KuntaiDu (Collaborator)

This part is a bit tricky: tensor.item() can return the wrong value when I use vLLM's PyNccl to transmit a tensor, even though print(tensor) shows the right value. You don't need to change any code here if you don't see any bug when stress-testing this code (I have stress-test code in tests/kv_transfer/test_send_recv.sh; you can try that).
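For illustration, one defensive variant of that sentinel check is to copy the scalar to the CPU before reading it, which forces a device synchronization. This is only a sketch, assuming a sentinel integer marks "None" tensors (the placeholder value below is not the real one), not a required change:

```python
import torch

NONE_INT = -1  # placeholder; the real pipes define their own sentinel value

def is_none_tensor(tensor: torch.Tensor) -> bool:
    """Return True if `tensor` is the single-element 'None' sentinel."""
    if tensor.numel() != 1:
        return False
    # .cpu() copies the scalar to host memory and synchronizes with the
    # device, so the comparison never reads a value that is still in flight.
    return tensor.detach().cpu().item() == NONE_INT
```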

@ShangmingCai (Contributor, Author)

For now, MooncakePipe has not encountered similar problems during stress testing and benchmarking. However, we will keep this part in mind and make corresponding modifications and tests as PyNcclPipe changes.

@KuntaiDu (Collaborator) left a comment

LGTM! Thank you for your contribution!

@KuntaiDu KuntaiDu enabled auto-merge (squash) December 15, 2024 20:00
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 15, 2024
@KuntaiDu KuntaiDu merged commit d263bd9 into vllm-project:main Dec 15, 2024
69 checks passed
@liweiqing1997

Hello, could you please provide the PCIe bus bandwidth of the benchmark machine? This will help us make a direct comparison. Currently, the machine I am using is configured with 8x A100-SXM4-80GB GPUs, and the bus bandwidth for each slot is 16GT/s, width x16.

@stmatengss

@liweiqing1997 Hello. Here we use 8x GA102GL [A10] GPUs; the PCIe PHY link speed is 16GT/s (x16).

@liweiqing1997

> @liweiqing1997 Hello. Here we use 8x GA102GL [A10] GPUs; the PCIe PHY link speed is 16GT/s (x16).

Hello, may I ask if you're using GPUDirect RDMA? I want to know what the transfer path for the KV cache is: is it VRAM->DRAM->DRAM->VRAM, or VRAM->VRAM?

@ShangmingCai (Contributor, Author)

> @liweiqing1997 Hello. Here we use 8x GA102GL [A10] GPUs; the PCIe PHY link speed is 16GT/s (x16).

> Hello, may I ask if you're using GPUDirect RDMA? I want to know what the transfer path for the KV cache is: is it VRAM->DRAM->DRAM->VRAM, or VRAM->VRAM?

@liweiqing1997 Currently, it is VRAM->DRAM->DRAM->VRAM. Mooncake supports GPUDirect, though. We still have some work to do to implement zero-copy; it is on our roadmap.
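To make that path concrete, here is a rough sketch with hypothetical pipe methods (transfer_sync and recv_bytes are placeholders, not the real Mooncake API): the sender stages VRAM to DRAM, Transfer Engine moves DRAM to DRAM over RDMA or TCP, and the receiver moves DRAM back to VRAM. Zero-copy/GPUDirect would collapse the staging steps.

```python
import pickle
import torch

def send_kv(kv_gpu: torch.Tensor, pipe) -> None:
    staged = kv_gpu.detach().cpu()   # VRAM -> DRAM on the prefill side
    payload = pickle.dumps(staged)   # serialize into the registered DRAM buffer
    pipe.transfer_sync(payload)      # DRAM -> DRAM via Transfer Engine

def recv_kv(pipe, device: str = "cuda") -> torch.Tensor:
    payload = pipe.recv_bytes()      # lands in the DRAM buffer on the decode side
    staged = pickle.loads(payload)   # tensor deserialized in DRAM
    return staged.to(device)         # DRAM -> VRAM
```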

@liuyumoye

Hello, I am trying MooncakeConnector in vLLM with {"protocol": "tcp"} and {"kv_buffer_device": "cpu"}. When the prefill node needs to send a tensor, I move the tensor from the GPU to the CPU. But when the input token count of a request is greater than 800, there is an error on the decode node:
```
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/worker/model_runner.py", line 1668, in execute_model
ERROR 12-26 00:15:20 engine.py:135]     get_kv_transfer_group().recv_kv_caches_and_hidden_states(
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 74, in recv_kv_caches_and_hidden_states
ERROR 12-26 00:15:20 engine.py:135]     return self.connector.recv_kv_caches_and_hidden_states(
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 236, in recv_kv_caches_and_hidden_states
ERROR 12-26 00:15:20 engine.py:135]     ret = self.select(current_tokens,
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 137, in select
ERROR 12-26 00:15:20 engine.py:135]     return self.consumer_buffer.drop_select(input_tokens, roi)
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py", line 207, in drop_select
ERROR 12-26 00:15:20 engine.py:135]     key = self.data_pipe.recv_tensor().to(ori_input_device)
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 268, in recv_tensor
ERROR 12-26 00:15:20 engine.py:135]     tensor = self.transport_thread.submit(self._recv_impl).result()
ERROR 12-26 00:15:20 engine.py:135]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
ERROR 12-26 00:15:20 engine.py:135]     return self.__get_result()
ERROR 12-26 00:15:20 engine.py:135]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
ERROR 12-26 00:15:20 engine.py:135]     raise self._exception
ERROR 12-26 00:15:20 engine.py:135]   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR 12-26 00:15:20 engine.py:135]     result = self.fn(*self.args, **self.kwargs)
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 254, in _recv_impl
ERROR 12-26 00:15:20 engine.py:135]     return pickle.loads(data)
ERROR 12-26 00:15:20 engine.py:135] _pickle.UnpicklingError: invalid load key, '\xb9'.
```
Do you happen to know what the reason might be?

It seems the error occurs where the K cache is received. I compared the received source pointer and data length, and they match what the prefill node transmits. I don't know whether the K-cache payload transmitted over TCP is simply too large while my network is relatively slow, resulting in lost data.

@ShangmingCai (Contributor, Author)

> Hello, I am trying MooncakeConnector in vLLM with {"protocol": "tcp"} and {"kv_buffer_device": "cpu"}. When the prefill node needs to send a tensor, I move the tensor from the GPU to the CPU. But when the input token count of a request is greater than 800, there is an error on the decode node: [...]

@liuyumoye You should not set kv_buffer_device to "cpu"; currently kv_buffer_device only supports "cuda".

@liuyumoye

@ShangmingCai Sorry, I don't understand why kv_buffer_device only supports "cuda".
I noticed that both the send and receive paths for the KV cache call mva.mooncake_vllm_adaptor.allocateManagedBuffer to allocate space on the CPU, and then write the serialized KV cache (via pickle.dumps) into that buffer, so the transmission chain is "tensor in GPU" -> "CPU buffer" -> "CPU buffer" -> "tensor in GPU". If the initial tensor is on the CPU, there should be no impact on the transfer chain. Am I understanding this correctly?

By the way, I am also curious: if the initial tensor is on the GPU, where are the VRAM->DRAM and DRAM->VRAM steps of the transfer chain implemented?

@ShangmingCai (Contributor, Author)

> @ShangmingCai Sorry, I don't understand why kv_buffer_device only supports "cuda". I noticed that both the send and receive paths for the KV cache call mva.mooncake_vllm_adaptor.allocateManagedBuffer to allocate space on the CPU, and then write the serialized KV cache (via pickle.dumps) into that buffer, so the transmission chain is "tensor in GPU" -> "CPU buffer" -> "CPU buffer" -> "tensor in GPU".

That is correct. But the kv_buffer_device parameter configures the buffer location for vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py, not for Mooncake. The lookup buffer only supports "cuda" now, so the tensor eventually needs to be moved to VRAM for the time being.

See the code: https://github.com/vllm-project/vllm/blob/81b979f2a8f7ec91c262dac7dcbf30ed577ebafd/vllm/config.py#L2511C1-L2513C45

```python
# The device used by kv connector to buffer the KV cache.
# Currently only support 'cuda'.
kv_buffer_device: Optional[str] = "cuda"
```

> By the way, I am also curious: if the initial tensor is on the GPU, where are the VRAM->DRAM and DRAM->VRAM steps of the transfer chain implemented?

You can check the implementations of simple_buffer and simple_connector. Mooncake only handles the data transfer; the other logic has not been touched yet. A condensed sketch of the consumer-side receive path follows.
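For orientation, the traceback earlier in this thread already shows where one of the device moves happens: the simple buffer receives a tensor via the pipe and then moves it to the caller's device. The snippet below is a condensed, approximate sketch of that flow (simplified from the real simple_buffer; the send-side handshake on input_tokens and roi is omitted):

```python
import torch

def drop_select(buffer, input_tokens: torch.Tensor, roi: torch.Tensor):
    # Tensors arrive from the peer via the configured pipe (Mooncake or
    # PyNccl); `.to(ori_input_device)` is the hop onto the device the
    # caller expects (e.g., DRAM -> VRAM on the decode side).
    ori_input_device = input_tokens.device
    key = buffer.data_pipe.recv_tensor().to(ori_input_device)
    value = buffer.data_pipe.recv_tensor().to(ori_input_device)
    hidden = buffer.data_pipe.recv_tensor().to(ori_input_device)
    return [key, value, hidden]
```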

@liuyumoye commented Dec 27, 2024

@ShangmingCai

> That is correct. But the kv_buffer_device parameter configures the buffer location for vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py, not for Mooncake. The lookup buffer only supports "cuda" now, so the tensor eventually needs to be moved to VRAM for the time being.

I think that code comment exists because vLLM uses PyNccl for transmitting the KV cache, but that is not the case for Mooncake.

> You can check the implementations of simple_buffer and simple_connector. Mooncake only handles the data transfer; the other logic has not been touched yet.

There is no transfer from VRAM to DRAM in the simple_buffer and simple_connector implementations. So doesn't Mooncake itself copy from VRAM to DRAM when it transfers VRAM data?

@ShangmingCai (Contributor, Author)

> I think that code comment exists because vLLM uses PyNccl for transmitting the KV cache, but that is not the case for Mooncake.

Nope. Both PyNccl and Mooncake need to send data to the kv_buffer in VRAM. That's why we need to set kv_buffer_size: this buffer is separate from the paged KV cache.
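For intuition on how large that buffer can get, here is a back-of-envelope estimate; the model-shape numbers are illustrative assumptions, not measurements from this PR:

```python
# Per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes.
def kv_bytes_per_token(layers: int = 32, kv_heads: int = 8,
                       head_dim: int = 128, dtype_bytes: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes

tokens_in_flight = 8 * 1024  # e.g., a few long prefills buffered at once
buffer_bytes = kv_bytes_per_token() * tokens_in_flight
print(f"{buffer_bytes / 1e9:.2f} GB")  # ~1.07 GB for these example numbers
```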

@ShangmingCai (Contributor, Author)

> There is no transfer from VRAM to DRAM in the simple_buffer and simple_connector implementations. So doesn't Mooncake itself copy from VRAM to DRAM when it transfers VRAM data?

BTW, tensor objects carry device info. If you are not sure where the transfers happen, printing the device of each tensor might help.
