
[Core] Support disaggregated prefill with Mooncake Transfer Engine #10884

Merged: 25 commits into vllm-project:main on Dec 15, 2024

Conversation

@ShangmingCai (Contributor) commented Dec 4, 2024

We really appreciate @KuntaiDu's remarkable work in supporting the disaggregated prefill feature in vLLM. Now that PR #10502 has been merged, we have rebased and moved the Mooncake integration from PR #10728 to this PR.

This PR is related to #10727 and is a continuation of PR #10502; it uses Mooncake's Transfer Engine for KVCache transfer instead of NCCL.

Mooncake is a KVCache-centric disaggregated architecture for LLM serving. Transfer Engine is the core component of Mooncake; see the documentation for its design and API list.

Compared with NCCL, Mooncake Transfer Engine has the following features:

  • a unified programming interface for DRAM-to-DRAM (local and remote), DRAM-to-GPU-VRAM (local and remote), and DRAM-to-remote-NVMe data transfers
  • support for TCP, RDMA, and NVMe-oF protocols
  • topology-aware path selection (see our English doc, transfer_engine.md), aggregating bandwidth from multiple NICs

Like the current implementation in PR #10502, there are two roles: KV provider (e.g., the prefill vLLM instance) and KV consumer (e.g., the decode vLLM instance).

  • The provider side implements insert: insert a KV cache into a buffer so that it can be transferred upon request.
  • The consumer side implements drop_select: select a KV cache based on tokens, transfer the selected KV, and drop it from the buffer.

The two roles run on different machines; a minimal sketch of this interface is shown below.
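For illustration only, here is a minimal sketch (not the actual vLLM code) of that two-sided interface; the tensor arguments and matching logic are simplified, and the class name is hypothetical:

```python
from typing import List, Optional
import torch

class KVLookupBuffer:
    """Hypothetical sketch of the provider/consumer buffer interface."""

    def __init__(self) -> None:
        self.buffer: List[dict] = []

    def insert(self, tokens: torch.Tensor, kv: torch.Tensor) -> None:
        # Provider side (prefill instance): stage a KV cache so it can be
        # transferred when a consumer asks for it.
        self.buffer.append({"tokens": tokens, "kv": kv})

    def drop_select(self, tokens: torch.Tensor) -> Optional[torch.Tensor]:
        # Consumer side (decode instance): find the KV cache matching
        # `tokens`, hand it back (i.e., transfer it), and drop it from
        # the buffer.
        for i, entry in enumerate(self.buffer):
            if torch.equal(entry["tokens"], tokens):
                return self.buffer.pop(i)["kv"]
        return None
```

In the real implementation the two sides live in different processes on different machines, and the transfer inside drop_select goes over the configured pipe (Mooncake or PyNccl) rather than a local pop.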

Integration guide: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm-integration-v0.2.md

Benchmark results: https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm-benchmark-results-v0.2.md
More benchmark results will be added in the future.

Test files will be added to align with the future test CI pipeline for PR #10502.

CC list:
@KuntaiDu @youkaichao @alogfans @stmatengss @james0zan

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
github-actions bot commented Dec 4, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they run only the fastcheck CI, which covers a small, essential subset of tests to catch errors quickly. You can run further CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can do one of the following:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@KuntaiDu (Collaborator) commented Dec 6, 2024

I'm currently working on an OSDI submission; I will review after Dec 10.

@Jeffwan (Contributor) commented Dec 8, 2024

This is a great demonstration of adopting Mooncake in the current disaggregation implementation. Could you share some benchmark data and best practices here? Transfer Engine's primary features, such as broader protocol support and topology-aware path selection, would be beneficial in larger-scale clusters. I am just curious how Mooncake performs in the simple 1P1D case or in homogeneous environments.

@ShangmingCai (Contributor, Author) commented Dec 9, 2024

> This is a great demonstration of adopting Mooncake in the current disaggregation implementation. Could you share some benchmark data and best practices here? Transfer Engine's primary features, such as broader protocol support and topology-aware path selection, would be beneficial in larger-scale clusters. I am just curious how Mooncake performs in the simple 1P1D case or in homogeneous environments.

Here are some preview Mooncake benchmark results on A10 GPUs with up to 2 RDMA NICs. I am currently having trouble benchmarking PyNcclConnector: for reasons still unknown, it crashes frequently in inter-node disaggregated scenarios. I am digging into the lookup_buffer and connector to identify the root cause, but I haven't found it yet, so the benchmark results do not include PyNcclConnector.

Varying tp (input length = 1024, qps = 2, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tp = 1 | 2 | 200 | 99.47 | 201995 | 1200 | 2.01 | 12.06 | 2042.74 | 1056.76 | 635.00 | 4006.59 | 97.08 | 26.94 | 781.91 | 97.01 | 14.05 | 2205.51 |
| tp = 2 | 2 | 200 | 98.98 | 201995 | 1200 | 2.02 | 12.12 | 2052.95 | 314.87 | 231.20 | 949.40 | 25.65 | 15.56 | 129.60 | 25.62 | 15.48 | 288.06 |
| tp = 4 | 2 | 200 | 98.76 | 201995 | 1200 | 2.03 | 12.15 | 2057.44 | 198.10 | 160.03 | 461.61 | 23.52 | 18.93 | 94.38 | 23.50 | 18.01 | 187.79 |
| tp = 1 | 1 | 200 | 99.44 | 201995 | 1200 | 2.01 | 12.07 | 2043.39 | 1071.12 | 631.56 | 4361.02 | 83.93 | 26.93 | 794.75 | 83.86 | 14.13 | 1932.66 |
| tp = 2 | 1 | 200 | 98.96 | 201995 | 1200 | 2.02 | 12.13 | 2053.35 | 335.26 | 258.30 | 997.93 | 28.84 | 15.56 | 144.82 | 28.80 | 15.42 | 397.56 |
| tp = 4 | 1 | 200 | 98.78 | 201995 | 1200 | 2.02 | 12.15 | 2057.03 | 201.68 | 162.85 | 456.33 | 22.31 | 16.74 | 94.76 | 22.29 | 16.73 | 189.13 |
| tp = 1 | TCP | 200 | 99.55 | 201995 | 1200 | 2.01 | 12.05 | 2041.13 | 1414.05 | 766.23 | 6035.36 | 155.01 | 35.28 | 1191.24 | 154.91 | 14.32 | 3148.99 |
| tp = 2 | TCP | 200 | 98.97 | 201995 | 1200 | 2.02 | 12.12 | 2053.03 | 333.74 | 251.32 | 954.63 | 28.74 | 15.49 | 161.24 | 28.70 | 15.35 | 393.52 |
| tp = 4 | TCP | 200 | 98.78 | 201995 | 1200 | 2.02 | 12.15 | 2056.94 | 205.37 | 162.92 | 463.70 | 21.54 | 16.51 | 94.04 | 21.51 | 16.56 | 170.54 |

Varying qps (input length = 1024, tp = 4, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| qps = 2 | 2 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.33 | 200.64 | 156.62 | 478.22 | 22.63 | 17.35 | 99.61 | 22.60 | 17.08 | 186.25 |
| qps = 4 | 2 | 200 | 49.75 | 201995 | 1200 | 4.02 | 24.12 | 4084.03 | 341.88 | 240.68 | 1430.54 | 38.36 | 18.39 | 313.45 | 38.31 | 17.17 | 588.80 |
| qps = 6 | 2 | 200 | 33.44 | 201995 | 1200 | 5.98 | 35.88 | 6075.54 | 851.15 | 501.59 | 3239.89 | 102.51 | 47.67 | 606.77 | 102.34 | 18.35 | 1704.79 |
| qps = 8 | 2 | 200 | 27.16 | 201995 | 1200 | 7.36 | 44.19 | 7482.52 | 4835.08 | 5733.45 | 8846.27 | 1276.59 | 1150.11 | 4401.23 | 1274.43 | 48.34 | 20682.35 |
| qps = 2 | 1 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.31 | 201.77 | 161.53 | 473.44 | 22.13 | 16.52 | 96.18 | 22.11 | 16.51 | 190.40 |
| qps = 4 | 1 | 200 | 49.76 | 201995 | 1200 | 4.02 | 24.12 | 4083.83 | 337.31 | 243.38 | 1395.85 | 39.95 | 17.61 | 325.39 | 39.88 | 17.06 | 838.68 |
| qps = 6 | 1 | 200 | 33.44 | 201995 | 1200 | 5.98 | 35.88 | 6075.99 | 820.53 | 458.84 | 3169.52 | 83.92 | 30.50 | 663.07 | 83.78 | 17.85 | 1306.32 |
| qps = 8 | 1 | 200 | 27.19 | 201995 | 1200 | 7.36 | 44.14 | 7473.44 | 5291.91 | 6160.55 | 9596.56 | 1190.36 | 1040.63 | 4418.66 | 1188.33 | 47.61 | 20815.23 |
| qps = 2 | TCP | 200 | 98.76 | 201995 | 1200 | 2.03 | 12.15 | 2057.42 | 207.22 | 160.81 | 511.01 | 22.17 | 16.59 | 94.96 | 22.15 | 16.59 | 181.82 |
| qps = 4 | TCP | 200 | 49.79 | 201995 | 1200 | 4.02 | 24.10 | 4081.06 | 355.43 | 252.63 | 1554.91 | 40.15 | 16.92 | 314.28 | 40.09 | 16.66 | 708.50 |
| qps = 6 | TCP | 200 | 33.49 | 201995 | 1200 | 5.97 | 35.83 | 6067.71 | 907.74 | 514.85 | 3253.93 | 122.75 | 45.51 | 648.40 | 122.56 | 18.09 | 2282.92 |
| qps = 8 | TCP | 200 | 28.39 | 201995 | 1200 | 7.04 | 42.26 | 7156.09 | 6714.57 | 7885.09 | 11787.51 | 1116.06 | 408.32 | 4645.25 | 1114.29 | 46.87 | 21898.03 |

Varying input length (tp = 4, qps = 2, output length = 6)

| Setting | num_rdma_nic | Successful Requests | Duration (s) | Total Input Tokens | Total Generated Tokens | Req Throughput (req/s) | Output Token Throughput (tok/s) | Total Token Throughput (tok/s) | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Median TPOT (ms) | P99 TPOT (ms) | Mean ITL (ms) | Median ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1024 | 2 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.32 | 195.47 | 151.55 | 482.84 | 22.83 | 19.27 | 96.55 | 22.81 | 18.12 | 158.16 |
| 2048 | 2 | 200 | 99.22 | 406707 | 1200 | 2.02 | 12.09 | 4110.95 | 723.76 | 488.67 | 2941.96 | 67.25 | 18.93 | 632.73 | 67.20 | 17.49 | 1209.54 |
| 4096 | 2 | 200 | 117.42 | 818415 | 1200 | 1.70 | 10.22 | 6979.90 | 14616.48 | 18323.82 | 23191.04 | 8042.84 | 7593.16 | 19851.11 | 8040.02 | 65.43 | 93511.26 |
| 8192 | 2 | 200 | 247.77 | 1636065 | 1200 | 0.81 | 4.84 | 6608.10 | 75783.36 | 79331.60 | 147544.42 | 16961.27 | 15140.11 | 39278.98 | 16958.32 | 90.01 | 186151.61 |
| 1024 | 1 | 200 | 98.77 | 201995 | 1200 | 2.02 | 12.15 | 2057.31 | 201.77 | 161.53 | 473.44 | 22.13 | 16.52 | 96.18 | 22.11 | 16.51 | 190.40 |
| 2048 | 1 | 200 | 99.25 | 406707 | 1200 | 2.02 | 12.09 | 4109.96 | 719.43 | 482.02 | 3208.13 | 61.92 | 17.64 | 681.26 | 61.86 | 16.83 | 978.90 |
| 4096 | 1 | 200 | 111.88 | 818415 | 1200 | 1.79 | 10.73 | 7326.16 | 20362.10 | 22807.05 | 31853.55 | 5915.16 | 4521.51 | 18739.12 | 5913.18 | 67.03 | 81600.29 |
| 8192 | 1 | 200 | 270.01 | 1636065 | 1200 | 0.74 | 4.44 | 6063.79 | 103355.40 | 106546.65 | 172025.11 | 12894.35 | 11027.66 | 35110.13 | 12892.85 | 64.84 | 151774.68 |
| 1024 | TCP | 200 | 98.81 | 201995 | 1200 | 2.02 | 12.14 | 2056.44 | 203.32 | 160.83 | 460.90 | 21.81 | 16.96 | 95.27 | 21.78 | 16.91 | 171.80 |
| 2048 | TCP | 200 | 99.27 | 406707 | 1200 | 2.01 | 12.09 | 4108.98 | 731.60 | 484.78 | 3213.69 | 68.55 | 17.88 | 639.93 | 68.49 | 17.33 | 1257.45 |
| 4096 | TCP | 200 | 118.37 | 818415 | 1200 | 1.69 | 10.14 | 6923.89 | 23735.69 | 27101.97 | 36573.47 | 6386.62 | 5102.00 | 20032.26 | 6384.71 | 69.57 | 92811.27 |
| 8192 | TCP | 200 | 278.12 | 1636065 | 1200 | 0.72 | 4.31 | 5886.95 | 106873.23 | 109941.33 | 179781.64 | 13360.87 | 12155.24 | 36022.96 | 13359.20 | 68.01 | 156716.38 |

As for best practices, I believe there is no established best practice before XpYd support is ready. But if you want to test the Mooncake Transfer Engine, you can follow the guidance doc to reproduce the results.

In addition, we are coordinating resources to provision machines with more RDMA NICs and more advanced GPUs. Official benchmark results will be released in due time.

@junna2016 commented:
Using the latest Mooncake code, I tested with tp=1, num_rdma_nic=2, qps=2, input_len=200, output_len=100 on a single machine, with one prefill instance and one decode instance.
My mooncake_config.json is shown below:

```json
{
    "prefill_url": "127.0.0.1:8144",
    "decode_url": "127.0.0.1:8149",
    "metadata_server": "127.0.0.1:2333",
    "metadata_backend": "etcd",
    "protocol": "rdma",
    "device_name": "mlx5_0,mlx5_1"
}
```

The following error occurs in the transfer engine:

```
E1213 02:57:10.528410 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:14.286239 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
E1213 02:57:18.044381 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:21.802461 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
```

With a single RDMA device (mlx5_0 or mlx5_1), it works fine.

@ShangmingCai (Contributor, Author)

@junna2016 I think these errors are reported by the underlying Transfer Engine. Please open an issue in the Mooncake repo and we will get someone to help identify the root cause.

@junna2016

> @junna2016 I think these errors are reported by the underlying Transfer Engine. Please open an issue in the Mooncake repo and we will get someone to help identify the root cause.

kvcache-ai/Mooncake#35

@KuntaiDu (Collaborator) left a comment

Clean implementation! I left some comments, PTAL.

```python
if self.transport_thread is None:
    self.transport_thread = ThreadPoolExecutor(max_workers=1)
tensor = self.transport_thread.submit(self._recv_impl).result()
if tensor.numel() == 1 and tensor.item() == NONE_INT:
```
@KuntaiDu (Collaborator)

This part is a bit tricky: tensor.item() can return the wrong value when I use vLLM's PyNccl to transmit a tensor, even though print(tensor) shows the right value. You don't need to change any code here if you don't see any bug when stress-testing this code (I have stress-test code in tests/kv_transfer/test_send_recv.sh; you can try that).
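For illustration, one defensive variant of that sentinel check is to copy the scalar to the CPU before reading it, which forces a device synchronization. This is only a sketch, assuming a sentinel integer marks "None" tensors (the placeholder value below is not the real one), not a required change:

```python
import torch

NONE_INT = -1  # placeholder; the real pipes define their own sentinel value

def is_none_tensor(tensor: torch.Tensor) -> bool:
    """Return True if `tensor` is the single-element 'None' sentinel."""
    if tensor.numel() != 1:
        return False
    # .cpu() copies the scalar to host memory and synchronizes with the
    # device, so the comparison never reads a value that is still in flight.
    return tensor.detach().cpu().item() == NONE_INT
```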

@ShangmingCai (Contributor, Author)

For now, MooncakePipe has not encountered similar problems during stress testing and benchmarking. However, we will keep this part in mind and make corresponding modifications and tests as PyNcclPipe changes.

@KuntaiDu (Collaborator) left a comment

LGTM! Thank you for your contribution!

@KuntaiDu KuntaiDu enabled auto-merge (squash) December 15, 2024 20:00
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 15, 2024
@KuntaiDu KuntaiDu merged commit d263bd9 into vllm-project:main Dec 15, 2024
69 checks passed
@liweiqing1997

Hello, could you please provide the PCIe bus bandwidth of the benchmark machine? This will help us make a direct comparison. Currently, the machine I am using is configured with 8x A100-SXM4-80GB GPUs, and the bus bandwidth for each slot is 16GT/s, width x16.

@stmatengss

@liweiqing1997 Hello. Here we use 8x GA102GL [A10] GPUs; the PCIe PHY link speed is 16GT/s (x16).

@liweiqing1997

> @liweiqing1997 Hello. Here we use 8x GA102GL [A10] GPUs; the PCIe PHY link speed is 16GT/s (x16).

Hello, may I ask if you're using GPUDirect RDMA? I want to know what the transfer path for the KV cache is: is it VRAM->DRAM->DRAM->VRAM, or VRAM->VRAM?

@ShangmingCai (Contributor, Author)

> @liweiqing1997 Hello. Here we use 8x GA102GL [A10] GPUs; the PCIe PHY link speed is 16GT/s (x16).

> Hello, may I ask if you're using GPUDirect RDMA? I want to know what the transfer path for the KV cache is: is it VRAM->DRAM->DRAM->VRAM, or VRAM->VRAM?

@liweiqing1997 Currently, it is VRAM->DRAM->DRAM->VRAM. Mooncake supports GPUDirect, though. We still have some work to do to implement zero-copy; it is on our roadmap.
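To make that path concrete, here is a rough sketch with hypothetical pipe methods (transfer_sync and recv_bytes are placeholders, not the real Mooncake API): the sender stages VRAM to DRAM, Transfer Engine moves DRAM to DRAM over RDMA or TCP, and the receiver moves DRAM back to VRAM. Zero-copy/GPUDirect would collapse the staging steps.

```python
import pickle
import torch

def send_kv(kv_gpu: torch.Tensor, pipe) -> None:
    staged = kv_gpu.detach().cpu()   # VRAM -> DRAM on the prefill side
    payload = pickle.dumps(staged)   # serialize into the registered DRAM buffer
    pipe.transfer_sync(payload)      # DRAM -> DRAM via Transfer Engine

def recv_kv(pipe, device: str = "cuda") -> torch.Tensor:
    payload = pipe.recv_bytes()      # lands in the DRAM buffer on the decode side
    staged = pickle.loads(payload)   # tensor deserialized in DRAM
    return staged.to(device)         # DRAM -> VRAM
```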

@liuyumoye

Hello, I am trying MooncakeConnector in vLLM with {"protocol": "tcp"} and {"kv_buffer_device": "cpu"}. When the prefill node needs to send a tensor, I move the tensor from the GPU to the CPU. But when the input token count of a request is greater than 800, there is an error on the decode node:
```
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/worker/model_runner.py", line 1668, in execute_model
ERROR 12-26 00:15:20 engine.py:135]     get_kv_transfer_group().recv_kv_caches_and_hidden_states(
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 74, in recv_kv_caches_and_hidden_states
ERROR 12-26 00:15:20 engine.py:135]     return self.connector.recv_kv_caches_and_hidden_states(
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 236, in recv_kv_caches_and_hidden_states
ERROR 12-26 00:15:20 engine.py:135]     ret = self.select(current_tokens,
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/simple_connector.py", line 137, in select
ERROR 12-26 00:15:20 engine.py:135]     return self.consumer_buffer.drop_select(input_tokens, roi)
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py", line 207, in drop_select
ERROR 12-26 00:15:20 engine.py:135]     key = self.data_pipe.recv_tensor().to(ori_input_device)
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 268, in recv_tensor
ERROR 12-26 00:15:20 engine.py:135]     tensor = self.transport_thread.submit(self._recv_impl).result()
ERROR 12-26 00:15:20 engine.py:135]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
ERROR 12-26 00:15:20 engine.py:135]     return self.__get_result()
ERROR 12-26 00:15:20 engine.py:135]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
ERROR 12-26 00:15:20 engine.py:135]     raise self._exception
ERROR 12-26 00:15:20 engine.py:135]   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR 12-26 00:15:20 engine.py:135]     result = self.fn(*self.args, **self.kwargs)
ERROR 12-26 00:15:20 engine.py:135]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_pipe/mooncake_pipe.py", line 254, in _recv_impl
ERROR 12-26 00:15:20 engine.py:135]     return pickle.loads(data)
ERROR 12-26 00:15:20 engine.py:135] _pickle.UnpicklingError: invalid load key, '\xb9'.
```
Do you happen to know what the reason might be?

It seems the error occurs where the K cache is received. I compared the received source pointer and data length, and they match what the prefill node transmits. I don't know whether the K-cache payload transmitted over TCP is simply too large while my network is relatively slow, resulting in lost data.

@ShangmingCai (Contributor, Author)

> Hello, I am trying MooncakeConnector in vLLM with {"protocol": "tcp"} and {"kv_buffer_device": "cpu"}. When the prefill node needs to send a tensor, I move the tensor from the GPU to the CPU. But when the input token count of a request is greater than 800, there is an error on the decode node: [...]

@liuyumoye You should not set kv_buffer_device to "cpu"; currently kv_buffer_device only supports "cuda".

@liuyumoye

@ShangmingCai Sorry, I don't understand why kv_buffer_device only supports "cuda".
I noticed that both the send and receive paths for the KV cache call mva.mooncake_vllm_adaptor.allocateManagedBuffer to allocate space on the CPU, and then write the serialized KV cache (via pickle.dumps) into that buffer, so the transmission chain is "tensor in GPU" -> "CPU buffer" -> "CPU buffer" -> "tensor in GPU". If the initial tensor is on the CPU, there should be no impact on the transfer chain. Am I understanding this correctly?

By the way, I am also curious: if the initial tensor is on the GPU, where are the VRAM->DRAM and DRAM->VRAM steps of the transfer chain implemented?

@ShangmingCai (Contributor, Author)

> @ShangmingCai Sorry, I don't understand why kv_buffer_device only supports "cuda". I noticed that both the send and receive paths for the KV cache call mva.mooncake_vllm_adaptor.allocateManagedBuffer to allocate space on the CPU, and then write the serialized KV cache (via pickle.dumps) into that buffer, so the transmission chain is "tensor in GPU" -> "CPU buffer" -> "CPU buffer" -> "tensor in GPU".

That is correct. But the kv_buffer_device parameter configures the buffer location for vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py, not for Mooncake. The lookup buffer only supports "cuda" now, so the tensor eventually needs to be moved to VRAM for the time being.

See the code: https://github.com/vllm-project/vllm/blob/81b979f2a8f7ec91c262dac7dcbf30ed577ebafd/vllm/config.py#L2511C1-L2513C45

```python
# The device used by kv connector to buffer the KV cache.
# Currently only support 'cuda'.
kv_buffer_device: Optional[str] = "cuda"
```

> By the way, I am also curious: if the initial tensor is on the GPU, where are the VRAM->DRAM and DRAM->VRAM steps of the transfer chain implemented?

You can check the implementations of simple_buffer and simple_connector. Mooncake only handles the data transfer; the other logic has not been touched yet. A condensed sketch of the consumer-side receive path follows.
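For orientation, the traceback earlier in this thread already shows where one of the device moves happens: the simple buffer receives a tensor via the pipe and then moves it to the caller's device. The snippet below is a condensed, approximate sketch of that flow (simplified from the real simple_buffer; the send-side handshake on input_tokens and roi is omitted):

```python
import torch

def drop_select(buffer, input_tokens: torch.Tensor, roi: torch.Tensor):
    # Tensors arrive from the peer via the configured pipe (Mooncake or
    # PyNccl); `.to(ori_input_device)` is the hop onto the device the
    # caller expects (e.g., DRAM -> VRAM on the decode side).
    ori_input_device = input_tokens.device
    key = buffer.data_pipe.recv_tensor().to(ori_input_device)
    value = buffer.data_pipe.recv_tensor().to(ori_input_device)
    hidden = buffer.data_pipe.recv_tensor().to(ori_input_device)
    return [key, value, hidden]
```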

@liuyumoye commented Dec 27, 2024

@ShangmingCai

> That is correct. But the kv_buffer_device parameter configures the buffer location for vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py, not for Mooncake. The lookup buffer only supports "cuda" now, so the tensor eventually needs to be moved to VRAM for the time being.

I think that code comment exists because vLLM uses PyNccl for transmitting the KV cache, but that is not the case for Mooncake.

> You can check the implementations of simple_buffer and simple_connector. Mooncake only handles the data transfer; the other logic has not been touched yet.

There is no transfer from VRAM to DRAM in the simple_buffer and simple_connector implementations. So doesn't Mooncake itself copy from VRAM to DRAM when it transfers VRAM data?

@ShangmingCai (Contributor, Author)

> I think that code comment exists because vLLM uses PyNccl for transmitting the KV cache, but that is not the case for Mooncake.

Nope. Both PyNccl and Mooncake need to send data to the kv_buffer in VRAM. That's why we need to set kv_buffer_size: this buffer is separate from the paged KV cache.
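For intuition on how large that buffer can get, here is a back-of-envelope estimate; the model-shape numbers are illustrative assumptions, not measurements from this PR:

```python
# Per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes.
def kv_bytes_per_token(layers: int = 32, kv_heads: int = 8,
                       head_dim: int = 128, dtype_bytes: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes

tokens_in_flight = 8 * 1024  # e.g., a few long prefills buffered at once
buffer_bytes = kv_bytes_per_token() * tokens_in_flight
print(f"{buffer_bytes / 1e9:.2f} GB")  # ~1.07 GB for these example numbers
```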

@ShangmingCai (Contributor, Author)

> There is no transfer from VRAM to DRAM in the simple_buffer and simple_connector implementations. So doesn't Mooncake itself copy from VRAM to DRAM when it transfers VRAM data?

BTW, tensor objects carry device info. If you are not sure where the transfers happen, printing the device of each tensor might help.
