[P/D] Mooncake Connector for v1 distributed #1568
Conversation
Force-pushed from f8d44b2 to 7977dac
Change log 07/03: changed transfer_sync_read to batch_transfer_sync_read

@LCAIZJ please make CI happy

God bless us
Force-pushed from d6a48f0 to f8fbcf6
Codecov Report: ❌ Patch coverage is

```
@@            Coverage Diff             @@
##             main    #1568      +/-   ##
==========================================
+ Coverage   76.35%   77.04%    +0.69%
==========================================
  Files         117      120        +3
  Lines       13371    14788     +1417
==========================================
+ Hits        10209    11393     +1184
- Misses       3162     3395      +233
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
@jianzs CI is happy
This pull request has conflicts, please resolve those before we can evaluate the pull request.

solve conflicts

This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: leichao.lc <leichao139636@163.com>
@wangxiyuan Apologies, the code review comments were lost due to a force-push. I have implemented the majority of your suggestions; the Mooncake version specifics still need elaboration.
| "prefill with num_computed_tokens == 0." | ||
| # Assume that the request's KV cache is already fully prefilled and | ||
| # can be fetched entirely from the prefill node. | ||
| count = max(len(request.prompt_token_ids) - 1, 0) |
This line looks strange; shouldn't the external computed tokens exclude the locally computed tokens? Or do you just not support prefix cache on the decode instance?
Correct. This implementation makes two assumptions: the decode node runs without prefix cache enabled, and the prefill KV cache is not truncated to block-size boundaries.
Why? Prefix caching is almost a free lunch; it should bring no pain in any scenario.
Good suggestion. While we typically consider prefix cache and KV pool as a coupled solution, enabling GPU prefix cache alone could be a great option for decode nodes. We'll explore this approach.
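For context, a minimal sketch of what the assumption above implies for the external-token count; the helper name and signature are illustrative, not the connector's actual method:

```python
# Minimal sketch of the assumption discussed above; names are hypothetical.
def num_external_tokens_sketch(prompt_token_ids: list[int],
                               num_computed_tokens: int) -> int:
    # With prefix cache disabled on the decode node, nothing has been
    # computed locally when the request arrives.
    assert num_computed_tokens == 0
    # The prefill node holds the full KV cache for the prompt; keep one
    # token back so the decoder can run a forward pass and emit logits.
    return max(len(prompt_token_ids) - 1, 0)
```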
```python
if all(p in params for p in ("remote_engine_id", "remote_host",
                             "remote_port")):
    local_block_ids = (blocks.get_unhashed_block_ids()
                       if num_external_tokens > 0 else [])
```
Just out of curiosity, is there any chance in disaggregated P/D that a remotely computed request does not have num_external_tokens?
You're correct. In our current implementation the decode node runs without prefix cache enabled, so the scenario where num_external_tokens == 0 does not occur. If remote blocks exist and num_external_tokens == 0, we have a full prefix-cache hit on the D worker; in that case we still need to call send_notif in _read_blocks to free the memory on the P node.
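A rough sketch of that edge case, assuming hypothetical signatures (send_notif and _read_blocks are named in the discussion above, but their real signatures may differ):

```python
# Sketch only: if the decode worker already has a full prefix-cache hit
# (no blocks to pull), the prefill node still holds reserved KV blocks
# that must be released via a notification.
def _read_blocks_sketch(self, local_block_ids, remote_block_ids,
                        remote_host, remote_port, request_id):
    if not local_block_ids:
        # Nothing to transfer, but tell the P node it can free the
        # blocks it reserved for this request.
        self.send_notif(remote_host, remote_port, request_id)
        return
    # Otherwise pull the remote blocks as usual (transfer details elided).
    ...
```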
```python
remote_engine_id=self.engine_id,
remote_host=self.side_channel_host,
remote_port=self.side_channel_port,
last_token_id=request.output_token_ids[-1],
```
So Mooncake only supports returning the first token from the prefiller's output? But how does the decode instance use this attribute? It seems to have no usage.
Thank you for raising this concern. This is on me. We've made internal modifications to the API server where the decoder appends this token ID to the original prompt token IDs upon receiving a request. We need to consider how to properly adapt this in the open-source version. Perhaps we could have the decoder directly return the first token?
Sounds good, I think it's worth a try.
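To make the idea concrete, a hedged sketch of the decode-side handling being discussed; the helper name is hypothetical and this is not merged code:

```python
from typing import Optional

# Sketch of the proposal above: the prefill node ships its first output
# token as `last_token_id`; the decode side could append it to the prompt
# so its KV layout matches the blocks pulled from the prefill node.
def build_decode_prompt(prompt_token_ids: list[int],
                        last_token_id: Optional[int]) -> list[int]:
    if last_token_id is None:
        return list(prompt_token_ids)
    return list(prompt_token_ids) + [last_token_id]
```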
```python
block_len = (self.block_len[k % 2]
             if self.use_mla else self.block_len[0])
for i, remote_block_id in enumerate(grouped_remote_block_ids):
    local_block_ids = grouped_local_block_ids[i]
```
So you grouped the block IDs to make sure you pull blocks in a locality-friendly way? How much improvement does it bring? Normally the block IDs should already be sequential in most scenarios; I just wonder whether the CPU overhead of the contiguous rearrangement is actually beaten by its transfer benefits.
Although we haven't measured the specific benefit of block grouping, our experiments show that blocks are often non-sequential. Since block counts tend to be modest, the computational overhead of the reordering should be negligible.
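For illustration, a sketch of contiguous grouping under the assumption that local and remote block ID lists are split at the same positions; the helper name is illustrative, not the connector's actual function:

```python
# Group matching local/remote block id lists into contiguous runs so each
# run can be moved with one batch_transfer_sync_read call instead of one
# call per block.
def group_contiguous(local_ids: list[int], remote_ids: list[int]):
    grouped_local, grouped_remote = [], []
    if not local_ids:
        return grouped_local, grouped_remote
    start = 0
    for i in range(1, len(local_ids) + 1):
        # Split whenever either side stops being consecutive.
        if (i == len(local_ids)
                or local_ids[i] != local_ids[i - 1] + 1
                or remote_ids[i] != remote_ids[i - 1] + 1):
            grouped_local.append(local_ids[start:i])
            grouped_remote.append(remote_ids[start:i])
            start = i
    return grouped_local, grouped_remote

# Example: group_contiguous([3, 4, 5, 9], [12, 13, 14, 20])
# -> ([[3, 4, 5], [9]], [[12, 13, 14], [20]])
```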
Signed-off-by: leichao.lc <leichao139636@163.com>
@LCAIZJ Hello, thanks for your contribution! Recently I'm trying to run vLLM v1 + Mooncake with the Mooncake connector on NPU. I built an image of your PR with vllm-ascend v0.10.1 and Ascend_transport. I tried to run it on K8s and assigned 4 of the 8 910B NPUs to the pod. Then I entered the container, exported the ENV variables, and encountered this problem from torch_npu (v2.7.1):
I also encountered the same issue. The device_id assigned by Kubernetes might start counting from 0 by default. In our connector we currently read the device_id from the OS environment variables, which may raise an error when querying device_ip in the hccn_conf file. A possible solution is to add a new environment variable in vLLM (e.g., VLLM_DEVICE_ID) and then retrieve the device_id from vllm.envs.
Signed-off-by: leichao.lc <leichao139636@163.com>
Thank you for your question. In the Mooncake Connector, we identify NPUs through the ASCEND_RT_VISIBLE_DEVICES parameter. For example, when using NPUs 2, 3, 4, and 5, your Kubernetes YAML should set ASCEND_RT_VISIBLE_DEVICES="2,3,4,5". If this parameter is not configured, we assume the NPU indexing starts from 0 by default.
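An illustrative sketch of the mapping described above, assuming a hypothetical helper name (not the connector's exact code):

```python
import os

# Translate a process-local device index into the physical NPU id using
# ASCEND_RT_VISIBLE_DEVICES, falling back to identity when it is unset.
def resolve_physical_device_id(local_rank: int) -> int:
    visible = os.environ.get("ASCEND_RT_VISIBLE_DEVICES", "")
    if not visible:
        # Assume NPU indexing starts from 0 when nothing is configured.
        return local_rank
    physical_ids = [int(d) for d in visible.split(",") if d.strip()]
    return physical_ids[local_rank]

# e.g. with ASCEND_RT_VISIBLE_DEVICES="2,3,4,5", local_rank=1 maps to
# NPU 3, whose device_ip can then be looked up in the hccn.conf file.
```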
Let's consider building the Mooncake package into the Docker image as well. @wxsIcey
The CI failed due to another known issue. Let's merge this first. Feel free to open a follow-up PR if there is any other issue.
@LCAIZJ @zzy-ContiLearn Thank you for your advice!
We ran some tests in our local environment using vLLM-Ascend v0.9.1 (we did not use v0.10.1 due to potential dependency compatibility concerns) and were unable to reproduce the issue you mentioned.
@LCAIZJ I used v0.9.1 vllm-ascend as well. In my setup I use K8s on a physical machine and run several vLLM (P or D) pods on it. As mentioned in my first question under this PR, the ENV variable ASCEND_RT_VISIBLE_DEVICES is not compatible with torch_npu, because for each pod the torch_npu in the container can only observe the mounted NPU devices, not all NPU devices on the machine. When I try to avoid using ASCEND_RT_VISIBLE_DEVICES by introducing a new ENV variable, I need to do extra operations to get the device for the current DP, as shown in the patch.
### What this PR does / why we need it?
This PR adopts Mooncake TransferEngine for KV cache registration and a pull_blocks-style disaggregated prefill implementation.

### Does this PR introduce any user-facing change?
No

### Dependencies
1. CANN
   Using Mooncake TransferEngine with Ascend Transport requires CANN version 8.2.RC1 or higher (see kvcache-ai/Mooncake#502 for details).
2. vllm-ascend
   This PR depends on changes introduced by vllm-project#950 (modifications to `model_runner_v1`) and vllm-project#1361 (updates to `schedule`), both of which have been merged into the `v0.9.1-dev` branch and are expected to land in `main` shortly.

### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@1c859a1

Signed-off-by: leichao.lc <leichao139636@163.com>
Co-authored-by: jianzs <zheng.shoujian@outlook.com>
Co-authored-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
Co-authored-by: Dreamerleader <2270923832@qq.com>
Co-authored-by: chris668899 <15105191595@126.com>
Co-authored-by: Pz1116 <zpbzpb123123@gmail.com>