[bugfix][DCP] fix block_size of hash in DCP prefix caching #26296
Conversation
Code Review
This pull request fixes a bug in prefix caching with Decode Context Parallelism (DCP) by correctly calculating the block size for hashing. The change ensures that the block_size is multiplied by decode_context_parallel_size, which is consistent with how the scheduler handles block sizes in DCP. My review includes a suggestion to refactor the code to avoid logic duplication, which was the root cause of this bug, thereby improving maintainability and preventing similar issues in the future.
vllm/v1/engine/core.py (Outdated)
```python
hash_block_size = (
    vllm_config.cache_config.block_size
    * vllm_config.parallel_config.decode_context_parallel_size
)
```
This calculation for hash_block_size duplicates logic from the Scheduler's __init__ method, where its block_size attribute is also adjusted for decode_context_parallel_size. This duplication is what led to the original bug this PR is fixing.
To avoid this, you can reuse the block_size from the scheduler instance. This makes the code more robust by using a single source of truth.
While block_size is not on the SchedulerInterface, using it with a type: ignore is a pragmatic way to remove the duplication within this file. A more complete solution would involve exposing the logical block size through the interface or a shared utility, which could be addressed in a follow-up.
```python
hash_block_size = self.scheduler.block_size  # type: ignore
```
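For context, here is a minimal runnable sketch of the single-source-of-truth idea, using stand-in classes rather than the real vLLM `Scheduler`/`SchedulerInterface`; the config values and class names are illustrative assumptions, not the actual implementation.

```python
from types import SimpleNamespace

# Stand-in config mirroring the attribute names used in this PR; the
# values are made up for illustration.
vllm_config = SimpleNamespace(
    cache_config=SimpleNamespace(block_size=16),
    parallel_config=SimpleNamespace(decode_context_parallel_size=2),
)

class SchedulerStandIn:
    def __init__(self, cfg):
        # The real Scheduler.__init__ already scales its block_size for DCP.
        self.block_size = (
            cfg.cache_config.block_size
            * cfg.parallel_config.decode_context_parallel_size
        )

scheduler = SchedulerStandIn(vllm_config)

# Reusing the scheduler's value keeps one source of truth instead of
# recomputing the product in engine/core.py (in the real code the attribute
# is not on SchedulerInterface, hence the suggested `# type: ignore`).
hash_block_size = scheduler.block_size
assert hash_block_size == 32
```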
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/v1/engine/core.py (Outdated)
```diff
     self.vllm_config.cache_config.enable_prefix_caching
     or self.scheduler.get_kv_connector() is not None
 ):
-    block_size = vllm_config.cache_config.block_size
+    hash_block_size = (
+        vllm_config.cache_config.block_size
+        * vllm_config.parallel_config.decode_context_parallel_size
+    )
     caching_hash_fn = get_hash_fn_by_name(
         vllm_config.cache_config.prefix_caching_hash_algo
     )
     init_none_hash(caching_hash_fn)

     self.request_block_hasher = get_request_block_hasher(
-        block_size, caching_hash_fn
+        hash_block_size, caching_hash_fn
     )
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid scaling block hashes when only KV connectors use them
The new hash setup multiplies cache_config.block_size by decode_context_parallel_size before computing request.block_hashes. That makes each hash correspond to block_size × dcp_world_size tokens (engine/core.py lines 175‑190). KV connectors, however, still interpret request.block_hashes in units of the original GPU block size; for example OffloadingConnectorScheduler.get_num_new_matched_tokens asserts len(request.block_hashes) // self.block_size_factor == num_blocks where self.block_size_factor is derived from cache_config.block_size (offloading_connector.py lines 179‑181 and spec.py lines 33‑38). With DCP>1 and a KV connector enabled but prefix caching disabled, len(request.block_hashes) shrinks by the DCP factor and the assertion will fail or the connector will mis-index blocks. Either keep hashes at the GPU block granularity when only a connector is present or update the connector code to handle the larger hash stride.
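To make the mismatch concrete, here is a small arithmetic sketch of the check Codex cites; the offloaded block size and all numbers are hypothetical and not taken from the connector code.

```python
# Illustrative numbers only; the offloaded block size is an assumption.
gpu_block_size = 16        # cache_config.block_size
dcp_world_size = 2         # decode_context_parallel_size
offloaded_block_size = 64  # hypothetical connector block size

# The connector derives its factor from the GPU block size.
block_size_factor = offloaded_block_size // gpu_block_size    # 4

prompt_len = 256
num_blocks = prompt_len // offloaded_block_size               # 4

# Hashes at GPU-block granularity (what the connector expects):
hashes_gpu = prompt_len // gpu_block_size                     # 16
assert hashes_gpu // block_size_factor == num_blocks          # holds

# Hashes at the DCP-scaled granularity introduced by this PR shrink by
# dcp_world_size, so the same check no longer holds:
hashes_dcp = prompt_len // (gpu_block_size * dcp_world_size)  # 8
assert hashes_dcp // block_size_factor != num_blocks          # 2 != 4
```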
cc @youzhedian
Thank you for the fix.
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
…ect#26296) Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
…ect#26296) Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: bbartels <benjamin@bartels.dev>
…ect#26296) Signed-off-by: Chen Zhang <zhangch99@outlook.com>
### What this PR does / why we need it?
This is step 1 of refactoring the code to adapt to vllm main; this PR is aligned with vllm-project/vllm@17c540a.
1. Refactor deepseek to the latest code arch as of vllm-project/vllm@17c540a
2. Bunches of fixes due to vllm changes
   - Fix `AscendScheduler` `__post_init__`, caused by vllm-project/vllm#25075
   - Fix `AscendScheduler` init got an unexpected arg `block_size`, caused by vllm-project/vllm#26296
   - Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by vllm-project/vllm#23485
   - Fix `MLAAttention` import, caused by vllm-project/vllm#25103
   - Fix `SharedFusedMoE` import, caused by vllm-project/vllm#26145
   - Fix `LazyLoader` import, caused by vllm-project/vllm#27022
   - Fix `vllm.utils.swap_dict_values` import, caused by vllm-project/vllm#26990
   - Fix `Backend` enum import, caused by vllm-project/vllm#25893
   - Fix `CompilationLevel` renaming to `CompilationMode` issue introduced by vllm-project/vllm#26355
   - Fix fused_moe ops, caused by vllm-project/vllm#24097
   - Fix bert model because of `inputs_embeds`, caused by vllm-project/vllm#25922
   - Fix MRope because of `get_input_positions_tensor` to `get_mrope_input_positions`, caused by vllm-project/vllm#24172
   - Fix `splitting_ops` changes introduced by vllm-project/vllm#25845
   - Fix multi-modality changes introduced by vllm-project/vllm#16229
   - Fix lora bias dropping issue introduced by vllm-project/vllm#25807
   - Fix structured output break introduced by vllm-project/vllm#26737

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
CI passed with existing tests.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>
…ect#26296) Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
…ect#26296) Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Purpose
In DCP, the block_size is multiplied by the DCP size; we should apply the same scaling to the block hash function.
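As a rough illustration of why the two granularities must agree (the numbers below are assumptions, not the PR's actual configuration or test output):

```python
# Assumed values for illustration only.
block_size = 16                          # cache_config.block_size (per GPU)
dcp_size = 2                             # decode_context_parallel_size
scheduler_block = block_size * dcp_size  # scheduler's logical block under DCP

prompt_len = 150
# The scheduler matches cached prefixes in whole logical blocks, so the
# request block hasher must emit one hash per scheduler_block tokens for
# prefix-cache lookups to line up with what the scheduler can actually reuse.
num_block_hashes = prompt_len // scheduler_block  # 4 hashes covering 128 tokens
```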
Test Plan
Test Result
The cache hit length should be < 150.
Before this PR, the cache hit lengths of the two requests are:
After this PR, the cache hit lengths are:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.