[V1] Exception Handling when Loading KV Cache from Remote Store #21534
Conversation
Code Review
This pull request introduces an exception handling mechanism for failures when loading the KV cache from a remote store. The approach involves pre-allocating KV cache blocks, attempting to load from the remote store, and then scheduling requests based on the number of tokens that were successfully loaded.
The changes are well-structured, primarily involving plumbing a new finish_loading_dict from the worker to the scheduler to communicate the load status. The core logic in the scheduler and model runner appears sound. My review has identified a couple of issues in the new mock connector (async_offload_connector.py) used for testing, which include type mismatches and an incorrect return type. Addressing these will improve the correctness and robustness of the test suite.
ApostaC left a comment:
I like this PR! It looks clean and the changes are small. Just left a few quick comments.
Also, fault tolerance is a very important topic, so thanks for contributing!
vllm/v1/core/sched/scheduler.py
nit: add a comment here to briefly describe what the keys and values are
Same here
I like the design in that it doesn't need to modify any of the connector API. A minor concern, though, is that we don't have any docstring describing get_finished_loading's expected behavior.
One proposal is to add this function to the KV connector's base class and give it a dummy implementation like `return {}`. In that case, we can also skip the hasattr check here.
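A minimal sketch of that proposal (a simplified stand-in for the real base class; the docstring wording and default are assumptions based on this thread, not the merged code):

```python
# Hypothetical sketch: give the base connector a no-op default so callers can
# invoke get_finished_loading() unconditionally instead of using hasattr().
class KVConnectorBase_V1:
    def get_finished_loading(self) -> dict[str, int]:
        """Return a mapping from request_id to the number of KV tokens that
        finished loading from the remote store since the last call.

        Connectors that do not load KV asynchronously can keep this default,
        which reports nothing.
        """
        return {}
```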
Thanks for the detailed feedback! I agree that adding a docstring for get_finished_loading would make the behavior clearer, and moving it to the base class with a default implementation is a great idea to simplify the code. Appreciate your suggestions! 😊
Maybe change the name to "RandomDropConnector" or some other descriptive name, and add a docstring noting that this is for fault tolerance testing?
Great suggestion! I've renamed the class to RandomDropConnector and updated the docstring to explicitly mention that it's designed for fault tolerance testing. This should make the purpose of the class much clearer. Thanks for pointing this out! 😊
vllm/v1/core/sched/scheduler.py
Same here
Thanks for these suggestions! I've added some comments and moved get_finished_loading to the base class. Let me know if there's anything else that needs improvement! 😊
KuntaiDu left a comment:
LGTM!
Hi @liuyumoye, @ApostaC, @KuntaiDu,

Please note that this PR was merged without any reference to the ongoing work in #19330, which has been open for several weeks and discussed extensively with both community members and core maintainers. Given that some of the contributors and reviewers here were aware of that effort, it would have been helpful to coordinate or at least cross-reference to ensure alignment.

This PR appears to overlap significantly with the goals and scope of #19330, which introduces KV load failure recovery — a feature that has been carefully designed, reviewed, and extended with a clear roadmap. Moreover, this PR addresses only asynchronous KV loading, lacks test coverage, and moves us further away from the planned block-based connector API. In contrast, #19330 provides support for both sync and async KV loading, and tensor/pipeline parallelism, along with a more comprehensive and validated test setup.

To avoid duplication, ensure correctness, and converge on a unified and robust solution, I suggest reverting this PR and continuing the work within the scope of #19330. That PR was ready to merge until just a few hours ago and can move forward immediately once this is reverted.

Tagging @njhill @youkaichao @robertgshaw2-redhat @NickLucche @orozery for visibility. I'd appreciate your thoughts on how we can align and move forward collaboratively.
This example only runs an online serving instance without specifying input or expected output. While it can verify that requests with load failures complete, it does not validate the correctness of the output.
yeah this looks unnecessary, it should probably belong in examples or even better just in the docs
finished_set.add(req_id)
del remaining_count_dict[req_id]

def update_finished_load_dict(worker_finished_loading_dict: dict[str,
This aggregation logic assumes that all outputs arrive in the same step. If that’s not the case, it may prematurely propagate the first worker’s output back to the scheduler. In particular, it doesn’t appear to support pipeline-parallel setups.
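To make the concern concrete, here is a rough sketch (the class, method names, and structure are illustrative assumptions, not the PR's actual code) of an aggregation that only surfaces a request once every worker has reported it, possibly across different steps, which is what a pipeline-parallel setup would need:

```python
# Hypothetical aggregator: collect per-worker "finished loading" reports over
# multiple steps and only report a request to the scheduler once all
# world_size workers have reported, taking the minimum loaded-token count.
class FinishedLoadingAggregator:
    def __init__(self, world_size: int):
        self.world_size = world_size
        self._reports: dict[str, list[int]] = {}  # request_id -> per-worker counts

    def add_worker_report(self, worker_dict: dict[str, int]) -> dict[str, int]:
        """Merge one worker's report; return requests whose loads are now
        confirmed by every worker."""
        done: dict[str, int] = {}
        for req_id, num_tokens in worker_dict.items():
            counts = self._reports.setdefault(req_id, [])
            counts.append(num_tokens)
            if len(counts) == self.world_size:
                # Every rank has reported; be conservative and take the minimum.
                done[req_id] = min(counts)
                del self._reports[req_id]
        return done
```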
if num_actual_load_tokens <= 0 and hasattr(self.connector,
                                           "add_failure_request"):
    self.connector.add_failure_request(request)
    return True
In some cases, when a request is rescheduled with 'num_computed_tokens = 0', a local cache hit may occur, which can lead to a failure during 'save_new_computed_blocks' (see: https://github.com/vllm-project/vllm/blob/2bb7358d467e00b3a0472516d32ef99cc75d456f/vllm/v1/core/sched/scheduler.py#L1013).
(block_ids, ) = self.kv_cache_manager.get_block_ids(request.request_id)
return self.connector.request_finished(request, block_ids)

def _update_actual_load_token_num_from_remote_kv(self,
This method should likely also handle removing the request from finished_recving_kv_req_ids
if num_actual_load_tokens <= 0 and hasattr(self.connector,
                                           "add_failure_request"):
    self.connector.add_failure_request(request)
This code assumes the presence of a custom method (add_failure_request) that is not part of the official KVConnectorBase_V1 interface. Calling it from core logic breaks abstraction and risks incompatibility. If this functionality is necessary, it should be formalized as part of the KVConnectorBase_V1 interface.
Arguably, the add_failure_request method is not necessary in the core interface. Failure handling can be delegated to a custom connector implementation that wraps the actual connector and provides the desired behavior, keeping the core KVConnectorBase_V1 interface clean and minimal.
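A minimal sketch of the wrapper idea described above (the class and attribute names are hypothetical, not taken from the PR):

```python
# Hypothetical wrapper connector: failure bookkeeping lives here instead of in
# the KVConnectorBase_V1 interface, and everything else is delegated to the
# wrapped (real) connector.
class FailureTrackingConnector:
    def __init__(self, inner_connector):
        self._inner = inner_connector
        self.failed_request_ids: set[str] = set()

    def add_failure_request(self, request) -> None:
        # Record that this request's remote KV load failed so it can be
        # recomputed locally later.
        self.failed_request_ids.add(request.request_id)

    def __getattr__(self, name):
        # Delegate every other connector method to the real implementation.
        return getattr(self._inner, name)
```

This would keep the failure-handling hook out of the core interface while still giving the scheduler-side code a single object to call.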
I think we just lacked some communication and we had fewer eyes on it over the weekend.
Hi @sdavidbd @NickLucche,

Thank you for the reminder. I apologize for not being fully aware of the PR merging guidelines. I'll make sure to reference any ongoing work in future PRs and follow the proper process moving forward. I appreciate your understanding and will be more mindful of this in the future.

The primary issue we're currently focused on is exception handling in asynchronous loading scenarios. As technology evolves, we anticipate that asynchronously loading the KV cache from remote storage will become a common operation for connectors. Therefore, our goal is to handle exceptions for these asynchronous loading scenarios.

When comparing different approaches, we found that while the block-based solution offers more comprehensive coverage, it is also significantly more complex to implement. In practice, KV loading exceptions are already rare, and the scenario where an intermediate block fails to load while subsequent blocks succeed within a continuous request is even less common (especially given that eviction policies like LRU are typically in place). Additionally, to fully leverage non-contiguous blocks, further adjustments to the attention metadata would be required to enable segmented execution of flash attention. However, whether these changes would yield significant performance improvements remains to be validated.

In contrast, the request-based solution is more lightweight and requires fewer modifications to vLLM's existing scheduling logic. Rather than overcomplicating the design, perhaps the simplest solution that fits the need is the best one.

Tagging @youkaichao @KuntaiDu @ApostaC for visibility. I'd love to hear your input on how we can align our efforts and work together to move this forward.
Hi @liuyumoye — thanks for your response, I appreciate your openness to align going forward.

Regarding the block-based failure recovery approach in #19330, I wanted to briefly highlight some of the key motivations behind that direction — beyond enabling future reuse of non-contiguous blocks (which I believe is both feasible and increasingly important, especially for RAG-style and cache-similarity-based use cases). There are also immediate benefits:

The longer-term goals are also quite tangible:

If you haven't had a chance yet, I'd recommend catching up on the discussions in #19330 — many of these tradeoffs were discussed there in more depth. Looking forward to aligning and building a unified solution!
num_actual_load_tokens = random.randint(0, len(hit_tokens))
return num_actual_load_tokens

def get_finished_loading(self) -> dict[str, int]:
It has a different interface from KVConnectorBase_V1.
Thanks for the reminder. As a follow-up, we intend to build a RandomDropStorageConnector in #22075 that performs an actual load and then discards the tokens.
Hi @sdavidbd - Thank you for explaining the highlights of the block-based strategy. I have a few questions and would like to discuss them further:

Looking forward to collaborating on a unified solution that addresses these challenges effectively!
Let me know if you see any gaps or have a different take on these cases.
I'm glad we agree that such scenarios are rare, so I believe a lightweight solution should be adopted to handle these errors in order to avoid introducing too many modifications to the vLLM system.
Purpose
This PR implements an exception handling mechanism for cases where loading the KV cache from remote storage fails.
The key point of this scheme is that the KV cache blocks are allocated in advance, the KV cache is loaded from the remote store, and, once the load completes, the request is scheduled for execution according to the actual number of loaded tokens.
We reused the process and some interfaces of the Nixl connector to minimize additional modifications to vLLM. Naturally, this mechanism is also applicable to the Nixl Connector, requiring only minimal adaptation.
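Conceptually, the flow looks roughly like the following simplified sketch (the function and variable names are assumptions for illustration, not the merged scheduler code):

```python
# Hypothetical flow of the request-based exception handling:
# 1) blocks are allocated up front, 2) the connector loads from the remote
# store, 3) the scheduler trusts only the tokens that actually arrived.
def schedule_after_remote_load(request, finished_loading: dict[str, int]):
    # Number of tokens the connector reports as successfully loaded;
    # 0 (or absent) means the remote load failed entirely.
    num_actual_load_tokens = finished_loading.get(request.request_id, 0)

    if num_actual_load_tokens <= 0:
        # Remote load failed: fall back to recomputing the prefill locally.
        request.num_computed_tokens = 0
    else:
        # Partial or full success: resume from what was really loaded.
        request.num_computed_tokens = num_actual_load_tokens

    return request
```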
Test Plan
In the folder tests/v1/kv_connector/kv_load_exception_handling, I have written a mock async_offload_connector that simulates the scenario where the actual number of loaded KV tokens is less than the prompt length.

Test Result
You can use the following command to start a vLLM service that includes the mock connector, and then send any request to this service:
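The exact startup command is not reproduced here. Once the service is up, sending "any request" can be as simple as the following sketch against vLLM's OpenAI-compatible endpoint (the host, port, and model name are placeholders for whatever the mock-connector service was started with):

```python
# Hypothetical client request against a locally running vLLM server; the URL
# and model name must match the actual deployment.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "your-model-name",
        "prompt": "The capital of France is",
        "max_tokens": 16,
    },
)
print(resp.json())
```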