
Conversation

@liuyumoye (Contributor) commented Jul 24, 2025

Purpose

This PR implements an exception handling mechanism for cases where loading the KV cache from remote storage fails.

The key point of this scheme is that KV cache blocks are allocated in advance for loading from the remote KV store; we wait for the load to complete, and then schedule the request for execution according to the actual number of loaded tokens.

We reused the process and some interfaces of the Nixl connector to minimize additional modifications to vLLM. Naturally, this mechanism is also applicable to the Nixl Connector, requiring only minimal adaptation.
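
To make the flow concrete, below is a minimal sketch of the scheduler-side idea only; it is not the actual PR diff. The dict of actually loaded token counts (called finish_loading_dict in the review below) is real, but the request fields and helper shown here are simplified assumptions.

from dataclasses import dataclass

@dataclass
class Request:  # simplified stand-in for vLLM's request object
    request_id: str
    num_computed_tokens: int  # optimistically includes the remote KV prefix

def apply_finished_loading(waiting: list[Request],
                           requests: dict[str, Request],
                           finish_loading_dict: dict[str, int]) -> None:
    # finish_loading_dict maps request_id -> tokens actually loaded remotely.
    for req_id, num_loaded in finish_loading_dict.items():
        req = requests[req_id]
        # Blocks were pre-allocated for the full remote prefix; clamp the
        # computed-token count to what really arrived so the missing suffix
        # is recomputed instead of being treated as cached.
        req.num_computed_tokens = min(req.num_computed_tokens, num_loaded)
        waiting.append(req)  # make the request schedulable again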

Test Plan

In the folder tests/v1/kv_connector/kv_load_exception_handling, I have written a mock async_offload_connector that simulates the scenario where the number of KV tokens actually loaded is less than the prompt length.
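
For reference, a rough sketch of what such a mock connector could look like, assuming the get_finished_loading interface discussed in the review below; names and structure here are illustrative, and the real implementation lives in the folder above.

import random

class AsyncOffloadConnector:  # mock: simulates partial remote KV loads
    def __init__(self) -> None:
        self._finished: dict[str, int] = {}

    def start_load_kv(self, request_id: str, num_hit_tokens: int) -> None:
        # Pretend only a random prefix of the hit tokens was actually loaded.
        self._finished[request_id] = random.randint(0, num_hit_tokens)

    def get_finished_loading(self) -> dict[str, int]:
        # Report request_id -> number of tokens actually loaded, then reset.
        finished, self._finished = self._finished, {}
        return finished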

Test Result

You can use the following commands to start a vLLM service that includes the mock connector, then send any request to the service:

cd tests/v1/kv_connector/kv_load_exception_handling
bash test.sh

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces an exception handling mechanism for failures when loading the KV cache from a remote store. The approach involves pre-allocating KV cache blocks, attempting to load from the remote store, and then scheduling requests based on the number of tokens that were successfully loaded.

The changes are well-structured, primarily involving plumbing a new finish_loading_dict from the worker to the scheduler to communicate the load status. The core logic in the scheduler and model runner appears sound. My review has identified a couple of issues in the new mock connector (async_offload_connector.py) used for testing, which include type mismatches and an incorrect return type. Addressing these will improve the correctness and robustness of the test suite.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@liuyumoye force-pushed the exception_handling branch from b3862a0 to dfa2dbc on July 25, 2025 02:24
@mergify bot commented Jul 25, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @liuyumoye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Jul 25, 2025
@liuyumoye force-pushed the exception_handling branch from dfa2dbc to 286057f on July 25, 2025 03:00
@mergify bot removed the needs-rebase label on Jul 25, 2025
@liuyumoye force-pushed the exception_handling branch 7 times, most recently from ebf30e2 to 609b7c9, on July 25, 2025 11:33
@ApostaC (Collaborator) left a comment


I like this PR! It looks clean and the changes are small. Just left a few quick comments.
Also, fault tolerance is a very important topic, so thanks for contributing!

Collaborator

nit: add a comment here to briefly describe what the keys and values are

Collaborator

Same here

Comment on lines 57 to 62
Collaborator

I like the design that it doesn't need to modify any of the connector API. But a minor concern is that we don't have a docstring describing get_finished_loading's expected behavior.

One proposal is to add this function to the KV connector's base class and give it a dummy implementation like return {}. In that case, we can also skip the hasattr check here.
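
A minimal sketch of that proposal, assuming KVConnectorBase_V1 is the base class in question; the default body and docstring wording are illustrative only.

class KVConnectorBase_V1:  # simplified stand-in for the real base class
    def get_finished_loading(self) -> dict[str, int]:
        """Return request_id -> number of tokens actually loaded from the
        remote KV store. Connectors that never load asynchronously (or never
        fail) can keep this default, which reports nothing."""
        return {}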

Contributor Author

Thanks for the detailed feedback! I agree that adding a docstring for get_finished_loading would make the behavior clearer, and moving it to the base class with a default implementation is a great idea to simplify the code. Appreciate your suggestions! 😊

Collaborator

Maybe change the name to "RandomDropConnector" or some other descriptive name, and add a docstring noting that this is for fault-tolerance testing?

Contributor Author

Great suggestion! I've renamed the class to RandomDropConnector and updated the docstring to explicitly mention that it's designed for fault tolerance testing. This should make the purpose of the class much clearer. Thanks for pointing this out! 😊

Collaborator

Same here

@liuyumoye force-pushed the exception_handling branch from 609b7c9 to 01659a7 on July 26, 2025 05:23
@liuyumoye (Contributor Author) commented

Thanks for these suggestions! I've added some comments and moved get_finished_loading to the base class. Let me know if there's anything else that needs improvement! 😊

@liuyumoye force-pushed the exception_handling branch from 01659a7 to d41ca56 on July 26, 2025 06:05
@KuntaiDu (Collaborator) left a comment


LGTM!

@KuntaiDu enabled auto-merge (squash) on July 26, 2025 21:22
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jul 26, 2025
Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
@KuntaiDu enabled auto-merge (squash) on July 28, 2025 02:21
@vllm-bot merged commit 15a72ac into vllm-project:main on Jul 28, 2025
68 of 70 checks passed
@sdavidbd (Contributor) commented

Hi @liuyumoye, @ApostaC, @KuntaiDu,

Please note that this PR was merged without any reference to the ongoing work in #19330, which has been open for several weeks and discussed extensively with both community members and core maintainers. Given that some of the contributors and reviewers here were aware of that effort, it would have been helpful to coordinate or at least cross-reference to ensure alignment.

This PR appears to overlap significantly with the goals and scope of #19330, which introduces KV load failure recovery — a feature that has been carefully designed, reviewed, and extended with a clear roadmap.

Moreover, this PR addresses only asynchronous KV loading, lacks test coverage, and moves us further away from the planned block-based connector API. In contrast, #19330 provides support for both sync and async KV loading, and tensor/pipeline parallelism, along with a more comprehensive and validated test setup.

To avoid duplication, ensure correctness, and converge on a unified and robust solution, I suggest reverting this PR and continuing the work within the scope of #19330. That PR was ready to merge until just a few hours ago and can move forward immediately once this is reverted.

Tagging @njhill @youkaichao @robertgshaw2-redhat @NickLucche @orozery for visibility. I’d appreciate your thoughts on how we can align and move forward collaboratively.

Contributor

This example only runs an online serving instance without specifying input or expected output. While it can verify that requests with load failures complete, it does not validate the correctness of the output.

Collaborator

yeah this looks unnecessary, it should probably belong in examples or even better just in the docs

finished_set.add(req_id)
del remaining_count_dict[req_id]

def update_finished_load_dict(worker_finished_loading_dict: dict[str,
Contributor

This aggregation logic assumes that all outputs arrive in the same step. If that’s not the case, it may prematurely propagate the first worker’s output back to the scheduler. In particular, it doesn’t appear to support pipeline-parallel setups.
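
One possible way to make the aggregation robust, sketched below under the assumption that each of world_size workers reports a per-request token count: only forward a request to the scheduler once every worker has reported it, taking the minimum across workers. The function name mirrors the excerpt above, but the signature and logic are illustrative rather than the PR's code.

def update_finished_load_dict(
    pending: dict[str, tuple[int, int]],           # req_id -> (min_tokens, reports)
    worker_finished_loading_dict: dict[str, int],  # one worker's report
    world_size: int,
) -> dict[str, int]:
    ready: dict[str, int] = {}
    for req_id, num_tokens in worker_finished_loading_dict.items():
        min_tokens, reports = pending.get(req_id, (num_tokens, 0))
        pending[req_id] = (min(min_tokens, num_tokens), reports + 1)
        if pending[req_id][1] == world_size:       # all workers have reported
            ready[req_id] = pending.pop(req_id)[0]
    return ready  # safe to propagate to the scheduler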

if num_actual_load_tokens <= 0 and hasattr(self.connector,
                                           "add_failure_request"):
    self.connector.add_failure_request(request)
    return True
Contributor

In some cases, when a request is rescheduled with 'num_computed_tokens = 0', a local cache hit may occur, which can lead to a failure during 'save_new_computed_blocks' (see: https://github.com/vllm-project/vllm/blob/2bb7358d467e00b3a0472516d32ef99cc75d456f/vllm/v1/core/sched/scheduler.py#L1013).

(block_ids, ) = self.kv_cache_manager.get_block_ids(request.request_id)
return self.connector.request_finished(request, block_ids)

def _update_actual_load_token_num_from_remote_kv(self,
Contributor

This method should likely also handle removing the request from finished_recving_kv_req_ids

Comment on lines +1107 to +1109
if num_actual_load_tokens <= 0 and hasattr(self.connector,
                                           "add_failure_request"):
    self.connector.add_failure_request(request)
Contributor

This code assumes the presence of a custom method (add_failure_request) that is not part of the official KVConnectorBase_V1 interface. Calling it from core logic breaks abstraction and risks incompatibility. If this functionality is necessary, it should be formalized as part of the KVConnectorBase_V1 interface.

Arguably, the add_failure_request method is not necessary in the core interface. Failure handling can be delegated to a custom connector implementation that wraps the actual connector and provides the desired behavior, keeping the core KVConnectorBase_V1 interface clean and minimal.
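
A rough sketch of that wrapper idea, assuming a hypothetical FailureHandlingConnector that delegates to the real connector; add_failure_request and the request_id attribute are assumptions for illustration, not part of the official KVConnectorBase_V1 API.

class FailureHandlingConnector:
    def __init__(self, inner) -> None:
        self.inner = inner                     # the real connector
        self.failed_request_ids: set[str] = set()

    def add_failure_request(self, request) -> None:
        # Track requests whose remote load produced no usable tokens.
        self.failed_request_ids.add(request.request_id)

    def get_finished_loading(self) -> dict[str, int]:
        finished = self.inner.get_finished_loading()
        # Report 0 loaded tokens for failed requests so the scheduler
        # recomputes them from scratch.
        for req_id in self.failed_request_ids:
            finished.setdefault(req_id, 0)
        self.failed_request_ids.clear()
        return finished

    def __getattr__(self, name):
        # Delegate everything else to the wrapped connector.
        return getattr(self.inner, name)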

@NickLucche (Collaborator) commented Jul 28, 2025

I think we just lacked some communication, and we had fewer eyes on it over the weekend.
I am also in favor of reverting to a clean state and addressing any concerns raised by this work over in #19330.
We can also probably reuse the unit test we have here as an additional case.
Regardless, thanks a lot for the work done in this PR.

@liuyumoye (Contributor Author) commented Jul 28, 2025

Hi @sdavidbd @NickLucche,

Thank you for the reminder. I apologize for not being fully aware of the PR merging guidelines. I’ll make sure to reference any ongoing work in future PRs and follow the proper process moving forward. I appreciate your understanding and will be more mindful of this in the future.

The primary issue we’re currently focused on is exception handling in asynchronous loading scenarios. As technology evolves, we anticipate that asynchronously loading the KV Cache from remote storage will become a common operation for connectors. Therefore, our goal is to handle exceptions for these asynchronous loading scenarios.

When comparing different approaches, we found that while the block-based solution offers more comprehensive coverage, it is also significantly more complex to implement. In practice, KV loading exceptions are already rare, and the scenario where an intermediate block fails to load while subsequent blocks succeed within a continuous request is even less common (especially given that eviction policies like LRU are typically in place). Additionally, to fully leverage non-contiguous blocks, further adjustments to the attention metadata would be required to enable segmented execution of flash attention. However, whether these changes would yield significant performance improvements remains to be validated.

In contrast, the request-based solution is more lightweight and requires fewer modifications to vLLM's existing scheduling logic. Rather than overcomplicating the design, perhaps the most suitable solution is simply the one that best fits the problem.

Tagging @youkaichao @KuntaiDu @ApostaC for visibility. I’d love to hear your input on how we can align our efforts and work together to move this forward.

KuntaiDu added a commit that referenced this pull request Jul 28, 2025
…re (#21534)"

This reverts commit 15a72ac.

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
@sdavidbd (Contributor) commented

Hi @liuyumoye — thanks for your response, I appreciate your openness to align going forward.

Regarding the block-based failure recovery approach in #19330, I wanted to briefly highlight some of the key motivations behind that direction — beyond enabling future reuse of non-contiguous blocks (which I believe is both feasible and increasingly important, especially for RAG-style and cache-similarity-based use cases).

There are also immediate benefits:

  • Connector visibility: The connector naturally sees which blocks fail but typically lacks the context to map failures back to specific requests. Avoiding per-request tracking keeps connector implementations simpler and more decoupled.

  • One-to-many dependency: A single failed block can affect multiple requests within a batch. Reporting failed blocks allows the scheduler — which already maintains request-to-block mappings — to identify and reschedule any affected request.

  • Granularity of recovery: In many cases, only a small subset of a request’s blocks may fail. Discarding the entire request would be wasteful, especially if most of the data has been successfully loaded. Block-level reporting enables fine-grained recovery.

The longer-term goals are also quite tangible:
We're actively working on enhancing KV reuse beyond prefix matching, and this design directly supports that direction. It also aligns with other community-driven improvements like the block-based CPU offloading connector (see #19854).

If you haven’t had a chance yet, I’d recommend catching up on the discussions in #19330 — many of these tradeoffs were discussed there in more depth.

Looking forward to aligning and building a unified solution!

num_actual_load_tokens = random.randint(0, len(hit_tokens))
return num_actual_load_tokens

def get_finished_loading(self) -> dict[str, int]:
Collaborator

It has a different interface from KVConnectorBase_V1.

Contributor Author

Thanks for the reminder. As a follow-up, we intend to build a RandomDropStorageConnector that performs an actual load and then discards tokens, in #22075.

@liuyumoye (Contributor Author) commented

Hi @sdavidbd - Thank you for explaining the highlights of the block-based strategy.

I have a few questions and would like to discuss them further:

  1. In the scenario of synchronous execution with disaggregated prefill, if a direct layerwise transfer approach is adopted between the Prefill instance and the Decode instance, the current rescheduling method might result in an incorrect KV cache being passed to Decode.
  2. In the asynchronous scenario, we believe the situation where "reading from remote storage fails, and only a small subset of a request's blocks fails" is relatively rare, or perhaps tied to the eviction strategy of the KV cache storage. However, I noticed that you consider this a common occurrence. Could you elaborate on these scenarios? Perhaps the contexts we're dealing with differ.

Looking forward to collaborating on a unified solution that addresses these challenges effectively!

@sdavidbd (Contributor) commented

  1. On the "direct layerwise transfer" setup — if you're referring to streaming KV blocks from Prefill instance to Decode instance as they're computed, then yes, there’s a risk of invalid data being transferred when computation depends on externally cached blocks that failed to load. The proposed recovery mechanism addresses this by rescheduling those blocks for recomputation. As for retransferring, I think the transfer connector should handle it. If it decides which blocks to send based on num_computed_tokens, then once the request is updated, the necessary blocks should be re-sent automatically.

  2. I agree that loading errors should generally be rare. My point was that when failures do happen — for example, due to a temporary network issue — they can impact any block, not necessarily an entire request. This can also happen with eviction strategies or tiered storage. The block-based approach helps by tracking failures at a more granular level, making it possible to recover more efficiently without discarding unaffected parts.

Let me know if you see any gaps or have a different take on these cases.

@liuyumoye (Contributor Author) commented

I'm glad we agree that such scenarios are rare, so I believe a lightweight solution should be adopted to handle these errors, in order to avoid introducing too many modifications to the vLLM system.
Additionally, thank you for your review of this PR; your feedback has been very helpful. I've made improvements based on your suggestions and submitted a new PR, #22075. Please feel free to share any further thoughts!

HsChen-sys pushed a commit to HsChen-sys/vllm that referenced this pull request Aug 1, 2025
…-project#21534)

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
…-project#21534)

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
Signed-off-by: x22x22 <wadeking@qq.com>
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
…-project#21534)

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
…-project#21534)

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
…-project#21534)

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
…-project#21534)

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
…-project#21534)

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
…-project#21534)

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
…-project#21534)

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1

7 participants