
[RFC]: Add a cache hit threshold to handle Preemptions in PD-Disaggregation and enable lightweight powerful P/D implementations #24256

@kfirwolfson

Motivation.

This RFC introduces cache hit–based admission control for requests, motivated by PD-Disaggregation and preemption handling scenarios.
Related llm-d ticket and documentation: here.

Edit: updated Sep 12, 2025 - added Preemption handling use-case

🧩 Summary

This RFC proposes adding an optional per-request field and an optional global configuration to vLLM, which set a minimum KV-Cache hit rate for handling requests. Requests whose cache hit percentage falls below the given threshold are rejected with a new, dedicated finish reason.

The cache hit rate accounts for external and previously offloaded KV-Cache obtained via the KVConnector, not just the local automatic prefix cache (APC).

This feature is useful in several use-cases in PD-Disaggregated deployments:

  1. Request Preemption: In P/D systems, a preempted request on a Decode instance ("Decoder") will redo its prefill on the Decoder itself if its cache was not offloaded, which can degrade Decoder performance and cascade into instance lockups. This feature allows an external inference management system such as llm-d, Dynamo, Production Stack or AIBrix to avoid prefill on the Decoder: vLLM returns the request to the Router or external worker, which can perform P/D disaggregation again and send the prefill work to the Prefill instance ("Prefiller").

  2. Lightweight powerful P/D implementation: In Decode-First scenarios, this allows the lookup (and thus the disaggregation decision) to happen in vLLM itself. The goal is to provide a simple way for vLLM to perform a single lookup and notify the Router when the hit rate is low.

The feature can also be used to implement a simple cache-aware routing, as detailed in "Other Scenarios" below.

Note: we use the general term Router for an entity external to vLLM which orchestrates P/D disaggregation and sends the request to the Prefill and Decode instances. Sometimes a Decode-Gateway sits behind the Router, attached to the Decode vLLM instance, and performs the disaggregation logic; examples are the Routing Sidecar in llm-d and the DecodeWorker in Dynamo. We use the term Router below even when the logic would reside in the Decode-Gateway in such systems.

1. Request Preemption

🧠 Background

Request Preemption in vLLM today evicts the preempted request's KV-cache blocks so they can be reallocated to other RUNNING requests. The preempted request is later rescheduled (ahead of other WAITING requests), but its local cache has already been discarded.
This means the full prefill work is redone inside the Decode instance: both the prefill originally done on the Prefiller and all the (possibly many) output tokens already computed on the Decoder before preemption occurred. Field tests performed by the llm-d team showed that this scenario leads to Decode instances spending their time executing prefills and eventually locking up due to resource exhaustion.
Ideally, all this prefill work would be done on the Prefiller, but the external Router orchestrating P/D has no control over vLLM behavior once the Decode instance has received the request.
The feature described in this RFC gives that control back to the Router: the Router sets a cache hit-rate threshold for the request, below which the Decoder rejects the prefill work and returns the request to the calling Router, which in turn can perform P/D disaggregation as originally intended.

Note that there may still be some cache hit in local cache for preempted requests due to a shared context prefix with other requests, such as the system prompt.

Using external offloaded cache, especially shared storage as described later in this RFC, can mitigate the preemption problem described above, since tokens can be offloaded during prefill or decode and the Decoder can "pick up where it left off" when the request is rescheduled. Note this requires a KVConnector which supports background "sync" offloading in parallel to token compute. However, if offloading was not configured, or the storage is not shared between Prefill and Decode instances, the cache is most likely lost and we are back to long prefill jobs on the Decoder.

Note that problematic scenarios are more common with workloads that produce longer outputs, which increase the likelihood of request preemption on the Decode instance: as more tokens are generated, more KV-cache blocks are required. With vLLM batching requests in parallel, cache memory may eventually be exhausted, triggering preemption. Long outputs compound the problem: the longer the outputs, the more prefill work the Decoder must redo after preemption discards the cache.
Because Prefiller instances do not perform decode work, preemption is not expected. In any case, re-executing prefills is acceptable for Prefillers.

📝 Details

As detailed below under "Proposed Change", this feature lets the caller optionally attach a cache hit threshold to a request. The threshold can be chosen dynamically per request, which enables the orchestration engine to tune exactly how much prefill work it allows the Decoder to perform.
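For illustration only, the threshold could track the Decoder's current load; the heuristic and the names below (choose_threshold, decoder_load) are hypothetical and not part of this proposal:

```python
# Illustrative only: a hypothetical router-side heuristic for choosing the
# per-request threshold based on how busy the Decode instance currently is.
def choose_threshold(decoder_load: float) -> float:
    """Map Decoder load in [0, 1] to a cache-hit threshold in [0, 1].

    A lightly loaded Decoder can absorb some prefill work (low threshold);
    a heavily loaded one should reject almost any prefill (high threshold).
    """
    if decoder_load < 0.5:
        return 0.0   # plenty of headroom: accept any request
    if decoder_load < 0.8:
        return 0.8   # moderate load: only mostly-cached requests
    return 0.95      # near saturation: essentially no prefill allowed
```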

⚡ Optimization (phase 2)

We suggest the following optimization for preempted requests: avoid recalculation on the Decoder of output tokens already computed before preemption. This is achieved by performing prefill on those output tokens; the prefill should be done on the Prefiller, which receives them as part of the prompt.
Suggested implementation (a router-side sketch appears below):

  • When rejecting the request due to cache threshold, vLLM will also return the output tokens back to the Router.
  • The Router will then clone a new request based on the original request, with the following changes:
    • The prompt should include the original prompt appended with the output tokens
    • The number of output tokens should be subtracted from the max_tokens and max_completion_tokens fields
  • The new request will be treated in a similar way to the original request, sending it to the Prefiller to perform prefill and Decoder to generate new tokens.
  • Note the format of the tokens returned in the response should adhere to the original request, e.g. take into account the request's return_token_ids and return_tokens_as_token_ids fields.

This optimization is optional and may be implemented in a follow-up PR once the core threshold mechanism is merged.
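A minimal router-side sketch of the cloning step, assuming the rejection response exposes the already generated output token IDs and the request carries a token-ID prompt; all field and function names here are hypothetical:

```python
# Hypothetical router-side helper: rebuild a preempted request that the Decoder
# rejected due to the cache threshold, so the Prefiller redoes the work instead.
def clone_rejected_request(original: dict, output_token_ids: list[int]) -> dict:
    new_request = dict(original)

    # Append the already generated output tokens to the prompt so the Prefiller
    # treats them as prompt (prefill) tokens. Assumes a token-ID prompt is used.
    new_request["prompt_token_ids"] = (
        original["prompt_token_ids"] + output_token_ids
    )

    # Shrink the generation budget by the number of tokens already produced.
    for field in ("max_tokens", "max_completion_tokens"):
        if original.get(field) is not None:
            new_request[field] = max(0, original[field] - len(output_token_ids))

    return new_request
```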

2. Lightweight powerful P/D

🧠 Background

Some PD-Disaggregation systems rely on cache-aware routers with global indexes which are constantly updated using dedicated “KV Events”. Global indexes can become complex and expensive as the scale of the cache system grows to petabyte level storage.

An important alternative is to do the indexing at the storage layer. Accessing this index can happen either outside vLLM, e.g. via the KV Cache Scorers in llm-d or the KV Indexer in Dynamo, or inside vLLM as part of the KVConnector's get_num_new_matched_tokens() call. In many cases the latter is simpler and more performant, as it does not require the router to access the storage and it reduces the number of lookups per request (vLLM always performs another lookup via the KVConnector API anyway).

📝 Details

The diagram below shows an example of a simple PD-D system, which employs the "Decode-First" approach: the decision of whether to perform disaggregation is taken after the "Decode vLLM" instance performs the cache lookup. The KV Cache storage can be of any sort: local or external, shared or independent for each instance, etc. The logic for lookup and access into the cache is managed as usual by vLLM's KVConnector.

In common workloads such as Multi-turn conversations, where cache hit ratio is typically very high (above 90%), the flow will usually avoid disaggregation, obtain tokens from cache and perform both prefill and decode on the Decode instance. Since the cache hit is high, the prefill work is very small, e.g. below 10% of the tokens.

As described in the diagram, requests are sent to the Decode vLLM instance (step 1) with an optional cache hit ratio field, set to the disaggregation threshold the system has chosen (and can dynamically change over time). The Decode instance performs a lookup via the KVConnector and checks the cache hit ratio (step 2).

In the case of a cache hit above the threshold (step A3), the Scheduler continues processing the request (loading cache, computing the response) and the output is sent back up the stack (step A4). If the cache hit ratio is lower, the instance notifies the calling Router (or Decode-Gateway). The Router then continues with the regular PD-Disaggregation flow (steps B4, B5) and the output is sent to the user when the decode work completes (step B6).

Note that the point at which the P and D instances are chosen is orthogonal and can be done at any time, e.g. before the request arrives at the Decode-Gateway (before step 1) or only when needed (step B3).

[Diagram: Decode-First P/D disaggregation flow with in-vLLM cache lookup and threshold check]
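For concreteness, a sketch of the resulting control flow on the Router / Decode-Gateway side, assuming a rejection is signaled via a cache_threshold finish reason as proposed below; send_to_decoder and run_pd_disaggregation stand in for the system's own transport and orchestration logic:

```python
from typing import Callable

# Hypothetical Decode-First control flow on the Router / Decode-Gateway.
def handle_request(
    request: dict,
    threshold: float,
    send_to_decoder: Callable[[dict], dict],          # placeholder transport call
    run_pd_disaggregation: Callable[[dict], dict],    # placeholder P/D flow
) -> dict:
    # Step 1: send the request straight to the Decode instance with a threshold.
    request["cache_hit_threshold"] = threshold
    response = send_to_decoder(request)

    if response["choices"][0]["finish_reason"] != "cache_threshold":
        # Steps A3-A4: cache hit was high enough; the Decoder handled both the
        # (small) prefill and the decode work.
        return response

    # Steps B4-B6: cache hit too low; fall back to regular P/D disaggregation,
    # i.e. prefill on a Prefiller and decode on the Decoder.
    return run_pd_disaggregation(request)
```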

🌐 Other Scenarios

The cache hit ratio feature can be useful in other scenarios. For instance, it allows implementing simple cache-aware routing without cache placement knowledge in the router. Consider a system with local KV Cache in VRAM, and possibly CPU DRAM and local SSDs as well, on a set of K vLLM instances. One implementation of KV-Cache-aware routing is as simple as having the router send the request to all K instances in parallel with an expected cache hit rate; only an instance which actually has the tokens cached will handle the request. If all instances report that they do not have enough cache, the router sends the request with a threshold of 0 to a single instance it chooses (by any heuristic), which then handles the request. This simplistic design is not as robust as a full-scale system, and can of course be extended with error handling and other layers, but it exemplifies the usefulness of having this easy-to-implement flexibility in vLLM.
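A minimal sketch of that fan-out, assuming the proposed per-request threshold and a cache_threshold finish reason on rejection; query_instance is a placeholder for the router's HTTP client:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical cache-aware routing by fan-out: ask all K instances to handle the
# request only if they already hold (most of) its KV-cache.
def route_by_cache(
    request: dict,
    instance_urls: list[str],
    query_instance: Callable[[str, dict], dict],   # placeholder HTTP client
    threshold: float = 0.8,
) -> dict:
    probe = {**request, "cache_hit_threshold": threshold}
    with ThreadPoolExecutor(max_workers=len(instance_urls)) as pool:
        responses = list(pool.map(lambda url: query_instance(url, probe),
                                  instance_urls))

    # An instance that had enough cache has already produced the full answer.
    for response in responses:
        if response["choices"][0]["finish_reason"] != "cache_threshold":
            return response

    # Nobody had enough cache: pick one instance (here, simply the first) and
    # force it to handle the request by dropping the threshold to 0.
    return query_instance(instance_urls[0],
                          {**request, "cache_hit_threshold": 0.0})
```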

Proposed Change.

🔧 API Changes

  • New global config: --global-cache-hit-threshold, float ∈ [0.0, 1.0], default 0.0
  • New per-request field: cache_hit_threshold, float ∈ [0.0, 1.0]
  • The request-level value overrides the global one. Having both gives more flexibility: static configuration plus dynamic overrides by routers.
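For example, a client (or router) could set the proposed per-request field through the OpenAI-compatible API's extra_body passthrough; the field itself and the cache_threshold finish reason are the additions proposed here, and the model name is just an example:

```python
from openai import OpenAI

# Server side, the proposed global default could be set at startup, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --global-cache-hit-threshold 0.0
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# `cache_hit_threshold` is the per-request field proposed in this RFC; it goes
# through `extra_body` because it is not part of the OpenAI schema.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize our previous discussion."}],
    extra_body={"cache_hit_threshold": 0.8},
)

# With Option A below, a rejection still returns HTTP 200 and is signaled
# through the new finish reason.
if response.choices[0].finish_reason == "cache_threshold":
    ...  # fall back to the regular P/D disaggregation flow (router-specific)
```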

🔁 Behavioral changes

  • When: the Scheduler computes the hit rate after the call to get_num_new_matched_tokens(), before request scheduling or cache block allocation, ensuring lightweight rejection.
  • How: hit rate = (local + external) cached tokens divided by prompt length.
  • If hit rate < threshold, the request is rejected.
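A minimal sketch of this check, assuming the Scheduler has the local prefix-cache hit count and the result of get_num_new_matched_tokens() at hand; the helper is illustrative, not actual scheduler code:

```python
# Illustrative scheduler-side check, performed after the KVConnector lookup and
# before any cache block allocation or scheduling of the request.
def should_reject(
    num_prompt_tokens: int,
    num_local_cached_tokens: int,      # hit in the local prefix cache (APC)
    num_external_cached_tokens: int,   # reported by get_num_new_matched_tokens()
    threshold: float,                  # per-request value, else the global one
) -> bool:
    if threshold <= 0.0 or num_prompt_tokens == 0:
        return False  # feature disabled (default), or nothing to prefill

    hit_rate = (num_local_cached_tokens + num_external_cached_tokens) / num_prompt_tokens
    return hit_rate < threshold
```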

🚫 Rejecting requests

Option A (preferred):

  • Continue returning HTTP 200 OK
  • Introduce a new finish_reason string: cache_threshold
  • This aligns with existing finish_reason semantics (stop, length, abort)
  • Returning HTTP 200 OK preserves OpenAI-API compatibility and enables returning partial token data (for example, from preempted requests).
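For illustration, a rejected chat completion under Option A might look roughly like the following (shown as a Python dict; the exact fields carrying partial output tokens for the preemption optimization are left open):

```python
# Hypothetical shape of a rejected chat completion under Option A.
rejected_response = {
    "id": "chatcmpl-123",                        # illustrative ID
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": ""},
            "finish_reason": "cache_threshold",  # new value proposed here
        }
    ],
}
```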

Option B (alternative):

  • Return an HTTP error (e.g. 422 Unprocessable Content / Entity) with a standard error object
  • This expresses the semantic rejection
  • Note: this option is not compatible with the proposed preemption optimization (returning computed output tokens in the rejection response), since error responses do not include choices or token payloads.

🔗 Compatibility

  • Default threshold is 0.0, so existing deployments see no behavioral change
  • API remains compatible with OpenAI-style clients; the only additions are the optional cache_hit_threshold field in the request and a new finish_reason value.

⚙️ Implementation Notes

The feature is expected to involve minimal changes limited to the Scheduler and request parsing layers.

  • Configuration flag added to scheduler_config.
  • Optional request field parsed via existing request model.
  • New finish_reason string constant appended to the enum in completion.py.
  • Unit tests should verify both rejection paths (global and per-request thresholds) and the new finish_reason.

No backward-incompatible interface or schema changes are introduced.

Feedback Period.

No response

CC List.

@robertgshaw2-redhat, @njhill, @tlrmchlsmth @KuntaiDu @YaoJiayi @ApostaC @vMaroon @orozery

Any Other Things.

No response
