Motivation
This RFC proposes a KV-Cache Interoperability API, covering standardized notification events (via KVEvents) and reproducible prefix-block hashing. These standards aim to support cross-system cache awareness, observability, and future tooling for indexing, routing, and diagnostics.
vLLM already ships with internal KVEvents contributed by the NVIDIA Dynamo team - that’s a strong foundation.
But as external systems aim for cache-aware inference, we need to treat these internal mechanisms as public contracts to support broader adoption and interop.
Goals
- KVEvents Internal API as a Public Contract

  The KVEvents schema is already well-defined in vLLM and used internally by the `KVCacheManager` for GPU cache events. It’s also being extended to CPU offloading via the `KVConnector` (see #19854).

  This RFC proposes formalizing KVEvents as the public contract for any component emitting or consuming KV-Cache lifecycle events - including external indexers, routers, and engines.
- Ensure Reproducible Block Hashing Across Languages

  Prefix cache block keys must be computed the same way across runtimes (e.g., Python, Go). This requires:
  - Canonical serialization (e.g., CBOR)
  - Consistent hashing algorithms (e.g., SHA256, xxHash)
  - Defined structure for input objects (e.g., token arrays, `extra_keys`)
  - Explicit rules for special cases like the `NONE_HASH` root
  - Alignment on security features such as per-request hash-salting

  Disclaimer: in the current KVEvents schema, token IDs are sent alongside their block hashes, which makes external indexing possible by mapping tokens -> independently computed hashes -> vLLM hashes. While this avoids introducing reproducible hashing and configuration syncs, it requires complex indexing and lookups, along with the networking overhead of passing the 32-bit token IDs in every event.
- Enable Language-Agnostic Interop

  Develop shared guidance and reference libraries in Python, Go, and other widely used languages. These utilities do not need to reside within vLLM, but should remain consistent with its specifications.
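To make the trade-off in the disclaimer under the second goal concrete, here is a rough sketch of the indirect approach an external indexer could take today: it hashes the token IDs carried in each event with its own scheme and keeps a translation table to the engine's hashes. Everything here (`local_block_key`, `TranslatingIndex`, the method names, the little-endian 32-bit packing) is hypothetical and not part of vLLM; engine hashes are modeled as plain Python `int` values.

```python
import hashlib
import struct
from typing import Dict, List, Optional


def local_block_key(parent_key: bytes, token_ids: List[int]) -> bytes:
    """Indexer-side key, intentionally unrelated to the engine's own hashing."""
    # Pack token IDs as unsigned 32-bit integers (assumed width).
    packed = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return hashlib.sha256(parent_key + packed).digest()


class TranslatingIndex:
    """Maps indexer-side keys to engine hashes using token IDs from events."""

    def __init__(self) -> None:
        self._local_to_engine: Dict[bytes, int] = {}
        self._engine_to_local: Dict[int, bytes] = {}

    def on_block_stored(
        self,
        engine_hash: int,
        parent_engine_hash: Optional[int],
        token_ids: List[int],
    ) -> None:
        # Resolve the parent's local key so chains line up with lookups.
        parent_local = b""
        if parent_engine_hash is not None:
            parent_local = self._engine_to_local.get(parent_engine_hash, b"")
        key = local_block_key(parent_local, token_ids)
        self._local_to_engine[key] = engine_hash
        self._engine_to_local[engine_hash] = key

    def lookup(self, prompt_tokens: List[int], block_size: int) -> List[int]:
        """Return engine hashes for the longest known prefix of full blocks."""
        hits: List[int] = []
        parent = b""
        for start in range(0, len(prompt_tokens) - block_size + 1, block_size):
            parent = local_block_key(parent, prompt_tokens[start:start + block_size])
            engine_hash = self._local_to_engine.get(parent)
            if engine_hash is None:
                break
            hits.append(engine_hash)
        return hits
```

Note the cost this implies: every stored-block event has to carry `block_size` token IDs (4 bytes each under the assumed packing), and the indexer maintains two maps per block. With reproducible hashing, the indexer could compute the engine's hash directly and the events could omit the tokens.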
Proposed Change
This RFC proposes standardizing two core aspects of KV-Cache awareness:
1. KVEvents Schema
- Reuse the existing KVEvents format used internally in vLLM as a versioned public interface for any KV-Cache publisher or consumer
- Consider light refactoring:
  - Use `bytes` for hashes instead of Python-native `int`
  - Reduce required fields where appropriate
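As an illustration of what a versioned, language-neutral event contract might look like, the sketch below models events with `bytes` hashes and an optional `token_ids` field, plus a tiny consumer that tracks which engine holds which block. The class and field names are loosely inspired by the existing KVEvents but are assumptions made for this example; `SCHEMA_VERSION`, `BlockIndex`, and `engine_id` are hypothetical and do not reflect the actual vLLM definitions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Union

SCHEMA_VERSION = 1  # hypothetical version tag for the public contract


@dataclass(frozen=True)
class BlockStored:
    block_hashes: List[bytes]                 # bytes instead of Python int
    parent_block_hash: Optional[bytes] = None
    token_ids: Optional[List[int]] = None     # optional if hashing is reproducible


@dataclass(frozen=True)
class BlockRemoved:
    block_hashes: List[bytes]


@dataclass(frozen=True)
class AllBlocksCleared:
    pass


KVEvent = Union[BlockStored, BlockRemoved, AllBlocksCleared]


@dataclass
class KVEventBatch:
    ts: float
    events: List[KVEvent]
    version: int = SCHEMA_VERSION


class BlockIndex:
    """Consumer-side view: which engines currently hold each block hash."""

    def __init__(self) -> None:
        self._holders: Dict[bytes, Set[str]] = {}

    def _drop(self, block_hash: bytes, engine_id: str) -> None:
        holders = self._holders.get(block_hash)
        if holders is None:
            return
        holders.discard(engine_id)
        if not holders:
            del self._holders[block_hash]

    def apply(self, engine_id: str, batch: KVEventBatch) -> None:
        if batch.version != SCHEMA_VERSION:
            raise ValueError(f"unsupported schema version {batch.version}")
        for event in batch.events:
            if isinstance(event, BlockStored):
                for h in event.block_hashes:
                    self._holders.setdefault(h, set()).add(engine_id)
            elif isinstance(event, BlockRemoved):
                for h in event.block_hashes:
                    self._drop(h, engine_id)
            elif isinstance(event, AllBlocksCleared):
                for h in list(self._holders):
                    self._drop(h, engine_id)

    def holders(self, block_hash: bytes) -> Set[str]:
        return set(self._holders.get(block_hash, ()))
```

Fixed-width `bytes` hashes map directly onto Go's `[]byte` or similar types, whereas arbitrary-precision Python integers need extra handling in most other runtimes.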
2. Prefix Block Hashing
- Use CBOR (canonical mode) for serializing token arrays and metadata
  - Other canonical algorithms are welcome. Today, serialization is coupled to Python.
- Support multiple standard hash functions (`SHA256`, `xxHash`)
- Gradually migrate the defaults to the language-independent options
PR on first two points:
These changes will support consistent block identity and event interpretation across runtimes, enabling robust interop between cache indexers and routing layers.
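A minimal sketch of the prefix block hashing idea, assuming each block key is SHA256 over a canonical CBOR array `[parent_hash, token_ids, extra_keys]`. The encoder below is a hand-rolled subset of CBOR (shortest-form heads for unsigned integers, byte strings, text strings, arrays, and null) written only so the example has no dependencies; a real implementation would more likely use an established CBOR library. `ROOT_HASH`, `hash_block`, `hash_prefix_blocks`, and the payload layout are hypothetical choices for illustration, not the scheme vLLM uses, and the actual `NONE_HASH` rule is left to the specification.

```python
import hashlib
from typing import List, Optional, Sequence

# Hypothetical root value for the first block of a chain.
ROOT_HASH = bytes(32)


def _cbor_head(major: int, n: int) -> bytes:
    """Encode a CBOR head using the shortest possible argument length."""
    if n < 24:
        return bytes([(major << 5) | n])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([(major << 5) | info]) + n.to_bytes(size, "big")
    raise ValueError("integer too large for CBOR")


def cbor_encode(obj) -> bytes:
    """Minimal deterministic CBOR for None, unsigned ints, bytes, str, lists."""
    if obj is None:
        return b"\xf6"
    if isinstance(obj, bool):
        raise TypeError("booleans are not supported in this sketch")
    if isinstance(obj, int):
        if obj < 0:
            raise ValueError("negative integers are not supported in this sketch")
        return _cbor_head(0, obj)
    if isinstance(obj, bytes):
        return _cbor_head(2, len(obj)) + obj
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return _cbor_head(3, len(data)) + data
    if isinstance(obj, (list, tuple)):
        return _cbor_head(4, len(obj)) + b"".join(cbor_encode(x) for x in obj)
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def hash_block(
    parent_hash: bytes,
    token_ids: Sequence[int],
    extra_keys: Optional[Sequence] = None,
) -> bytes:
    """Hash one block from its parent hash, token IDs, and optional extra keys."""
    payload = [
        parent_hash,
        list(token_ids),
        list(extra_keys) if extra_keys is not None else None,
    ]
    return hashlib.sha256(cbor_encode(payload)).digest()


def hash_prefix_blocks(token_ids: Sequence[int], block_size: int) -> List[bytes]:
    """Chain hashes over full blocks only; a trailing partial block is skipped."""
    hashes: List[bytes] = []
    parent = ROOT_HASH
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        parent = hash_block(parent, token_ids[start:start + block_size])
        hashes.append(parent)
    return hashes
```

Because the serialized bytes depend only on the CBOR rules and not on any Python-specific representation, a Go or Rust implementation that follows the same payload layout would be expected to produce the same digests. A per-request salt could be carried in `extra_keys` under this layout, though that is one possible design rather than a settled one.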
CC List
@robertgshaw2-redhat @njhill @YaoJiayi @dannyharnik @orozery