
[RFC]: KV-Cache Interoperability API Standardization #20492

@vMaroon


Motivation

This RFC proposes a KV-Cache Interoperability API, covering standardized notification events (via KVEvents) and reproducible prefix-block hashing. These standards aim to support cross-system cache awareness, observability, and future tooling for indexing, routing, and diagnostics.

vLLM already ships with internal KVEvents, contributed by the NVIDIA Dynamo team, which provide a strong foundation.
But as external systems move toward cache-aware inference, these internal mechanisms need to be treated as public contracts to support broader adoption and interoperability.

Goals

  1. KVEvents Internal API as a Public Contract
    The KVEvents schema is already well-defined in vLLM and used internally by the KVCacheManager for GPU cache events. It’s also being extended to CPU offloading via the KVConnector (see #19854).
    This RFC proposes formalizing KVEvents as the public contract for any component emitting or consuming KV-Cache lifecycle events, including external indexers, routers, and engines.

  2. Ensure Reproducible Block Hashing Across Languages
    Prefix cache block keys must be computed the same way across runtimes (e.g., Python, Go). This requires:

    • Canonical serialization (e.g., CBOR)
    • Consistent hashing algorithms (e.g., SHA256, xxHash)
    • Defined structure for input objects (e.g., token arrays, extra_keys)
    • Explicit rules for special cases, such as the NONE_HASH root
    • Alignment on security features such as per-request hash-salting

    Note: in the current KVEvents schema, token IDs are sent along with their block hashes. This makes external indexing possible by mapping tokens to externally computed hashes, and those in turn to vLLM's hashes. While this avoids the need for reproducible hashing and configuration syncing, it requires complex indexing and lookups, plus the network overhead of sending the 32-bit token IDs in every event.

  3. Enable Language-Agnostic Interop
    Develop shared guidance and reference libraries in Python, Go, and other widely used languages. These utilities do not need to reside within vLLM, but should remain consistent with its specifications.
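To make the hashing requirements above concrete, here is a minimal sketch of a reproducible prefix-block hash, assuming CBOR as the canonical serialization and SHA-256 as the hash. To stay self-contained it includes a tiny deterministic CBOR encoder for only the types needed, rather than a full CBOR library. Chaining each block over `(parent_hash, token_ids, extra_keys)` is loosely modeled on the structure described above; the helper names (`cbor_encode`, `hash_block`, `prefix_block_hashes`), the way `NONE_HASH` is derived, and folding the salt into the first block's extra keys are all hypothetical choices, and they are exactly the details a standard would need to pin down.

```python
import hashlib
import struct

# --- Minimal deterministic CBOR (RFC 8949) encoder ---------------------------
# Hypothetical helper covering only the types used below. A real
# implementation would use a CBOR library configured for canonical output.


def _head(major: int, arg: int) -> bytes:
    """Initial byte plus argument, always in the shortest encoding."""
    if arg < 24:
        return bytes([(major << 5) | arg])
    if arg < 1 << 8:
        return bytes([(major << 5) | 24, arg])
    if arg < 1 << 16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", arg)
    if arg < 1 << 32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", arg)
    if arg < 1 << 64:
        return bytes([(major << 5) | 27]) + struct.pack(">Q", arg)
    raise ValueError("integer out of range for this minimal encoder")


def cbor_encode(obj) -> bytes:
    if obj is None:
        return b"\xf6"                        # CBOR null
    if isinstance(obj, bool):                 # check before int (bool is an int)
        return b"\xf5" if obj else b"\xf4"
    if isinstance(obj, int):
        return _head(0, obj) if obj >= 0 else _head(1, -1 - obj)
    if isinstance(obj, bytes):
        return _head(2, len(obj)) + obj
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return _head(3, len(data)) + data
    if isinstance(obj, (list, tuple)):
        return _head(4, len(obj)) + b"".join(cbor_encode(x) for x in obj)
    raise TypeError(f"unsupported type: {type(obj).__name__}")


# --- Chained prefix-block hashing --------------------------------------------
# Hypothetical root value: every publisher and consumer must agree on it exactly.
NONE_HASH = hashlib.sha256(cbor_encode("NONE_HASH")).digest()


def hash_block(parent: bytes, tokens, extra_keys=None) -> bytes:
    """SHA-256 over CBOR([parent_hash, token_ids, extra_keys])."""
    return hashlib.sha256(cbor_encode([parent, list(tokens), extra_keys])).digest()


def prefix_block_hashes(token_ids, block_size: int, salt=None) -> list:
    """Chained hashes for every full block; a trailing partial block is skipped."""
    hashes, parent = [], NONE_HASH
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        # Assumption: a per-request salt is mixed into the first block only;
        # because hashes are chained, it still changes every later hash.
        extra = [salt] if i == 0 and salt is not None else None
        parent = hash_block(parent, block, extra)
        hashes.append(parent)
    return hashes
```

Because each hash commits to its parent, two requests produce the same hash for block i only if they agree on every token up to the end of that block. A Go or Rust port that emits byte-identical CBOR for the same inputs would reproduce these hashes exactly, which is the cross-language property the standard needs to guarantee.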

Proposed Change

This RFC proposes standardizing two core aspects of KV-Cache awareness:

1. KVEvents Schema

  • Reuse the existing KVEvents format used internally in vLLM as a versioned public interface for any KV-Cache publisher or consumer
  • Consider light refactoring:
    • Use bytes for hashes instead of Python-native int
    • Reduce required fields where appropriate

2. Prefix Block Hashing

These changes will support consistent block identity and event interpretation across runtimes, enabling robust interop between cache indexers and routing layers.

CC List

@robertgshaw2-redhat @njhill @YaoJiayi @dannyharnik @orozery
