[RFC]: Cache Salting for Secure and Flexible Prefix Caching in vLLM #16016

@dr75

Description

Motivation.

vLLM’s automatic prefix caching (APC) improves inference performance by reusing previously computed KV cache across requests. The prefix cache operates by hashing the input in fixed-size blocks (typically 16 tokens) and can use strong hashing (e.g., SHA-256) to guard against hash collisions.
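
For concreteness, the following is a minimal Python sketch of chained block hashing. It is illustrative only, not vLLM's actual implementation, but it captures the property the rest of this proposal relies on: each block hash incorporates its predecessor's hash, so a cached block can only be reused when the entire prefix matches.

import hashlib

BLOCK_SIZE = 16  # tokens per block, matching the typical default mentioned above

def hash_blocks(token_ids: list[int]) -> list[bytes]:
    """Chain-hash full token blocks; a partial trailing block is not cached."""
    hashes: list[bytes] = []
    parent = b""
    num_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, num_full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        # Each block hash covers the previous hash plus this block's tokens.
        h = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(h)
        parent = h
    return hashes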

However, as demonstrated in Leaking Secrets from Prefix Caches, the cache remains vulnerable to timing-based side-channel attacks. An attacker can infer cache reuse by guessing popular inputs and measuring response latency, potentially compromising privacy, especially in shared or multi-tenant environments.

What’s missing in vLLM today is a simple, effective way to segment cache reuse across groups while preserving performance. Note that other providers also limit cache sharing; OpenAI, for example, restricts prompt cache sharing to within an organization.

Proposed Change.

We propose to salt the block hashes by introducing an optional top-level field in the request schema, which allows users to set the salt(s) used for hashing per request. Users who share a salt can still benefit from prefix caching among themselves, while the prompt is protected from anyone who does not know the salt.

We analyzed two different designs, which have different benefits and drawbacks.

(A) Single-barrier design - one salt per request

{
  "messages": [{
    "role": "system",
    "content": "You are a helpful assistant."
  }, {
    "role": "user",
    "content": "Here is a document with details about the world series: ..."
  }, {
    "role": "user",
    "content": "Who won the world series in 2020?"
  }],
  "cache_salt": "Z3V2bmV3aGxza3ZubGFoZ3Zud3V3ZWZ2bmd0b3V2bnZmc2xpZ3RoZ2x2aQ=="
}

If cache_salt is provided, this value:

  • Is injected into the hash of the first block of a request.
  • Ensures that cache reuse is restricted to requests using the same salt.
  • Prevents cache state inference across salt boundaries, i.e., every user who provides the same salt can use the cache and is protected against timing attacks from users who don't share the salt.

The single-barrier design protects requests that provide a salt while still allowing cache reuse among requests that use the same salt. It is easy to use and easy to implement in vLLM, but its flexibility is limited: it cannot allow cache reuse between users and protect individual users' prompts at the same time.
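
As an illustration (again a sketch, not vLLM's implementation), the salt can simply seed the hash chain; it enters the first block's hash directly, and chaining propagates its effect to every later block:

import hashlib

BLOCK_SIZE = 16

def hash_blocks_salted(token_ids: list[int], cache_salt: str | None = None) -> list[bytes]:
    """Design (A): seed the hash chain with the salt, if one is given."""
    hashes: list[bytes] = []
    # The salt only enters the first block's hash directly...
    parent = cache_salt.encode() if cache_salt else b""
    num_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, num_full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        # ...but chaining makes every downstream hash salt-dependent.
        h = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(h)
        parent = h
    return hashes

# Identical tokens with different salts share no block hashes,
# so neither request can observe the other's cache hits.
tokens = list(range(32))
assert not set(hash_blocks_salted(tokens, "salt-a")) & set(hash_blocks_salted(tokens, "salt-b"))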

(B) Multi-barrier design - per-message salts keyed by message index

To overcome this limitation of design (A), we propose an alternative design that allows an arbitrary number of salts, each assigned to a specific message.

{
  "messages": [{
    "role": "system",
    "content": "You are a helpful assistant."
  }, {
    "role": "user",
    "content": "Here is a document with details about the world series: ..."
  }, {
    "role": "user",
    "content": "Who won the world series in 2020?"
  }],
  "cache_salt_map": {
    "1": "org-salt",
    "2": "user-salt"
  }
}

If cache_salt_map is provided, each salt:

  • Is injected into the hash of the corresponding message, identified by its zero-based index.
    • The system message (index 0) carries no salt, so its KV cache is reused across all users.
    • "org-salt" is assigned to message 1 ("Here is a document ..."). Only requests that use "org-salt" can reuse the KV cache of the document; this message (and all following messages) is protected against attacks from users outside the org who do not have access to the salt.
    • "user-salt" is assigned to message 2 ("Who won ..."). Only requests with "user-salt" reuse the cache of this and following messages, so prompts are protected against attacks even from within the same org, including attackers with access to the shared org-salt.
  • Ensures that cache reuse is restricted to prompt prefixes that share the same salts.
  • Prevents cache state inference across salt boundaries.
  • Allows cache reuse within a group of users while protecting the prompts of individual users or subgroups, supporting an arbitrary hierarchy of user groups. The most sensitive information can be protected by keeping its salt local and unshared. A sketch of how per-message salts could enter the hash chain follows below.
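
As a sketch of design (B), suppose the chat-template rendering stage can report the token offset at which each salted message begins; the salt_at_token map below is hypothetical, and producing it is exactly the implementation complexity discussed in the comparison. Each salt is then mixed into the hash of the block containing that offset:

import hashlib

BLOCK_SIZE = 16

def hash_blocks_multi(token_ids: list[int], salt_at_token: dict[int, str]) -> list[bytes]:
    """Design (B): mix each salt into the block where its message starts."""
    hashes: list[bytes] = []
    parent = b""
    num_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, num_full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        # Salts of all messages starting inside this block (message
        # boundaries need not align with block boundaries).
        salts = b"".join(s.encode() for off, s in sorted(salt_at_token.items())
                         if i <= off < i + BLOCK_SIZE)
        h = hashlib.sha256(parent + salts + repr(block).encode()).digest()
        hashes.append(h)
        parent = h
    return hashes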

Comparison

(A) Single-barrier design

  • A single salt means only one barrier per request.
    • Allows cache sharing & prompt protection within a group of users OR for a single user, but not both (e.g., share a doc but protect the question).
    • No hierarchical cache reuse (stages of shared caches).
    • Practically disables large-scale cache reuse whenever small-scale protection is required.
  • Simple API.
  • Simple implementation, as only one salt needs to be added to the first block's hash.

(B) Multi-barrier design

  • High flexibility:
    • Hierarchical cache reuse (stages of shared caches) and prompt protection (stages of security).
    • E.g., allows cache reuse across users AND user prompt protection at the same time (e.g., share a doc but protect the question).
  • More complex API with a salt-to-message mapping.
  • More complex implementation, as the mapping of salts to messages has to be propagated to token positions through chat-template rendering so that each salt can be injected into the block hash at the right location.

Design Details

  • Salts are base64-encoded 256-bit (32-byte) values (see the generation example after this list).
  • If no salt is provided, the default global caching behavior applies.
  • Salts are not passed to the model; they are only used internally for block hashing.
  • Only the first hashed block of a message (or of the whole request, in the single-barrier design) incorporates the salt directly. All following blocks are protected because each block hash includes its predecessor's hash.
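
Generating a conforming salt is straightforward; for example, in Python:

import base64
import os

# 32 bytes (256 bits) of cryptographically secure randomness,
# base64-encoded as required above (yields a 44-character string).
cache_salt = base64.b64encode(os.urandom(32)).decode("ascii")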

Implementation Considerations

  • If cache_salt is omitted, the cache is globally shared (status quo), with no performance impact.
  • With both designs, clients can generate salts per user group, e.g.:
    • per organization or group of users, enabling cache reuse inside that group;
    • per user, isolating a user from other users of the same org while still allowing reuse of that user's own cache.
    • Only the multi-barrier design supports reuse within one group while also protecting subgroups or individual users.
  • Service providers that build on vLLM can apply a default salting scheme, e.g., per organization, while still allowing flexibility by overriding the salt as needed (a usage sketch follows below).
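
As a usage sketch (hypothetical, since the field is only proposed here), a client or provider-side proxy could attach an organization-wide salt to every chat completion request against vLLM's OpenAI-compatible server; the model name and endpoint below are examples:

import requests

# Shared within the organization; individual users need not know it if a
# provider-side proxy injects it.
ORG_SALT = "Z3V2bmV3aGxza3ZubGFoZ3Zud3V3ZWZ2bmd0b3V2bnZmc2xpZ3RoZ2x2aQ=="

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"},
        ],
        "cache_salt": ORG_SALT,  # proposed field from design (A)
    },
)
print(resp.json()["choices"][0]["message"]["content"])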

Conclusion

As the multi-barrier design allows protecting user prompts while sharing the cache between users, we propose implementing the multi-barrier design (option B) in the long run. Due to the complexity of its implementation (e.g., forwarding message boundaries through template rendering), we would take an iterative approach: start with the single-barrier design and add the multi-barrier implementation afterwards.

The two designs have incompatible APIs (assuming a union of string and dict types for the same field is not desired). To provide both in the API (implementing A first and B later), there are two options:

  • use two different request fields (cache_salt and cache_salt_map), one per design, or
  • start with cache_salt_map (or just cache_salts) but allow only a single value ("0": "<SALT>") until the multi-barrier design is implemented.

We suggest the latter approach to keep the API simpler in the long run.
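
Under that approach, an initial single-barrier request would look as follows (messages elided; the field name is still open for discussion):

{
  "messages": [ ... ],
  "cache_salt_map": {
    "0": "Z3V2bmV3aGxza3ZubGFoZ3Zud3V3ZWZ2bmd0b3V2bnZmc2xpZ3RoZ2x2aQ=="
  }
}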

Feedback Period.

One week from April 3, 2025.

CC List.

@comaniac

Any Other Things.

No response
