[Feature]: Support Prefix Caching for Hidden States (Pooling Endpoint) #26839

@Risc-lt

🚀 The feature, motivation and pitch

I would like the /pooling endpoint to support prefix caching for hidden states.

Background

The /pooling endpoint extracts hidden states/embeddings by running a full prefill pass over all input tokens. However, it currently does not support prefix caching: every request recomputes all tokens from scratch, even when the same prefix has already been processed.

Feature Request:

Enable prefix caching for the /pooling endpoint, so that:

  • Hidden states for cache-hit tokens are retrieved from cache (not recomputed)
  • Only new/uncached tokens need computation
  • The complete hidden states (cached + newly computed) are returned
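The three points above can be sketched as a toy, block-level cache. This is only an illustration under stated assumptions, not vLLM's implementation: `HiddenStatePrefixCache`, `BLOCK_SIZE`, and `fake_hidden_state` are hypothetical names, the "model" is a stand-in function, and the approach assumes causal attention, so that a token's hidden state depends only on the tokens before it.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical block size in tokens


def fake_hidden_state(prefix: Tuple[int, ...]) -> float:
    """Stand-in for a forward pass: depends on every token seen so far."""
    return float(sum(prefix))


class HiddenStatePrefixCache:
    """Toy cache keyed by full-block token prefixes (illustrative only)."""

    def __init__(self) -> None:
        # Maps a prefix ending on a block boundary to that block's hidden states.
        self.blocks: Dict[Tuple[int, ...], List[float]] = {}
        self.computed_tokens = 0  # counts tokens that were actually computed

    def pool(self, tokens: List[int]) -> List[float]:
        out: List[float] = []
        pos = 0
        # 1) Cache-hit tokens: reuse the longest run of cached full blocks.
        while pos + BLOCK_SIZE <= len(tokens):
            key = tuple(tokens[: pos + BLOCK_SIZE])
            if key not in self.blocks:
                break
            out.extend(self.blocks[key])
            pos += BLOCK_SIZE
        first_uncached = pos
        # 2) Uncached tokens: compute only what is missing.
        for i in range(first_uncached, len(tokens)):
            out.append(fake_hidden_state(tuple(tokens[: i + 1])))
            self.computed_tokens += 1
        # 3) Store any newly completed full blocks for later requests.
        start = first_uncached
        while start + BLOCK_SIZE <= len(tokens):
            key = tuple(tokens[: start + BLOCK_SIZE])
            self.blocks[key] = out[start : start + BLOCK_SIZE]
            start += BLOCK_SIZE
        # The caller receives cached + newly computed hidden states together.
        return out
```

In this sketch, a second request that shares an 8-token prefix with an earlier one only computes the tokens after that prefix, while the returned list still covers the whole sequence.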

Why This Matters:

Many applications process the same prefixes repeatedly (system prompts, instruction templates, etc.):

  • Without hidden state caching: every /pooling request recomputes the entire sequence
  • With hidden state caching: cached hidden states are reused and only new tokens are computed, which should improve throughput for requests that share a prefix

Alternatives

Currently, the only option is to use /pooling without prefix caching, which results in high latency for repeated prefixes.

Additional context

Related issues:

This feature would require caching hidden states alongside KV cache, sharing the same prefix matching logic and eviction policy.
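One way to picture "same prefix matching logic and eviction policy" is a single cache entry that holds both the KV block and the hidden-state block, so a lookup hits both and eviction drops both together. The sketch below is an assumption-laden illustration: `CachedBlock` and `SharedPrefixLRU` are hypothetical names, plain lists stand in for tensors, and LRU is chosen only as an example policy, since the issue does not name one.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import List, Optional, Tuple

PrefixKey = Tuple[int, ...]  # hypothetical key: the token prefix of a block


@dataclass
class CachedBlock:
    kv: List[float]      # placeholder for the block's KV tensors
    hidden: List[float]  # placeholder for the block's hidden states


class SharedPrefixLRU:
    """Toy LRU where KV and hidden states share one entry per prefix."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[PrefixKey, CachedBlock]" = OrderedDict()

    def get(self, key: PrefixKey) -> Optional[CachedBlock]:
        block = self.entries.get(key)
        if block is not None:
            # A hit refreshes recency for KV and hidden states at once.
            self.entries.move_to_end(key)
        return block

    def put(self, key: PrefixKey, block: CachedBlock) -> None:
        self.entries[key] = block
        self.entries.move_to_end(key)
        # Evict least recently used entries; KV and hidden states go together.
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
```

Keeping both payloads in one entry means the two caches cannot disagree about which prefixes are present, at the cost of the extra memory that hidden states add to every cached block.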
