Description
🚀 The feature, motivation and pitch
I would like the /pooling endpoint to support prefix caching for hidden states.
Background
The /pooling endpoint is designed to extract hidden states/embeddings by performing a full prefill pass over all input tokens. However, it currently doesn't support prefix caching: every request recomputes all tokens from scratch, even when the prefix has been seen before.
Feature Request:
Enable prefix caching for the /pooling endpoint, so that:
- Hidden states for cache-hit tokens are retrieved from cache (not recomputed)
- Only new/uncached tokens need computation
- The complete hidden states (cached + newly computed) are returned
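
To make the requested behavior concrete, here is a minimal sketch in plain Python. It is not vLLM code: `BLOCK_SIZE`, `toy_hidden_states`, and `pool_with_prefix_cache` are hypothetical names, and the "model" is a toy causal function (a running sum) standing in for a real forward pass. A real implementation would also need the prefix's KV cache to compute the new tokens; the toy function simply reads the prefix tokens instead.

```python
from typing import Callable, Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical block size in tokens (assumption, not a vLLM setting)

# Hypothetical cache: block-aligned token prefix -> hidden states of its last block.
HiddenCache = Dict[Tuple[int, ...], List[float]]


def toy_hidden_states(tokens: List[int], start: int) -> List[float]:
    """Toy causal 'model': the state at position i depends only on tokens[:i+1]."""
    return [float(sum(tokens[: i + 1])) for i in range(start, len(tokens))]


def pool_with_prefix_cache(
    tokens: List[int],
    cache: HiddenCache,
    compute: Callable[[List[int], int], List[float]] = toy_hidden_states,
) -> Tuple[List[float], int]:
    """Return (hidden states for all tokens, number of tokens served from cache)."""
    hidden: List[float] = []
    hit = 0
    # 1. Reuse cached hidden states block by block until the first miss.
    while hit + BLOCK_SIZE <= len(tokens):
        key = tuple(tokens[: hit + BLOCK_SIZE])
        if key not in cache:
            break
        hidden.extend(cache[key])
        hit += BLOCK_SIZE
    # 2. Compute only the uncached suffix.
    hidden.extend(compute(tokens, hit))
    # 3. Store every newly completed full block for later requests.
    for start in range(hit, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
        cache[tuple(tokens[: start + BLOCK_SIZE])] = hidden[start : start + BLOCK_SIZE]
    # 4. The caller receives cached + newly computed states as one sequence.
    return hidden, hit


cache: HiddenCache = {}
first, hit_first = pool_with_prefix_cache([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], cache)
second, hit_second = pool_with_prefix_cache([1, 2, 3, 4, 5, 6, 7, 8, 20, 30], cache)
print(hit_first, hit_second)
```

The second request shares its first eight tokens with the first one, so two full blocks come from the cache and only the last two positions are computed. Keying each block on the whole prefix, rather than on the block's own tokens, is what keeps a cached state valid: a causal hidden state depends on everything before it.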
 
Why This Matters:
Many applications process the same prefixes repeatedly (system prompts, instruction templates, etc.):
- Without hidden state caching: every /pooling request recomputes the entire sequence
- With hidden state caching: reuse cached hidden states -> only compute new tokens -> much better throughput
 
Alternatives
Currently, the only option is to use /pooling without prefix caching, which results in high latency for repeated prefixes.
Additional context
Related issues:
This feature would require caching hidden states alongside KV cache, sharing the same prefix matching logic and eviction policy.
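
One way to picture "sharing the same prefix matching logic and eviction policy" is a single cache entry per block that holds both the KV data and the hidden states, so the two can never be evicted separately. The sketch below is only an illustration under that assumption: `CachedBlock` and `SharedPrefixCache` are hypothetical names, plain lists stand in for tensors, and LRU is used purely as an example policy, not as a statement about vLLM's internals.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import List, Optional, Tuple

PrefixKey = Tuple[int, ...]  # block-aligned token prefix used as the lookup key


@dataclass
class CachedBlock:
    kv: List[float]      # placeholder for the block's KV tensors
    hidden: List[float]  # placeholder for the block's hidden states


class SharedPrefixCache:
    """One LRU map for both payloads, so a hit or an eviction covers KV and hidden states together."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._blocks: "OrderedDict[PrefixKey, CachedBlock]" = OrderedDict()

    def get(self, prefix: PrefixKey) -> Optional[CachedBlock]:
        block = self._blocks.get(prefix)
        if block is not None:
            self._blocks.move_to_end(prefix)  # mark as most recently used
        return block

    def put(self, prefix: PrefixKey, block: CachedBlock) -> None:
        self._blocks[prefix] = block
        self._blocks.move_to_end(prefix)
        while len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # evict the least recently used block

    def __len__(self) -> int:
        return len(self._blocks)


shared = SharedPrefixCache(capacity=2)
shared.put((1, 2), CachedBlock(kv=[0.1], hidden=[1.0]))
shared.put((1, 2, 3, 4), CachedBlock(kv=[0.2], hidden=[2.0]))
shared.get((1, 2))                                   # refresh the first block
shared.put((9, 9), CachedBlock(kv=[0.3], hidden=[3.0]))  # evicts (1, 2, 3, 4)
print(shared.get((1, 2, 3, 4)))
```

The trade-off is memory: every cached block now carries hidden states in addition to KV data, so fewer blocks fit in the same budget.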