
[RFC]: Support Layer-Specific KV-Cache Dtype & Layout in V1 Engine for Hybrid Models #23161

@zhiyuan1i

Description


Motivation.

Problem Statement

Currently, V1's ConstantCacheManager enforces:

- A uniform KV-cache dtype across all layers (e.g., it cannot mix BF16 for the Linear-attention short-conv path with FP32 for the recurrent cache).
- A FlashInfer-only layout ([num_blocks, 2, ...]), which breaks MLA (block_size=64) unless the FLASHINFER backend is used.

This prevents hybrid models with layer-specific requirements from running efficiently.

Motivation

We are developing an internal HybridModel that combines a custom Linear-attention variant (with a short-convolution path) with Multi-head Latent Attention (MLA). We have already integrated this model successfully into the V0 engine.
After merging Minimax's V1-engine updates from last week, we discovered that the current cache manager is too rigid for this topology.

Specifically:

- We need a BF16 KV-cache for the Linear-attention layers (driven by a short-convolution kernel), but an FP32 KV-cache for the recurrent / MLA layers. The V1 engine presently allocates a single, uniform dtype for the entire model, forcing us to over-allocate memory when we default everything to FP32 (see the rough arithmetic below).
- The engine also hard-codes the FlashInfer layout [num_blocks, 2, …] and a single block size for every layer. As a result, MLA (which we run with block_size=64) is incompatible unless we pin the backend to FLASHINFER, and skipping the safety checks produces corrupted outputs.

These constraints make it impossible to run the model without either (a) redundant memory usage or (b) a backend lock-in that we would prefer to avoid.
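
As a back-of-the-envelope illustration, defaulting the Linear-attention cache to FP32 doubles its footprint relative to BF16. The shapes below are made up for the example and are not taken from our model:

```python
# Hypothetical cache shape; only the dtype widths (2 vs. 4 bytes) matter here.
num_blocks, block_size, state_dim = 4096, 1008, 128

bf16_bytes = num_blocks * block_size * state_dim * 2  # BF16: 2 bytes/element
fp32_bytes = num_blocks * block_size * state_dim * 4  # FP32: 4 bytes/element
print(f"BF16: {bf16_bytes / 2**30:.2f} GiB, "
      f"FP32: {fp32_bytes / 2**30:.2f} GiB "
      f"({fp32_bytes / bf16_bytes:.0f}x allocation)")
```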

Proposed Change.

We'd like to open a design discussion on how V1 could allow layer-aware KV-cache policies. At a high level:

- Decouple the Linear-attention and standard-attention cache pools so that each can specify its own dtype and layout (a purely illustrative sketch follows at the end of this section).
- Lift the single-block-size requirement so that different attention mechanisms can select page sizes that match their kernels (e.g., 64 for MLA vs. 1008 for Linear attention).
- Avoid mandating FLASHINFER for all layers when only a subset actually needs it; for example, we use CUTLASS and Triton kernels.
We are intentionally not proposing a concrete patch at this stage; instead, we'd like the community's input on:

- Whether a per-layer metadata mechanism (or another approach) is the right direction.
- How to maintain backward compatibility while introducing this flexibility.
- Any hidden assumptions in the current block manager that would complicate heterogeneous allocations.
Feedback and alternative ideas are very welcome.
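
To ground the discussion, here is a minimal sketch of what per-layer metadata might look like. Every class, field, and layer name below is hypothetical and does not correspond to an existing vLLM V1 API; it is only meant to show the shape of the information a layer-aware cache manager would need.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

import torch


@dataclass
class LayerKVCacheSpec:
    """Hypothetical per-layer metadata the cache manager could consume."""
    dtype: torch.dtype        # e.g. torch.bfloat16 or torch.float32
    block_size: int           # e.g. 64 for MLA, 1008 for Linear attention
    layout: Tuple[str, ...]   # symbolic layout instead of a hard-coded [num_blocks, 2, ...]
    backend: str              # e.g. "FLASHINFER", "TRITON", "CUTLASS"


# One spec per attention layer; layers that share a spec could be grouped
# into a common cache pool with its own dtype, layout, and page size.
kv_cache_specs: Dict[str, LayerKVCacheSpec] = {
    "model.layers.0.linear_attn": LayerKVCacheSpec(
        dtype=torch.bfloat16,
        block_size=1008,
        layout=("num_blocks", "block_size", "state_dim"),
        backend="TRITON",
    ),
    "model.layers.1.mla_attn": LayerKVCacheSpec(
        dtype=torch.float32,
        block_size=64,
        layout=("num_blocks", "block_size", "kv_lora_rank"),
        backend="CUTLASS",
    ),
}
```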

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
