
[RFC]: Support Layer-Specific KV-Cache Dtype & Layout in V1 Engine for Hybrid Models #23161

@zhiyuan1i

Description


Motivation.

Problem Statement

Currently, V1's ConstantCacheManager enforces:

- A uniform KV-cache dtype across all layers (e.g., it cannot mix BF16 for the Linear-attention short-conv path with FP32 for the recurrent cache).
- A FlashInfer-only layout ([num_blocks, 2, ...]), which breaks MLA (block_size=64) unless the FLASHINFER backend is used.

This prevents hybrid models with layer-specific requirements from running efficiently.

Motivation

We are developing an internal HybridModel that combines a custom Linear-attention variant (with a short-convolution path) with Multi-head Latent Attention (MLA). We have already integrated this model successfully into the V0 engine.
After merging Minimax's V1-engine updates from last week, we discovered that the current cache manager is too rigid for this topology.

Specifically:

- We need a BF16 KV-cache for the Linear-attention layers (driven by a short-convolution kernel), but an FP32 KV-cache for the recurrent / MLA layers. The V1 engine presently allocates a single, uniform dtype for the entire model, forcing us to over-allocate memory when we default everything to FP32 (see the rough arithmetic below).
- The engine also hard-codes the FlashInfer layout [num_blocks, 2, …] and a single block size for every layer. As a result, MLA (which we run with block_size=64) is incompatible unless we pin the backend to FLASHINFER, and skipping the safety checks produces corrupted outputs.

These constraints make it impossible to run the model without either (a) redundant memory usage or (b) a backend lock-in that we would prefer to avoid.
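
As a back-of-the-envelope illustration, defaulting the Linear-attention cache to FP32 doubles its footprint relative to BF16. The shapes below are made up for the example and are not taken from our model:

```python
# Hypothetical cache shape; only the dtype widths (2 vs. 4 bytes) matter here.
num_blocks, block_size, state_dim = 4096, 1008, 128

bf16_bytes = num_blocks * block_size * state_dim * 2  # BF16: 2 bytes/element
fp32_bytes = num_blocks * block_size * state_dim * 4  # FP32: 4 bytes/element
print(f"BF16: {bf16_bytes / 2**30:.2f} GiB, "
      f"FP32: {fp32_bytes / 2**30:.2f} GiB "
      f"({fp32_bytes / bf16_bytes:.0f}x allocation)")
```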

Proposed Change.

We'd like to open a design discussion on how V1 could allow layer-aware KV-cache policies. At a high level:

- Decouple the Linear-attention and standard-attention cache pools so that each can specify its own dtype and layout (a purely illustrative sketch follows at the end of this section).
- Lift the single-block-size requirement so that different attention mechanisms can select page sizes that match their kernels (e.g., 64 for MLA vs. 1008 for Linear attention).
- Avoid mandating FLASHINFER for all layers when only a subset actually needs it; for example, we use CUTLASS and Triton kernels.
We are intentionally not proposing a concrete patch at this stage; instead, we'd like the community's input on:

- Whether a per-layer metadata mechanism (or another approach) is the right direction.
- How to maintain backward compatibility while introducing this flexibility.
- Any hidden assumptions in the current block manager that would complicate heterogeneous allocations.
Feedback and alternative ideas are very welcome.
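
To ground the discussion, here is a minimal sketch of what per-layer metadata might look like. Every class, field, and layer name below is hypothetical and does not correspond to an existing vLLM V1 API; it is only meant to show the shape of the information a layer-aware cache manager would need.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

import torch


@dataclass
class LayerKVCacheSpec:
    """Hypothetical per-layer metadata the cache manager could consume."""
    dtype: torch.dtype        # e.g. torch.bfloat16 or torch.float32
    block_size: int           # e.g. 64 for MLA, 1008 for Linear attention
    layout: Tuple[str, ...]   # symbolic layout instead of a hard-coded [num_blocks, 2, ...]
    backend: str              # e.g. "FLASHINFER", "TRITON", "CUTLASS"


# One spec per attention layer; layers that share a spec could be grouped
# into a common cache pool with its own dtype, layout, and page size.
kv_cache_specs: Dict[str, LayerKVCacheSpec] = {
    "model.layers.0.linear_attn": LayerKVCacheSpec(
        dtype=torch.bfloat16,
        block_size=1008,
        layout=("num_blocks", "block_size", "state_dim"),
        backend="TRITON",
    ),
    "model.layers.1.mla_attn": LayerKVCacheSpec(
        dtype=torch.float32,
        block_size=64,
        layout=("num_blocks", "block_size", "kv_lora_rank"),
        backend="CUTLASS",
    ),
}
```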

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
