🚀 The feature, motivation and pitch
Currently, the attention block size (AKA page size) used in the KVCacheManager is also used by the attention backend. Yet, in some cases the KVCacheManager and the attention backend have conflicting constraints on the attention page size. For example, in hybrid mamba/attention models the KVCacheManager requires the mamba and attention page sizes to match, which can force a large attention page size (672 tokens for nvidia/NVIDIA-Nemotron-Nano-9B-v2). On the other hand, the FlashInfer attention backend uses TRTLLM FMHA kernels on Blackwell, which require a page size of at most 128 (see here).
The idea is to decouple the two page sizes. If the KVCacheManager requires a page size larger than what the attention backend supports, we can expose each KVCacheManager page to the attention backend as multiple smaller pages. Concretely, if the KVCacheManager needs block_size=1024 and the attention backend needs block_size=128, we can pass each block_size=1024 page to the attention backend as 8 block_size=128 pages.
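A minimal sketch of the block-table expansion this implies, assuming the paged KV cache is laid out so that manager block `b` occupies the same memory as backend sub-blocks `b * ratio .. b * ratio + ratio - 1` (all names here are illustrative, not vLLM's actual API):

```python
import numpy as np

def split_block_table(block_table: np.ndarray,
                      manager_block_size: int,
                      backend_block_size: int) -> np.ndarray:
    """Hypothetical helper: expand a block table built for the
    KVCacheManager's block size into an equivalent block table for the
    attention backend's smaller block size.

    Assumes manager_block_size is a multiple of backend_block_size and
    that manager block b aliases backend blocks b*ratio .. b*ratio+ratio-1.
    """
    assert manager_block_size % backend_block_size == 0
    ratio = manager_block_size // backend_block_size  # e.g. 1024 // 128 == 8
    # block_table: [num_seqs, max_blocks] of manager block ids.
    sub_offsets = np.arange(ratio)  # [0, 1, ..., ratio - 1]
    # Each manager block id b maps to backend ids b*ratio + offset.
    expanded = block_table[:, :, None] * ratio + sub_offsets
    # Flatten to [num_seqs, max_blocks * ratio] for the backend.
    return expanded.reshape(block_table.shape[0], -1)

# Example: a sequence whose KV cache lives in manager blocks [3, 7]
# (block_size=1024) is presented to the backend as blocks 24..31 and
# 56..63 (block_size=128).
table = np.array([[3, 7]])
print(split_block_table(table, 1024, 128))
```

Since the sub-pages of a manager page are contiguous in memory, this is purely a reindexing of the block table; no KV data needs to be copied.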
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.