[Hybrid]: Decouple Kernel Block Size from KV Page Size #24486
Conversation
Code Review
This pull request introduces a hybrid cache architecture to decouple logical and physical block sizes, which is a significant enhancement for memory management. The changes span configuration, platform-specific code, and the core block table management. The implementation in block_table.py appears solid. However, I've identified some critical issues in the tests intended to validate this new functionality. The tests are flawed and do not correctly verify the hybrid block logic, which could mask bugs. Additionally, there's a piece of logic in the GPUModelRunner that could be made more robust. My review focuses on fixing these test and implementation issues to ensure the new feature is reliable and well-tested.
Also CC @tdoublep
Discussed with @zhiyuan1i offline. Two major concerns:
- I prefer to calculate kernel block size for each attention backend in gpu_model_runner
- would be great if `BlockTable.block_table` and `BlockTable.physical_block_table` can be merged into one tensor.
@heheda12345 Thanks for the prompt feedback! I've addressed suggestion 2 and merged `BlockTable.block_table` and `BlockTable.physical_block_table` into a single tensor as recommended. :)
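For readers following along, here is a minimal sketch of the idea behind keeping a single block-table tensor, assuming the KV manager's block size is an integer multiple of the kernel block size; the function name, tensor shapes, and expansion logic are illustrative assumptions, not the PR's actual implementation.

```python
import torch


def expand_to_kernel_blocks(
    manager_block_ids: torch.Tensor,  # [num_reqs, max_manager_blocks], int32
    manager_block_size: int,          # tokens per KV-manager block (physical page)
    kernel_block_size: int,           # tokens per kernel block
) -> torch.Tensor:
    """Illustrative only: map manager block IDs to kernel-granularity block IDs.

    Assumes manager_block_size is an integer multiple of kernel_block_size, so
    manager block `b` covers kernel blocks [b * ratio, (b + 1) * ratio).
    """
    assert manager_block_size % kernel_block_size == 0
    ratio = manager_block_size // kernel_block_size
    if ratio == 1:
        return manager_block_ids
    # Each manager block expands to `ratio` consecutive kernel blocks.
    offsets = torch.arange(ratio, dtype=manager_block_ids.dtype)
    expanded = manager_block_ids.unsqueeze(-1) * ratio + offsets
    return expanded.reshape(manager_block_ids.shape[0], -1)


# Example: manager blocks of 64 tokens backed by kernel blocks of 16 tokens.
table = torch.tensor([[3, 7]], dtype=torch.int32)
print(expand_to_kernel_blocks(table, manager_block_size=64, kernel_block_size=16))
# tensor([[12, 13, 14, 15, 28, 29, 30, 31]], dtype=torch.int32)
```

With an expansion of this kind, attention kernels can index the cache at their preferred granularity while the KV cache manager keeps allocating larger physical pages.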
CC @gshtras @hongxiayang as this also affects ROCm
LGTM! Thanks for this enhancement. Follow-ups:
- more clean-ups @heheda12345
- verify the `get_supported_kernel_block_size` of each attention backend.
```python
        else:
            self.reorder_batch_threshold = reorder_batch_threshold_i

    def _find_compatible_block_sizes(
```
(not a blocker) this function may be simplified.
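As a rough illustration of how such a helper could be simplified, here is a sketch that filters candidate kernel block sizes down to those that evenly divide the KV manager's block size; `MultipleOf` is a minimal local stand-in for vLLM's constraint type, and the selection policy is an assumption rather than the PR's actual logic.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class MultipleOf:
    """Minimal stand-in for vLLM's MultipleOf constraint: any multiple of `base`."""
    base: int


def find_compatible_block_sizes(
    kv_manager_block_size: int,
    supported: list[Union[int, MultipleOf]],
) -> list[int]:
    """Return concrete kernel block sizes that evenly divide the manager block size."""
    compatible: set[int] = set()
    for entry in supported:
        if isinstance(entry, MultipleOf):
            # Collect every divisor of kv_manager_block_size that is a multiple of base.
            compatible.update(
                size
                for size in range(entry.base, kv_manager_block_size + 1, entry.base)
                if kv_manager_block_size % size == 0
            )
        elif kv_manager_block_size % entry == 0:
            compatible.add(entry)
    return sorted(compatible)


# Example: a 128-token manager block with a backend declaring [MultipleOf(16)].
print(find_compatible_block_sizes(128, [MultipleOf(16)]))  # [16, 32, 64, 128]
```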
```python
num_blocks = raw_tensor.numel() // kv_cache_spec.page_size_bytes
if isinstance(kv_cache_spec, AttentionSpec):
    has_attn = True
    kv_manager_block_size = kv_cache_spec.block_size
```
(not a blocker) should we use the common block size of all attention groups in the same kv cache group here?
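For context on the snippet above, a hedged numeric illustration of the block-count arithmetic, assuming the raw cache tensor is a flat uint8 buffer (so `numel()` equals its size in bytes) and using made-up layer dimensions:

```python
import torch

# Illustrative numbers only: a full-attention layer storing both K and V,
# 8 KV heads, head_size 128, fp16 (2 bytes per element), 16 tokens per block.
tokens_per_block = 16
page_size_bytes = 2 * tokens_per_block * 8 * 128 * 2  # K and V planes

# Assumed: the raw buffer is allocated as flat bytes, here big enough for 64 blocks.
raw_tensor = torch.empty(64 * page_size_bytes, dtype=torch.uint8)
num_blocks = raw_tensor.numel() // page_size_bytes
print(num_blocks)  # 64
```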
```python
    @staticmethod
    def get_supported_kernel_block_size() -> list[Union[int, MultipleOf]]:
        return [MultipleOf(16)]
```
Technically FA3 would support MultipleOf(1) while FA2 would support MultipleOf(16); I don't think it's worth handling this, though.
Purpose
This PR introduces a hybrid cache architecture that separates the logical kernel block size from the physical page size, enabling more flexible memory management. Key changes include:
- attention backends declare their supported kernel block sizes via `get_supported_kernel_block_size()`;
- the kernel block size for each backend is resolved in `gpu_model_runner`;
- `BlockTable.block_table` and `BlockTable.physical_block_table` are merged into a single tensor mapping KV-manager blocks to kernel-granularity blocks.

This decoupling enables independent development of high-performance attention operators without being constrained by the large page sizes required by linear attention mechanisms like Mamba, addressing performance bottlenecks discussed in issues #24280 and #23161.
Test Plan
Added tests in `tests/v1/worker/test_gpu_model_runner.py` to verify the new hybrid block-size logic.
Test Result
`pytest tests/v1/worker/test_gpu_model_runner.py`: 20 tests pass.