[Hybrid]: Decouple Kernel Block Size from KV Page Size #24486
Conversation
Code Review
This pull request introduces a hybrid cache architecture to decouple logical and physical block sizes, which is a significant enhancement for memory management. The changes span configuration, platform-specific code, and the core block table management. The implementation in block_table.py appears solid. However, I've identified some critical issues in the tests intended to validate this new functionality. The tests are flawed and do not correctly verify the hybrid block logic, which could mask bugs. Additionally, there's a piece of logic in the GPUModelRunner that could be made more robust. My review focuses on fixing these test and implementation issues to ensure the new feature is reliable and well-tested.
Also CC @tdoublep
Discussed with @zhiyuan1i offline. Two major concerns:
- I prefer to calculate kernel block size for each attention backend in gpu_model_runner
- would be great if `BlockTable.block_table` and `BlockTable.physical_block_table` can be merged into one tensor.
@heheda12345 Thanks for the prompt feedback! I've addressed suggestion 2 and merged `BlockTable.block_table` and `BlockTable.physical_block_table` into a single tensor as recommended. :)
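For readers following along, here is a minimal sketch of the idea behind keeping a single block-table tensor, assuming the KV manager's block size is an integer multiple of the kernel block size; the function name, tensor shapes, and expansion logic are illustrative assumptions, not the PR's actual implementation.

```python
import torch


def expand_to_kernel_blocks(
    manager_block_ids: torch.Tensor,  # [num_reqs, max_manager_blocks], int32
    manager_block_size: int,          # tokens per KV-manager block (physical page)
    kernel_block_size: int,           # tokens per kernel block
) -> torch.Tensor:
    """Illustrative only: map manager block IDs to kernel-granularity block IDs.

    Assumes manager_block_size is an integer multiple of kernel_block_size, so
    manager block `b` covers kernel blocks [b * ratio, (b + 1) * ratio).
    """
    assert manager_block_size % kernel_block_size == 0
    ratio = manager_block_size // kernel_block_size
    if ratio == 1:
        return manager_block_ids
    # Each manager block expands to `ratio` consecutive kernel blocks.
    offsets = torch.arange(ratio, dtype=manager_block_ids.dtype)
    expanded = manager_block_ids.unsqueeze(-1) * ratio + offsets
    return expanded.reshape(manager_block_ids.shape[0], -1)


# Example: manager blocks of 64 tokens backed by kernel blocks of 16 tokens.
table = torch.tensor([[3, 7]], dtype=torch.int32)
print(expand_to_kernel_blocks(table, manager_block_size=64, kernel_block_size=16))
# tensor([[12, 13, 14, 15, 28, 29, 30, 31]], dtype=torch.int32)
```

With an expansion of this kind, attention kernels can index the cache at their preferred granularity while the KV cache manager keeps allocating larger physical pages.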
CC @gshtras @hongxiayang as this also affects ROCm
LGTM! Thanks for this enhancement. Follow-ups:
- more clean-ups @heheda12345
- verify the `get_supported_kernel_block_size` of each attention backend.
```python
        else:
            self.reorder_batch_threshold = reorder_batch_threshold_i

    def _find_compatible_block_sizes(
```
(not a blocker) this function may be simplified.
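As a rough illustration of how such a helper could be simplified, here is a sketch that filters candidate kernel block sizes down to those that evenly divide the KV manager's block size; `MultipleOf` is a minimal local stand-in for vLLM's constraint type, and the selection policy is an assumption rather than the PR's actual logic.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class MultipleOf:
    """Minimal stand-in for vLLM's MultipleOf constraint: any multiple of `base`."""
    base: int


def find_compatible_block_sizes(
    kv_manager_block_size: int,
    supported: list[Union[int, MultipleOf]],
) -> list[int]:
    """Return concrete kernel block sizes that evenly divide the manager block size."""
    compatible: set[int] = set()
    for entry in supported:
        if isinstance(entry, MultipleOf):
            # Collect every divisor of kv_manager_block_size that is a multiple of base.
            compatible.update(
                size
                for size in range(entry.base, kv_manager_block_size + 1, entry.base)
                if kv_manager_block_size % size == 0
            )
        elif kv_manager_block_size % entry == 0:
            compatible.add(entry)
    return sorted(compatible)


# Example: a 128-token manager block with a backend declaring [MultipleOf(16)].
print(find_compatible_block_sizes(128, [MultipleOf(16)]))  # [16, 32, 64, 128]
```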
```python
num_blocks = raw_tensor.numel() // kv_cache_spec.page_size_bytes
if isinstance(kv_cache_spec, AttentionSpec):
    has_attn = True
    kv_manager_block_size = kv_cache_spec.block_size
```
(not a blocker) should we use the common block size of all attention groups in the same kv cache group here?
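For context on the snippet above, a hedged numeric illustration of the block-count arithmetic, assuming the raw cache tensor is a flat uint8 buffer (so `numel()` equals its size in bytes) and using made-up layer dimensions:

```python
import torch

# Illustrative numbers only: a full-attention layer storing both K and V,
# 8 KV heads, head_size 128, fp16 (2 bytes per element), 16 tokens per block.
tokens_per_block = 16
page_size_bytes = 2 * tokens_per_block * 8 * 128 * 2  # K and V planes

# Assumed: the raw buffer is allocated as flat bytes, here big enough for 64 blocks.
raw_tensor = torch.empty(64 * page_size_bytes, dtype=torch.uint8)
num_blocks = raw_tensor.numel() // page_size_bytes
print(num_blocks)  # 64
```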
```python
    @staticmethod
    def get_supported_kernel_block_size() -> list[Union[int, MultipleOf]]:
        return [MultipleOf(16)]
```
Technically FA3 would support MultipleOf(1) while FA2 would support MultipleOf(16); I don't think it's worth handling this, though.
Purpose
This PR introduces a hybrid cache architecture that separates the logical kernel block size from the physical page size, enabling more flexible memory management. Key changes include:
- attention backends declare their supported kernel block sizes via `get_supported_kernel_block_size()`;
- the kernel block size for each backend is resolved in `gpu_model_runner`;
- `BlockTable.block_table` and `BlockTable.physical_block_table` are merged into a single tensor mapping KV-manager blocks to kernel-granularity blocks.

This decoupling enables independent development of high-performance attention operators without being constrained by the large page sizes required by linear attention mechanisms like Mamba, addressing performance bottlenecks discussed in issues #24280 and #23161.
Test Plan
Added tests in `tests/v1/worker/test_gpu_model_runner.py` to verify the new hybrid block-size logic.
Test Result
`pytest tests/v1/worker/test_gpu_model_runner.py`: 20 tests pass.