[Hybrid Allocator] Support full attention with different hidden size #25101
Conversation
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Code Review
This pull request introduces support for models with varying hidden sizes across layers by adding UniformTypeKVCacheSpecs. This is a valuable feature, particularly for speculative decoding and certain sparse attention models. The changes are well-structured, with necessary refactoring in the model runner and the addition of relevant tests. I have identified one issue where a configuration option is not being applied in the new code path, which could lead to incorrect behavior.
```python
num_blocks = available_memory // kv_cache_groups[
    0].kv_cache_spec.page_size_bytes
```
The num_gpu_blocks_override configuration is not being respected in this new code path for UniformTypeKVCacheSpecs. The calculation for num_blocks should use the get_num_blocks helper function, similar to the else branch, to ensure that user-provided overrides are applied correctly.
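The suggestion below switches to such a helper. As a rough sketch of the intended behavior (assumed, not necessarily vLLM's exact `get_num_blocks`, and assuming the override lives at `vllm_config.cache_config.num_gpu_blocks_override`):

```python
def get_num_blocks(vllm_config, num_layers, available_memory, page_size):
    """Sketch of an override-aware block-count helper (illustrative only)."""
    # Memory-based estimate: how many pages (covering num_layers layers) fit.
    num_blocks = int(available_memory // page_size // num_layers)
    num_blocks = max(num_blocks, 0)
    # Assumption: a user-provided num_gpu_blocks_override takes precedence
    # over the memory-based estimate.
    override = vllm_config.cache_config.num_gpu_blocks_override
    if override is not None:
        num_blocks = override
    return num_blocks
```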
Suggested change:

```python
page_size = kv_cache_groups[0].kv_cache_spec.page_size_bytes
num_blocks = get_num_blocks(vllm_config,
                            1,
                            available_memory=available_memory,
                            page_size=page_size)
```

Diff context for the next comment:

```python
    attn_backend,
)

key = attn_backend.full_cls_name()
```
@LucasWilkinson Why didn't we add kv_cache_spec as a key when splitting kv cache groups into attention groups?
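For illustration only, a minimal sketch of what keying attention groups by both the backend and the layer's KV cache spec could look like (hypothetical helper and callables; specs assumed hashable):

```python
from collections import defaultdict

def split_into_attn_groups(layer_names, get_backend, get_kv_cache_spec):
    """Group layers by (backend class, kv_cache_spec) -- hypothetical sketch."""
    groups = defaultdict(list)
    for name in layer_names:
        backend = get_backend(name)
        spec = get_kv_cache_spec(name)
        # Keying on the spec as well means layers with different hidden sizes
        # land in different attention groups with their own metadata builders.
        groups[(backend.full_cls_name(), spec)].append(name)
    return dict(groups)
```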
Diff context for the next comment:

```python
# All workers have the same kv_cache_config except layer names, so use
# an arbitrary one to initialize the scheduler.
assert all([
```
This assert is in `generate_scheduler_kv_cache_config` now.
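Roughly, such a helper checks that the per-worker configs agree and hands one of them to the scheduler. A hedged sketch with a hypothetical signature (the real consistency check may compare more than the block count):

```python
def generate_scheduler_kv_cache_config(kv_cache_configs):
    """Pick one worker's KV cache config for the scheduler (illustrative sketch)."""
    # All workers are expected to produce the same config except for layer
    # names, so verify they agree and then use an arbitrary one.
    assert all(
        cfg.num_blocks == kv_cache_configs[0].num_blocks
        for cfg in kv_cache_configs
    ), "workers disagree on the number of KV cache blocks"
    return kv_cache_configs[0]
```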
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
### What this PR does / why we need it?
Follows up the `UniformTypeKVCacheSpecs` changes introduced by vllm-project/vllm#25101, which support different hidden sizes in uniform-type KV cache specs. This also fixes the CI issue `TypeError: AttentionGroup.__init__() missing 1 required positional argument: 'kv_cache_spec'`.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Tests passed with existing e2e tests.
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@c60e613

Signed-off-by: MengqingCao <cmq0113@163.com>
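The CI failure above is the usual symptom of a dataclass gaining a new required field while downstream call sites still use the old constructor; a minimal, self-contained illustration (class and field names are a stand-in, not vLLM's actual definition):

```python
from dataclasses import dataclass

@dataclass
class AttentionGroup:  # illustrative stand-in, not vLLM's actual class
    backend: object
    layer_names: list
    kv_cache_spec: object  # newly required field added by the upstream change

# Call sites written before the change omit the new argument and now fail:
#   AttentionGroup(backend, ["layer.0"])
#   -> TypeError: AttentionGroup.__init__() missing 1 required positional
#      argument: 'kv_cache_spec'
group = AttentionGroup(backend=None, layer_names=["layer.0"], kv_cache_spec=None)
```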
Purpose
The hybrid allocator requires all layers to have the same kv_hidden_size, but some models break this assumption, e.g., speculative decoding setups and certain sparse attention models.
If the model has only one type of kv cache (e.g., all layers are full attention, or all layers are sliding window with the same window size), KVCacheManager can treat the model as having only one layer of that type (i.e., a single KVCacheGroup containing all layers).
This PR supports this kind of KVCacheGroup by introducing `UniformTypeKVCacheSpecs`. The model runner is updated to support different kv_cache_specs within one group: the layers are split into different attn_groups and get different attention metadata builders based on their kv_cache_spec.
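As a rough sketch of the concept (hypothetical classes, not the exact vLLM definitions): a uniform-type spec wraps per-layer specs that share one attention type but may differ in hidden size, and exposes a single combined page size to the allocator:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerKVCacheSpec:  # hypothetical per-layer spec
    page_size_bytes: int

@dataclass
class UniformTypeKVCacheSpecs:  # hypothetical sketch, not the exact vLLM class
    """Spec for a KV cache group whose layers share one attention type
    (e.g. all full attention) but may have different hidden sizes."""
    kv_cache_specs: dict  # layer name -> per-layer spec

    @property
    def page_size_bytes(self) -> int:
        # The allocator sees a single page size: the sum over all layers,
        # so one block holds one page of KV cache for every layer in the group.
        return sum(s.page_size_bytes for s in self.kv_cache_specs.values())

# Example: two full-attention layers with different KV hidden sizes.
specs = UniformTypeKVCacheSpecs(kv_cache_specs={
    "layer.0": LayerKVCacheSpec(page_size_bytes=16384),
    "layer.1": LayerKVCacheSpec(page_size_bytes=8192),
})
assert specs.page_size_bytes == 24576
```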
Test Plan
Initialize the model with
and run basic.py
Test Result
Can generate meaningful results.
Fix #22432
Essential Elements of an Effective PR Description Checklist
- (Optional) Documentation update, such as `supported_models.md` and `examples` for a new model.