[BugFix] Fix ChunkedLocalAttention when the hybrid kv-cache is disabled #21707
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests runs to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Code Review
This PR provides a temporary fix for an issue with chunked local attention when the hybrid KV cache is disabled, which seems to work based on the test results. However, I've found a critical issue in the implementation of the fix that could lead to incorrect behavior for models with multiple KV cache groups. The fix is applied inside a loop and could overwrite correct metadata with incorrect data depending on the processing order of the groups. I've provided a suggestion to correct this.
```python
if self.attention_chunk_size is not None \
    and self.scheduler_config.disable_hybrid_kv_cache_manager:
    if not hasattr(self, "local_attention_layers"):
        self.local_attention_layers = []
        attn_layers = get_layers_from_vllm_config(
            self.vllm_config, Attention)
        for layer_name, attn_module in attn_layers.items():
            if attn_module.use_irope:
                self.local_attention_layers.append(layer_name)

    local_attn_metadata_i = (builder.build(
        common_prefix_len=0,
        common_attn_metadata=make_local_attention_virtual_batches(
            self.attention_chunk_size, common_attn_metadata,
            self.cache_config.block_size),
    ))

    for layer_name in self.local_attention_layers:
        attn_metadata[layer_name] = local_attn_metadata_i
```
This temporary fix has a potential correctness issue. It's placed inside the loop over kv_cache_groups. In each iteration, it rebuilds metadata and overwrites it for all layers in self.local_attention_layers.
If there are multiple kv_cache_groups (e.g., one for full attention and one for local attention), the metadata for local attention layers will be determined by the context of the last group processed in the loop. If the last group is not the local attention group, the metadata will be incorrect. This could lead to correctness issues depending on the model architecture and group processing order.
A safer approach is to scope this logic to only execute for the kv_cache_group that corresponds to chunked local attention. This also simplifies the code by removing the need to build and maintain self.local_attention_layers.
Here is a suggested implementation that addresses this:
```diff
-if self.attention_chunk_size is not None \
-    and self.scheduler_config.disable_hybrid_kv_cache_manager:
-    if not hasattr(self, "local_attention_layers"):
-        self.local_attention_layers = []
-        attn_layers = get_layers_from_vllm_config(
-            self.vllm_config, Attention)
-        for layer_name, attn_module in attn_layers.items():
-            if attn_module.use_irope:
-                self.local_attention_layers.append(layer_name)
-    local_attn_metadata_i = (builder.build(
-        common_prefix_len=0,
-        common_attn_metadata=make_local_attention_virtual_batches(
-            self.attention_chunk_size, common_attn_metadata,
-            self.cache_config.block_size),
-    ))
-    for layer_name in self.local_attention_layers:
-        attn_metadata[layer_name] = local_attn_metadata_i
+if (self.attention_chunk_size is not None
+        and self.scheduler_config.disable_hybrid_kv_cache_manager
+        and isinstance(kv_cache_group_spec.kv_cache_spec,
+                       ChunkedLocalAttentionSpec)):
+    # This group is for chunked local attention.
+    # Rebuild its metadata with make_local_attention_virtual_batches
+    # and common_prefix_len=0.
+    local_attn_metadata_i = builder.build(
+        common_prefix_len=0,
+        common_attn_metadata=make_local_attention_virtual_batches(
+            self.attention_chunk_size, common_attn_metadata,
+            self.cache_config.block_size),
+    )
+    for layer_name in kv_cache_group_spec.layer_names:
+        attn_metadata[layer_name] = local_attn_metadata_i
```
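To make the hazard concrete, here is a minimal, self-contained toy, not vLLM's real classes, showing how the unconditional overwrite inside the group loop leaves the local-attention layers with metadata built in the context of whichever group happens to be processed last:

```python
# Toy illustration of the overwrite hazard described above.
# All names here are illustrative, not vLLM's actual types.
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    layer_names: list


local_layers = ["layer.0", "layer.1"]       # chunked local attention layers
groups = [
    Group("local", ["layer.0", "layer.1"]),
    Group("full", ["layer.2", "layer.3"]),  # processed last in the loop
]

attn_metadata = {}
for group in groups:
    metadata = f"metadata built in context of group '{group.name}'"
    # Buggy pattern: on every iteration, overwrite the local layers'
    # metadata regardless of which group is currently being processed.
    for layer_name in local_layers:
        attn_metadata[layer_name] = metadata

# The local layers end up with metadata from the *last* group processed:
print(attn_metadata["layer.0"])  # metadata built in context of group 'full'
```

Scoping the rebuild to the group whose spec is `ChunkedLocalAttentionSpec`, as in the suggestion above, removes this ordering dependence entirely.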
@yeqcharlotte has imported this pull request. If you are a Meta employee, you can view this in D79067990.
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.

Purpose
Temporary fix for ChunkedLocalAttention when the hybrid kv-cache is disabled, while we work on #21588.
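For context, the affected path is reached by serving a chunked-local-attention (iRoPE) model with the hybrid KV-cache manager turned off. A minimal sketch, assuming the offline `LLM` API forwards a `disable_hybrid_kv_cache_manager` engine arg and using an illustrative model name:

```python
# Minimal sketch, not taken from this PR: exercise chunked local attention
# with the hybrid KV-cache manager disabled.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # illustrative iRoPE model
    disable_hybrid_kv_cache_manager=True,  # assumed to map to SchedulerConfig
)
out = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```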
Test Plan
lm_eval
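The exact lm_eval command is not shown in the PR; a hedged sketch of this kind of accuracy check via the harness's Python API (the task, few-shot count, and model are assumptions):

```python
# Illustrative only: run a small accuracy check through lm-evaluation-harness
# against vLLM, to compare runs with and without the hybrid KV-cache manager.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=("pretrained=meta-llama/Llama-4-Scout-17B-16E-Instruct,"
                "disable_hybrid_kv_cache_manager=True"),  # assumed engine arg
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```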
Test Result
Main - Hybrid KV-cache
Main - No-Hybrid KV-cache
This PR
(Optional) Documentation Update