[Model] Re-support MotifForCausalLM #27396
base: main
Conversation
This reverts commit 3125d79. Signed-off-by: WyldeCat <skan1543@gmail.com>
Signed-off-by: WyldeCat <skan1543@gmail.com>
Signed-off-by: WyldeCat <skan1543@gmail.com>
Documentation preview: https://vllm--27396.org.readthedocs.build/en/27396/
Code Review
This pull request re-introduces support for the Motif model, which involves adding a new PolyNorm layer with its corresponding CUDA kernel, and a GroupedDifferentialAttention backend. The implementations for PolyNorm and the new model architecture appear to be correct and follow best practices. However, I've identified a critical issue in the caching logic within the new GroupedDifferentialAttentionBackend that could lead to incorrect behavior and needs to be addressed.
if (
    self.kv_sharing_target_layer_name is None
    and key is not None
    and value is not None
):
    # Reshape the input keys and values and store them in the cache.
    # Skip this if sharing KV cache with an earlier attention layer.
    # NOTE(woosuk): Here, key and value are padded while slot_mapping is
    # not padded. However, we don't need to do key[:num_actual_tokens]
    # and value[:num_actual_tokens] because the reshape_and_cache_flash
    # op uses the slot_mapping's shape to determine the number of
    # actual tokens.
    reshape_and_cache_flash(
        key,
        value,
        key_cache,
        value_cache,
        attn_metadata.slot_mapping,
        self.kv_cache_dtype,
        layer._k_scale,
        layer._v_scale,
    )
The forward_single_attention method includes a block for caching key and value tensors. This is problematic because the main forward method already performs the necessary caching for the Grouped Differential Attention (GDA) splits (k1, v1 and k2, v2) using the populate_kv_cache method before forward_single_attention is ever called.
This leads to two significant issues:
- Redundant Caching: The same key-value pairs are cached multiple times, which is inefficient.
- Incorrect Caching: For cross-split attention computations (e.g., Attn(q1, K1, V2)), forward_single_attention is invoked with mismatched key and value tensors (like k1 and v2). The reshape_and_cache_flash call within this method is not designed to handle such cases and will likely corrupt the cache state.
The caching logic should be centralized within the forward method. The forward_single_attention method should then only be responsible for the attention computation, using the already populated cache, without performing any caching itself.
To resolve this, the caching block within forward_single_attention should be removed. The key and value arguments are still necessary for other parts of the function (e.g., descale_shape calculation), so they should remain in the function signature.
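To make the suggestion concrete, here is a structural sketch of the proposed split of responsibilities. The class name, method names, and signatures are simplified and may not match the PR exactly.

# Hedged structural sketch: caching is centralized in forward(), and
# forward_single_attention() only reads the already-populated cache.
class GroupedDifferentialAttentionImplSketch:
    def forward(self, layer, q1, q2, k1, k2, v1, v2, kv_cache, attn_metadata):
        # Cache each GDA split exactly once, up front.
        self.populate_kv_cache(layer, k1, v1, kv_cache, attn_metadata)
        self.populate_kv_cache(layer, k2, v2, kv_cache, attn_metadata)

        # The attention calls below only read from the cache, so passing
        # mismatched key/value pairs (e.g. k1 with v2) for cross-split
        # attention can no longer corrupt it.
        out_11 = self.forward_single_attention(layer, q1, k1, v1, kv_cache, attn_metadata)
        out_12 = self.forward_single_attention(layer, q1, k1, v2, kv_cache, attn_metadata)
        ...  # remaining splits and output combination per the GDA formulation

    def forward_single_attention(self, layer, query, key, value, kv_cache, attn_metadata):
        # key/value stay in the signature (e.g. for descale_shape), but the
        # reshape_and_cache_flash block is removed from this method.
        ...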
Hi @jeejeelee, I hope you’re doing well!
@@ -0,0 +1,744 @@
# SPDX-License-Identifier: Apache-2.0
@DarkLight1337 Who can review this attention backend?
Purpose
Changes
- PolyNorm layer #27110
- New model for:
https://huggingface.co/Motif-Technologies/Motif-2-12.7B-Base
https://huggingface.co/Motif-Technologies/Motif-2.6b-v1.1-LC
https://huggingface.co/Motif-Technologies/Motif-2.6B
Co-author: @ca1207
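For context, here is a hedged usage sketch of loading one of the checkpoints listed above with vLLM's offline API once this PR lands; flags such as trust_remote_code are assumptions and may differ from the final supported configuration.

# Hedged usage sketch: serving a Motif checkpoint offline with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Motif-Technologies/Motif-2.6B", trust_remote_code=True)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)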
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.