[Attention] MLA with chunked prefill #12639
Conversation
vllm/engine/arg_utils.py (outdated)

      # For multimodal models and models with MLA, chunked prefill is
      # disabled by default in V0, but enabled by design in V1
-     if model_config.is_multimodal_model and model_config.use_mla:
+     if model_config.is_multimodal_model or model_config.use_mla:
ok yeah that makes sense for some of the red tests
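For readers following along, here is a minimal sketch of the behaviour the suggested change produces. This is not the actual vLLM code: the `model_config` attributes come from the diff above, and the V0/V1 split follows the comment in it. With `or`, chunked prefill defaults to off in V0 whenever the model is multimodal or uses MLA, whereas the earlier `and` only disabled it when both were true.

```python
def default_enable_chunked_prefill(model_config, use_v1: bool) -> bool:
    """Illustrative sketch only, not vLLM's implementation."""
    if use_v1:
        # Per the comment in the diff: chunked prefill is enabled by design in V1.
        return True
    # V0: disabled by default for multimodal models or models using MLA.
    # The original `and` only covered models that were *both* multimodal and MLA.
    if model_config.is_multimodal_model or model_config.use_mla:
        return False
    return True
```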
Hi @LucasWilkinson, thanks for your wonderful work!
      # Make sure there is enough for 8 full-length requests or at least
      # 4 pages of cache per request
      max(
          8 * self.model_config.max_model_len, 4 *
@LucasWilkinson A dumb question: how were these magic numbers (8 full-length requests & 4 pages) decided?
Arbitrarily; just whatever seemed reasonable. We want it so that under common loads there's enough workspace that we don't have to chunk the context, but not so large that we materially impact KV-cache space. Feel free to tune these and open a PR if you come up with better numbers; we'd appreciate the contribution!
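To make the heuristic concrete, here is a rough sketch of the sizing rule being discussed. The trailing `4 *` in the quoted snippet is truncated, so the `max_num_seqs` and `block_size` factors below are assumptions chosen to match the "4 pages of cache per request" comment, not the code in this PR.

```python
def chunked_prefill_workspace_tokens(max_model_len: int,
                                     max_num_seqs: int,
                                     block_size: int) -> int:
    """Illustrative sketch of the workspace-sizing heuristic, not this PR's code."""
    return max(
        # room for the full context of 8 maximum-length requests ...
        8 * max_model_len,
        # ... or at least 4 KV-cache pages (blocks) per concurrent request
        # (the factors after `4 *` are assumed; the snippet above is truncated).
        4 * max_num_seqs * block_size,
    )

# Example: max_model_len=4096, max_num_seqs=256, block_size=16
#   -> max(32768, 16384) = 32768 tokens of workspace
```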
Need to do more benchmarking to see if it makes sense for this to be on by default in V0, but it lays the groundwork for a V1 implementation. (#13111 may help performance.)
Shout-out to @pathorn for assisting with hardening this PR
Future work:
- self.kv_b_proj(kv_c_normed) in the profile run (see the sketch below)
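For context on that last item: in MLA, kv_c_normed is the normalized compressed KV latent, and kv_b_proj is the linear layer that up-projects it into per-head keys and values before attention, so the profile run needs to account for that extra work and activation memory. The sketch below is only illustrative; the dimension values are hypothetical (loosely modelled on DeepSeek-style MLA) and not taken from this PR.

```python
import torch

# Hypothetical MLA dimensions, for illustration only.
kv_lora_rank = 512        # width of the compressed KV latent (kv_c)
num_heads = 16
qk_nope_head_dim = 128    # per-head key width (non-RoPE portion)
v_head_dim = 128          # per-head value width
num_tokens = 8

# kv_b_proj: up-projection from the latent to per-head K (nope part) and V.
kv_b_proj = torch.nn.Linear(
    kv_lora_rank, num_heads * (qk_nope_head_dim + v_head_dim), bias=False)

kv_c_normed = torch.randn(num_tokens, kv_lora_rank)
kv = kv_b_proj(kv_c_normed).view(
    num_tokens, num_heads, qk_nope_head_dim + v_head_dim)
k_nope, v = kv.split([qk_nope_head_dim, v_head_dim], dim=-1)
print(k_nope.shape, v.shape)  # torch.Size([8, 16, 128]) torch.Size([8, 16, 128])
```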