[Core] Separate out attention metadata building logic from prepare inputs #26764
Conversation
Force-pushed from 3bdfe10 to e48ab54
fhl2000 left a comment:
Hi @LucasWilkinson, thank you for refactoring this! Just left one question, otherwise looking great.
Force-pushed from db8f24c to e9387ea
fhl2000 left a comment:
Overall looks good! Any chance to also separate the attn_metadata building logic into a function for the Eagle drafter, so we can later reuse it for both the drafter execution and its dummy run (in preparation for full CUDA graph support)?
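A minimal sketch of the suggested shape (hypothetical names, not the actual Eagle drafter code): one helper builds the drafter's attention metadata, and both the real draft step and a dummy run call it.

```python
def build_drafter_attn_metadata(num_tokens: int, num_reqs: int) -> dict:
    """Single place that builds the drafter's attention metadata (toy stand-in)."""
    return {"num_tokens": num_tokens, "num_reqs": num_reqs}


def run_drafter(num_tokens: int, num_reqs: int) -> dict:
    # Real drafter execution path.
    attn_metadata = build_drafter_attn_metadata(num_tokens, num_reqs)
    # ... run the Eagle draft model with attn_metadata ...
    return attn_metadata


def dummy_run_drafter(padded_num_tokens: int) -> dict:
    # Same helper with a padded shape, which full CUDA graph capture would need.
    return build_drafter_attn_metadata(padded_num_tokens, padded_num_tokens)
```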
Force-pushed from 6d1a1d4 to 4787566
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Force-pushed from 2814f5f to 651b9de
mgoin left a comment:
LGTM, nothing clearly stood out as missed
[Core] Separate out attention metadata building logic from prepare inputs (vllm-project#26764)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Preparatory refactor PR for: #24002 (#23789)
Separate the attention metadata building logic out of prepare inputs so we can reuse it for dummy runs, and so we can set up padding for CUDA graphs before building attention metadata (see #23789 for motivation).
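A rough sketch of the intended shape (illustrative names only, not the actual GPUModelRunner API): attention metadata building becomes its own method that both the prepare-inputs path and dummy runs call, so a dummy run can build metadata for a padded batch shape without any real inputs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMetadataBuilder:
    """Stand-in for a per-KV-cache-group attention metadata builder."""
    name: str

    def build(self, num_tokens: int, num_reqs: int) -> dict:
        return {"backend": self.name, "num_tokens": num_tokens, "num_reqs": num_reqs}


@dataclass
class ToyRunner:
    builders: list = field(default_factory=lambda: [ToyMetadataBuilder("flash")])

    def _build_attn_metadata(self, num_tokens: int, num_reqs: int) -> dict:
        # Depends only on the batch shape, so it can be called before (or
        # without) real inputs -- exactly what a dummy run needs.
        return {gid: b.build(num_tokens, num_reqs)
                for gid, b in enumerate(self.builders)}

    def prepare_inputs(self, num_scheduled_tokens: int, num_reqs: int) -> dict:
        # ... gather token ids, positions, slot mappings, etc. ...
        return self._build_attn_metadata(num_scheduled_tokens, num_reqs)

    def dummy_run(self, padded_num_tokens: int) -> dict:
        # Reuses the same builder with a padded shape, e.g. for CUDA graph capture.
        return self._build_attn_metadata(padded_num_tokens, padded_num_tokens)
```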
Moving `_compute_cascade_attn_prefix_lens` out of the metadata building loops, since whether or not we are using cascade attention is needed by the cudagraph dispatcher, which will run before attention metadata building in #24002.

NOTE: renamed `kv_cache_group_id` to `kv_cache_gid` ("gid" is a common acronym in Linux) to make more lines fit within the formatter's max line width.
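A sketch of why the cascade check moves up front (again with illustrative names, not the real vLLM code): the cudagraph dispatcher planned in #24002 needs to know whether cascade attention is in use before any metadata is built, so the shared-prefix computation happens once, outside the per-group loop.

```python
def compute_cascade_attn_prefix_len(prefix_lens: list[int]) -> int:
    """Toy stand-in for _compute_cascade_attn_prefix_lens: return the shared
    prefix length if cascade attention should be used, else 0."""
    shared = min(prefix_lens, default=0)
    return shared if shared > 0 else 0


def build_all_attn_metadata(builders, prefix_lens: list[int], num_tokens: int) -> dict:
    # Decide cascade usage once, before the build loop (and before the
    # cudagraph dispatch that will precede metadata building in #24002).
    common_prefix_len = compute_cascade_attn_prefix_len(prefix_lens)
    use_cascade = common_prefix_len > 0

    attn_metadata = {}
    # kv_cache_gid: the identifier renamed from kv_cache_group_id in this PR.
    for kv_cache_gid, build in enumerate(builders):
        attn_metadata[kv_cache_gid] = build(num_tokens, common_prefix_len, use_cascade)
    return attn_metadata
```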