[Bugfix] guard missing attn_metadata in KV scales path #24290
Conversation
Signed-off-by: alexchiu <qiuzhaopeng@foxmail.com>
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Cc @LucasWilkinson for another set of eyes
```diff
        if self.calculate_kv_scales:
            attn_metadata = get_forward_context().attn_metadata
-           if attn_metadata.enable_kv_scales_calculation:
+           if (attn_metadata is not None and getattr(
```
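The added line is cut off above; as a rough sketch, the guard it introduces amounts to something like the following hypothetical helper (this exact function is not in vLLM, and the getattr default is an assumption):

```python
# Hypothetical helper, for illustration only; the PR's actual change is an
# inline check inside Attention.forward.
def kv_scales_calculation_enabled(attn_metadata) -> bool:
    # Missing metadata (e.g. during warmup/dummy runs) is treated as disabled.
    return attn_metadata is not None and getattr(
        attn_metadata, "enable_kv_scales_calculation", False
    )
```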
@zpqiu one question: if enable_kv_scales_calculation=True but not set during compilation, wouldn't attention metadata possibly be None during the profile_run (which also triggers compilation), so the graph is compiled without this, meaning it never runs even if calculation is enabled later?
For piecewise cudagraphs, the dummy run runs without attention metadata: https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_model_runner.py#L2535
So the initial compile run has no attention metadata. This means that yes, the graph will get compiled without this, and it will be wrong later on.
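A toy, plain-Python analogy of the capture-then-replay behavior described here (not vLLM code): the branch is evaluated once at capture time, so anything built while attn_metadata is None keeps skipping the KV-scale step on later real inputs.

```python
# Toy illustration only: "capture" decides the op list once, "replay" never
# re-evaluates the metadata branch, mirroring a graph compiled during the
# dummy/profile run where attn_metadata is None.
def build_captured_step(attn_metadata_at_capture):
    ops = []
    if (attn_metadata_at_capture is not None
            and attn_metadata_at_capture.get("enable_kv_scales_calculation", False)):
        ops.append(lambda state: state.setdefault("kv_scales", "computed"))
    ops.append(lambda state: state.setdefault("output", "attention done"))

    def replay(state):
        for op in ops:
            op(state)
        return state

    return replay


step = build_captured_step(None)  # dummy/profile run: no attention metadata
print(step({}))  # {'output': 'attention done'} -- KV scales are never computed
```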
My understanding is that the kv scales get computed on the first real input only and are then used in subsequent inputs.
To actually fix this, I think what we need is that the first real input should run without torch.compile and CUDAGraphs. All subsequent inputs should run with torch.compile and CUDAGraphs.
Then we need to actually make sure the torch.compile'd graph includes the kv scales.
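A minimal sketch of that suggestion, purely illustrative and not what this PR (or vLLM) implements: dispatch the first real batch to an eager path so the scales get computed, then switch to the compiled path.

```python
# Illustration only: run the first real batch eagerly so KV scales get
# computed, then use the compiled/CUDA-graph path for later batches.
class KVScaleAwareRunner:
    def __init__(self, eager_fn, compiled_fn):
        self.eager_fn = eager_fn
        self.compiled_fn = compiled_fn
        self.kv_scales_ready = False

    def step(self, batch):
        if not self.kv_scales_ready:
            out = self.eager_fn(batch)   # eager: the KV-scale branch actually runs
            self.kv_scales_ready = True
            return out
        return self.compiled_fn(batch)   # compiled graph uses the already-computed scales


runner = KVScaleAwareRunner(lambda b: f"eager({b})", lambda b: f"compiled({b})")
print(runner.step("batch0"))  # eager(batch0)
print(runner.step("batch1"))  # compiled(batch1)
```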
> For piecewise cudagraphs, the dummy run runs without attention metadata: https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_model_runner.py#L2535
> So the initial compile run has no attention metadata. This means that yes, the graph will get compiled without this, and it will be wrong later on.
Thanks for pointing this out; you're right. I printed the Q/K/V scale values in the forward() function of vllm/v1/attention/backends/flash_attn.py, and they are all the default 1.0, which suggests the dynamic scale computation didn't take effect.
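For reference, a debug probe along these lines can surface that behavior (the scale attribute names are assumptions for illustration, not taken from the PR):

```python
# Hypothetical debug probe; attribute names are assumed, not from the PR.
def debug_print_kv_scales(layer) -> None:
    # Scales still at the default 1.0 indicate the dynamic calculation never ran.
    print(
        f"q_scale={float(layer._q_scale)} "
        f"k_scale={float(layer._k_scale)} "
        f"v_scale={float(layer._v_scale)}"
    )
```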
> My understanding is that the kv scales get computed on the first real input only and are then used in subsequent inputs.
> To actually fix this, I think what we need is that the first real input should run without torch.compile and CUDAGraphs. All subsequent inputs should run with torch.compile and CUDAGraphs.
> Then we need to actually make sure the torch.compile'd graph includes the kv scales.
Got it—I’ll try that approach. I’ll first sort out the profiling run logic.
Just want to clarify the question before merging
I don't think this works, please see comment
Another PR related to KV scales. Can anyone help review it? #23906
See #21640 for info and to stay updated on the fix.
Purpose
Fix a bug in Attention.forward where `attn_metadata.enable_kv_scales_calculation` was read unconditionally. During warmup/eager/graph-capture paths the metadata can be `None`, which caused failures both with `enforce_eager=False` and with `enforce_eager=True`.

This change treats missing/invalid metadata as disabled (`False`) for KV scale calculation, avoiding data-dependent control flow and `None` access. When metadata is properly constructed and injected via FlashAttention metadata, KV scale calculation remains enabled.

Resolves: #21640
Test Plan
Manual verification with FP8 KV cache and `calculate_kv_scales=True`; a hedged launch sketch is shown below.
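A hedged example of such a manual check (the model name and prompt are placeholders, not taken from the PR):

```python
# Hypothetical manual check: FP8 KV cache with dynamic KV-scale calculation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: any model with FP8 KV-cache support
    kv_cache_dtype="fp8",
    calculate_kv_scales=True,
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```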
Test Result

Before this change: reading `attn_metadata.enable_kv_scales_calculation` failed when `attn_metadata` was `NoneType`.

After this change: missing metadata is treated as KV scale calculation disabled, so the failure no longer occurs.