3 changes: 2 additions & 1 deletion vllm/attention/layer.py
@@ -241,7 +241,8 @@ def forward(
         """
         if self.calculate_kv_scales:
             attn_metadata = get_forward_context().attn_metadata
-            if attn_metadata.enable_kv_scales_calculation:
+            if (attn_metadata is not None and getattr(
Collaborator:
@zpqiu one question: if enable_kv_scales_calculation=True but not set during compilation, wouldn't attention metadata possibly be None during the profile_run (which also triggers compilation) and then the graph is compiled without this, meaning it never runs even if later calculation is enabled?

Collaborator:
for piecewise cudagraphs, the dummy run runs without attention metadata: https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_model_runner.py#L2535

So the initial compile run has no attention metadata. This means that, yes, the graph will get compiled without this check and will be wrong later on.

@zou3519 (Collaborator), Sep 8, 2025:
My understanding is that the kv scales get computed on the first real input only and are then used in subsequent inputs.

To actually fix this, I think what we need is that the first real input should run without torch.compile and CUDAGraphs. All subsequent inputs should run with torch.compile and CUDAGraphs.

Then we need to actually make sure the torch.compile'd graph includes the kv scales.
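The split zou3519 proposes could be sketched as a thin dispatcher. This is illustrative only; `FirstCallEagerRunner` and both callables are hypothetical names, not vLLM APIs:

```python
# Hedged sketch of the proposed fix: run the first real input eagerly so the
# KV-scale calculation actually executes, then send all subsequent inputs
# through the compiled/CUDAGraph path. All names here are hypothetical.
class FirstCallEagerRunner:
    def __init__(self, eager_fn, compiled_fn):
        self.eager_fn = eager_fn        # path where KV scales get computed
        self.compiled_fn = compiled_fn  # torch.compile'd / graph-captured path
        self.saw_real_input = False

    def __call__(self, *args, **kwargs):
        if not self.saw_real_input:
            self.saw_real_input = True
            return self.eager_fn(*args, **kwargs)
        return self.compiled_fn(*args, **kwargs)
```

The remaining work zou3519 notes — making sure the compiled graph itself reads the updated scales — is not captured by this dispatcher and would still need to be handled separately.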

Author:

> for piecewise cudagraphs, the dummy run runs without attention metadata: https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_model_runner.py#L2535
>
> So the initial compile run has no attention metadata. This means that yes the graph will get compiled without this and it will be wrong later on

Thanks for pointing this out; you're right. I printed the QKV scale values in the forward() function of vllm/v1/attention/backends/flash_attn.py, and they are all the default 1.0, which suggests the dynamic scale computation didn't take effect.

Author:

> My understanding is that the kv scales get computed on the first real input only and are then used in subsequent inputs.
>
> To actually fix this, I think what we need is that the first real input should run without torch.compile and CUDAGraphs. All subsequent inputs should run with torch.compile and CUDAGraphs.
>
> Then we need to actually make sure the torch.compile'd graph includes the kv scales.

Got it; I'll try that approach. I'll first sort out the profiling-run logic.

+                    attn_metadata, "enable_kv_scales_calculation", False)):
                 self.calc_kv_scales(query, key, value)
         if self.use_output:
             output_shape = (output_shape
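For reference, the guard this diff introduces can be isolated as a small helper. This is a sketch, not the actual vLLM code, and `should_calc_kv_scales` is a hypothetical name:

```python
# Sketch of the None-safe check from the diff above. attn_metadata can be
# None during dummy/profile runs, and metadata objects that never set the
# flag should default to False rather than raise AttributeError.
def should_calc_kv_scales(attn_metadata) -> bool:
    return attn_metadata is not None and getattr(
        attn_metadata, "enable_kv_scales_calculation", False)
```

Note that `getattr` with a default makes the check robust to backends whose metadata class lacks the attribute, while the explicit `is not None` test covers the dummy-run case the reviewers discuss above.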