@LucasWilkinson
Collaborator

Improvement to address: vllm-project/vllm#18619 (comment)

When the combine kernel runs on a large batch that is almost entirely decodes plus a single prefill, the previous grid was excessively large, which made the combine kernel take a long time.

Before this PR, the grid size for combine was cdiv(max_seqlen_q * num_heads, kBlockM) x batch_size. After this PR, it is (cdiv(total_q * num_heads, kBlockM) + batch_size) x 1, which scales much better for large batches that are primarily made up of decodes.

For example, take a batch of 256 where the q_seqlens are [600] + [1] * 255, so max_seqlen_q = 600 and total_q = 600 + 255 = 855 (assuming num_heads = 8 and kBlockM = 8).

Before this PR the grid would be:
cdiv(600 * 8, 8) x 256 = 600 x 256 = 153600

After this PR the grid is:
(cdiv(855 * 8, 8) + 256) x 1 = 855 + 256 = 1111
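
The arithmetic above can be reproduced with a small host-side sketch. This is not the kernel launch code from the PR; the function names `combine_grid_before` and `combine_grid_after` are hypothetical, and only the two grid-size formulas and the example values are taken from the description.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def combine_grid_before(q_seqlens, num_heads, k_block_m):
    # Old formula: every batch entry is sized for the longest query sequence.
    max_seqlen_q = max(q_seqlens)
    return cdiv(max_seqlen_q * num_heads, k_block_m) * len(q_seqlens)


def combine_grid_after(q_seqlens, num_heads, k_block_m):
    # New formula: sized by the total number of query tokens, plus one
    # extra block per batch entry.
    total_q = sum(q_seqlens)
    return cdiv(total_q * num_heads, k_block_m) + len(q_seqlens)


# Example from the description: one 600-token prefill and 255 decodes.
q_seqlens = [600] + [1] * 255
before = combine_grid_before(q_seqlens, num_heads=8, k_block_m=8)
after = combine_grid_after(q_seqlens, num_heads=8, k_block_m=8)
print(before, after)  # 153600 1111
```

For this batch the new grid is roughly 138x smaller. The old formula grows with max_seqlen_q x batch_size, so a single long prefill inflates the grid for every decode in the batch, while the new formula grows with total_q + batch_size.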

Member

@tlrmchlsmth tlrmchlsmth left a comment

Code looks good to me. Optimization makes sense. Nice work.

Should we try to push this upstream?

@LucasWilkinson
Collaborator Author

Ya I'm going to make an upstream PR

LucasWilkinson and others added 11 commits June 16, 2025 18:04
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/varlen-combine-scheduler branch from 604050e to 566d676 Compare June 16, 2025 18:05
@LucasWilkinson LucasWilkinson merged commit 2c6bcfc into main Jun 16, 2025
1 check passed
zyongye pushed a commit to zyongye/flash-attention that referenced this pull request Aug 7, 2025
* varlen combine scheduler

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* cleanup

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* move check

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* standard scheduling algo

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* better heuristic

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* better comments

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* cleanup

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* cleanup

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* put in a more readable heurisitic

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* Apply suggestions from code review

Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* FA2 8.0 PTX (vllm-project#69)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

---------

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
LucasWilkinson added a commit that referenced this pull request Aug 7, 2025
jayhshah pushed a commit that referenced this pull request Aug 8, 2025
Signed-off-by: Jay Shah <jayhshah@gmail.com>