[Model][Qwen3VL] Compute cu_seqlens on CPU to remove
#26496
Conversation
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Force-pushed from f0bd894 to 2f5ab2a
Code Review
This pull request aims to improve performance by making tensor copies non-blocking and moving cu_seqlens computation to the CPU. The changes are generally well-implemented and align with the goal of reducing synchronization overhead. However, there is a critical issue where a tensor is moved to the GPU without reassigning the result to the variable, which will lead to either a runtime error or negate the intended performance benefit. I've left a specific comment with a suggested fix.
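The issue flagged here is a common PyTorch pitfall: `Tensor.to(device)` returns a new tensor rather than moving the tensor in place, so calling it without reassigning the result is a silent no-op. Since torch may not be installed in every environment, the sketch below illustrates the semantics with a minimal stand-in class; `FakeTensor` and its attributes are hypothetical names, not part of the PR.

```python
class FakeTensor:
    """Minimal stand-in mimicking torch.Tensor device semantics."""

    def __init__(self, device):
        self.device = device

    def to(self, device, non_blocking=False):
        # Like torch.Tensor.to, this returns a NEW object; self is unchanged.
        return FakeTensor(device)


t = FakeTensor("cpu")
t.to("cuda")        # result discarded: t is still on "cpu" (the flagged bug)
print(t.device)     # cpu
t = t.to("cuda")    # reassignment is required for the move to take effect
print(t.device)     # cuda
```

With the real API the failure mode is the same: the un-reassigned copy either leaves the tensor on the wrong device (causing a runtime device-mismatch error) or silently keeps the blocking path the PR was trying to remove.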
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Purpose
The current code does 5 `cudaStreamSynchronize` calls (3 syncs due to the CPU-GPU weight copy and 2 syncs inside `torch.repeat_interleave`), which isn't ideal. This PR removes the blocking weight copy and moves the `cu_seqlens` computation to the CPU, which is faster and doesn't require any CPU-GPU syncs.
Before:
After:
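The CPU-side computation can be sketched in pure Python. For Qwen-style vision models, each image/video is described by a `(t, h, w)` grid and each of its `t` frames contributes `h * w` patch tokens; `cu_seqlens` is the cumulative sequence-boundary vector over those frames (equivalent to a `repeat_interleave(h * w, t).cumsum(0)` with a leading zero). This is an illustrative sketch, not the PR's actual code; the function name and layout are assumptions.

```python
def cu_seqlens_cpu(grid_thw):
    """Cumulative sequence lengths from per-item (t, h, w) grids.

    Each of the t frames of an item contributes h * w tokens, so the
    result mirrors torch.repeat_interleave(h * w, t).cumsum(0) with a
    leading 0, computed entirely on the CPU (no GPU sync needed).
    """
    cu = [0]
    for t, h, w in grid_thw:
        tokens_per_frame = h * w
        for _ in range(t):
            cu.append(cu[-1] + tokens_per_frame)
    return cu


# Two items: a 2-frame 2x3 grid and a single-frame 4x4 grid.
print(cu_seqlens_cpu([(2, 2, 3), (1, 4, 4)]))  # [0, 6, 12, 28]
```

Because the grids are tiny host-side metadata, doing this arithmetic on the CPU and uploading the small result avoids the device-side `repeat_interleave`/`cumsum` kernels and their synchronization points.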
Test Plan
On a single H100
Test Result
Before:
After:
This reduces Mean TTFT by 1-2%.