Fix non-contiguous input passed to Marlin kernel #15319
Conversation
Signed-off-by: Qubitium <qubitium@modelcloud.ai>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
I have made this same fix in a previous PR: #14946
LGTM, I prefer this to #14946, so that we don't subject other code paths without this requirement (namely block fp8) to a potentially slow copy.
I think you are right; this PR seems like a more general solution. A minimal sketch of the pattern under discussion follows.
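For reference, here is a minimal sketch of the guarded-copy pattern being discussed, kept local to the Marlin apply path so other code paths (e.g. block fp8) avoid the copy. The helper name `ensure_contiguous_for_marlin` is hypothetical; the actual change lives in marlin.py:

```python
import torch

def ensure_contiguous_for_marlin(x: torch.Tensor) -> torch.Tensor:
    # Marlin kernels assume a dense row-major layout. The explicit check
    # just makes the intent obvious: .contiguous() already returns the
    # tensor itself (no copy) when the layout is fine, so the common
    # path pays nothing.
    return x if x.is_contiguous() else x.contiguous()
```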
vllm/model_executor/layers/quantization/kernels/mixed_precision/marlin.py
isn't it prefix caching?
@youkaichao Is it prefix or prefill, or both? I am confused myself. Let me know which is the correct terminology to use in the comment and I will update. Prefix is the cache; prefill is the action that extracts the prefix cache. Is this correct?
The stacktrace shows it going through the prefill code path:
ERROR 03-22 02:12:54 [core.py:343] File "/root/vllm/vllm/v1/attention/backends/mla/common.py", line 929, in forward
ERROR 03-22 02:12:54 [core.py:343] output[num_decode_tokens:] = self._forward_prefill(
ERROR 03-22 02:12:54 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-22 02:12:54 [core.py:343] File "/root/vllm/vllm/v1/attention/backends/mla/common.py", line 826, in _forward_prefill
ERROR 03-22 02:12:54 [core.py:343] context_output, context_lse = self._compute_prefill_context( \
ERROR 03-22 02:12:54 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oh sorry, I just saw the code and thought "prefill caching" was a typo. I'm not familiar with the MLA code path though. cc @LucasWilkinson to confirm.
@youkaichao I have updated the comment. I believe you are correct.
I worry about this non-contiguous issue for other kernels, but for performance reasons you are right to keep the fix local to the kernel. We will just have to audit the other kernels. It would be good to write a test specifically for this, checking that quant kernels can accept non-contiguous input; a sketch follows.
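A hedged sketch of what such a test might look like. The stand-in `quant_apply` is hypothetical (a plain matmul keeps the sketch self-contained); a real test would parametrize over vLLM's registered quant kernels instead:

```python
import torch

def make_non_contiguous(rows: int, cols: int) -> torch.Tensor:
    # Column-slicing a wider tensor yields a non-contiguous view,
    # similar to the sliced activations the attention path can hand down.
    x = torch.randn(rows, 2 * cols)[:, :cols]
    assert not x.is_contiguous()
    return x

def test_quant_kernel_accepts_non_contiguous_input():
    weight = torch.randn(128, 64)

    def quant_apply(x: torch.Tensor) -> torch.Tensor:
        # Hypothetical stand-in for a quant kernel's apply() method.
        return x @ weight

    x = make_non_contiguous(16, 128)
    expected = quant_apply(x.contiguous())
    # The kernel must produce the same result on the non-contiguous view.
    torch.testing.assert_close(quant_apply(x), expected)
```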
Signed-off-by: Qubitium <qubitium@modelcloud.ai>
Force-pushed from 89a2cba to d3ea7fe
NOTE: Another (more performant) fix is in flight: #14658 (see #14658 (comment)).
This reverts commit d20e261.
Signed-off-by: Wes Medford <wryanmedford@gmail.com>
Revert "Fix non-contiguous input passed to Marlin kernel (vllm-project#15319)" (vllm-project#15398)
Signed-off-by: Wes Medford <wryanmedford@gmail.com>
Revert "Fix non-contiguous input passed to Marlin kernel (vllm-project#15319)" (vllm-project#15398)
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
Revert "Fix non-contiguous input passed to Marlin kernel (vllm-project#15319)" (vllm-project#15398)
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
With vLLM main + DeepSeek R1 GPTQModel-quantized model + Marlin + Triton MLA V1 + kv/prefill caching == crash because x (the input) is not contiguous.

Core cause: kv/prefill caching passes a non-contiguous x (input) down to lower kernels, and some kernels (Marlin) only operate on a contiguous memory layout.
Reproduce: execute the same prompt/request twice and it will crash with the stacktrace shown above.
I only fixed the Marlin code path. I am not sure whether other kernels are affected, or whether my fix should be moved further up the logic chain. A minimal illustration of the failure mode follows.
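A self-contained illustration, under assumed shapes, of how a non-contiguous view arises and why a guarded copy fixes it (this is a sketch of the failure mode, not the actual MLA code):

```python
import torch

# One common way a non-contiguous view arises: slicing columns out of a
# wider buffer. The view shares storage with the parent tensor, so its
# rows are not densely packed in memory.
output = torch.randn(8, 2 * 64)
x = output[:, :64]
print(x.is_contiguous())  # False

# A kernel that assumes a dense row-major layout (as Marlin does) would
# read the wrong elements here. The guarded copy is cheap on the common
# path, since .contiguous() returns the tensor itself when no copy is
# needed.
x = x if x.is_contiguous() else x.contiguous()
print(x.is_contiguous())  # True
```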
@mgoin