[Hardware][Intel GPU] Add V1 engine support and chunked_prefill kernel
#14612
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; instead, only a limited run is triggered. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
I'm attempting to use this PR, but every build I do ends up failing with "Failed to import from vllm._C", and then it claims it cannot detect a GPU.
Can you show me your build command and logs? Or can you build & run the XPU path without this PR?
I use the commands from the vLLM XPU installation instructions to build from source. First I set up a UV environment, then install the required packages from requirements/xpu.txt, then build and install vLLM with the XPU backend. After that I install IPEX 2.6. I run the application with: And it spits out:
@HumerousGorgon sorry, can you create an issue so we can discuss there? I don't want to clutter the code review here.
#14747
@youkaichao @WoosukKwon Can you or someone else take a review? Thanks!
/gemini review
Code Review
This is a comprehensive pull request that introduces V1 engine support for Intel GPUs, leveraging the chunked_prefill kernel from IPEX. The changes are well-structured, touching upon CI, Docker, requirements, core vLLM operations, platform-specific logic, and new V1 worker/runner components for XPU.
The PR description is excellent, providing clear rationale for design choices, performance benchmarks, and a roadmap for testing. This greatly aids in understanding the scope and impact of the changes.
Overall, the additions seem robust and well-integrated with the existing V1 architecture. I have a couple of points regarding the _ipex_ops.py and ipex_attn.py files that could enhance clarity and ensure correctness, detailed in the comments below. Great work on this significant feature enablement!
Summary of Findings
- Parameter usage in _ipex_ops.py::chunked_prefill: several parameters in the chunked_prefill wrapper seem unused or hardcoded when calling the underlying IPEX kernel. Clarification or simplification could improve maintainability.
- Type consistency for scale factors in ipex_attn.py: there is a potential type mismatch for the k_scale and v_scale parameters passed to reshape_and_cache_flash. While likely benign in the current non-FP8 context, confirming the IPEX op's behavior with tensor inputs for these float-expected parameters is advisable.
Merge Readiness
The pull request is substantial and introduces significant new functionality for Intel GPU support. The code is generally well-written and the rationale is clearly explained. The identified points are of medium severity and primarily relate to clarity and ensuring robustness of the interface with IPEX ops. Addressing these points would further strengthen the contribution.
Given the complexity and the note about ongoing UT development, thorough internal testing will be key. I am not authorized to approve pull requests, but based on this review, after addressing the comments, the PR appears to be in a good state for further testing and consideration by other reviewers for merging.
vllm/_ipex_ops.py
Outdated
The chunked_prefill function accepts several parameters (p_dropout, zero_tensors, return_softmax, gen_, seq_used_k) that are either hardcoded (e.g., p_dropout to 0.0, is_causal to True) or passed as None when calling ipex.llm.modules.PagedAttention.flash_attn_varlen_func.
Could you clarify if these parameters are intended as placeholders for future IPEX kernel capabilities, or to maintain API compatibility with other FlashAttention implementations?
If they are unlikely to be used by the IPEX backend, perhaps the signature could be simplified? Alternatively, adding comments explaining their fixed values or non-usage in the IPEX context might improve clarity for future maintainers. For example:
# p_dropout: float, # IPEX kernel does not support dropout, hence 0.0 is used.
# ... similar for other params
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated: kept only the used parameters and renamed chunked_prefill to flash_attn_varlen_func.
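For context, a rough sketch of what the slimmed-down wrapper could look like; the parameter set and the argument order of the underlying IPEX call are my assumptions based on the flash_attn_varlen_func v2-style API discussed in this PR, not the actual IPEX or vLLM code:

```python
# Rough sketch only: keep just the parameters the IPEX kernel consumes.
# The argument order of the underlying IPEX call is an assumption here;
# the real wrapper in vllm/_ipex_ops.py may differ.
from typing import Optional

import intel_extension_for_pytorch as ipex
import torch


def flash_attn_varlen_func(
        out: torch.Tensor,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        cu_seqlens_q: torch.Tensor,
        cu_seqlens_k: torch.Tensor,
        max_seqlen_q: int,
        max_seqlen_k: int,
        softmax_scale: float,
        is_causal: bool = True,
        block_table: Optional[torch.Tensor] = None,
        alibi_slopes: Optional[torch.Tensor] = None) -> None:
    # Dropout, RNG-state and softmax-return arguments from other
    # FlashAttention front ends are dropped: the IPEX kernel does not use them.
    ipex.llm.modules.PagedAttention.flash_attn_varlen_func(
        out, q, k, v, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
        softmax_scale, is_causal, block_table, alibi_slopes)
```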
The reshape_and_cache_flash function in vllm/_ipex_ops.py (which this calls) is defined with k_scale: float and v_scale: float. However, here layer._k_scale and layer._v_scale are passed, which are torch.Tensor objects.
Given the assertion on line 183 (layer._k_scale_float == 1.0 and layer._v_scale_float == 1.0), it implies that FP8 KV cache is not being used here, and these tensors likely contain the scalar value 1.0.
Does the underlying ipex.llm.modules.PagedAttention.reshape_and_cache_flash op correctly handle torch.Tensor inputs for scale factors, perhaps by implicitly converting tensor(1.0) to float(1.0), or are these scale factors ignored when not using an FP8-related kv_cache_dtype?
Ensuring type consistency or documenting the expected behavior by the IPEX op would be beneficial.
nice catch! updated.
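For readers skimming the thread, a small sketch of the type fix being discussed; the wrapper call and argument order follow the usual vLLM reshape_and_cache_flash convention and are assumptions here, the only point being the float scales instead of tensor attributes:

```python
# Sketch of the scale-type fix discussed above (not the exact diff).
# reshape_and_cache_flash is declared with k_scale: float / v_scale: float,
# so pass the pre-extracted scalar values rather than the tensor attributes.
import torch
from vllm._ipex_ops import ipex_ops


def write_kv_to_cache(layer, key: torch.Tensor, value: torch.Tensor,
                      key_cache: torch.Tensor, value_cache: torch.Tensor,
                      slot_mapping: torch.Tensor, kv_cache_dtype: str) -> None:
    ipex_ops.reshape_and_cache_flash(
        key, value, key_cache, value_cache, slot_mapping, kv_cache_dtype,
        layer._k_scale_float,  # scalar float, not layer._k_scale (a Tensor)
        layer._v_scale_float)  # scalar float, not layer._v_scale (a Tensor)
```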
@mgoin, we have resolved all Gemini review comments, thanks!
Thanks for the update, @xuechendi! Glad to hear the review comments have been addressed. Let me know if you need anything else.
Commits (squashed; each signed off by Kunshang Ji <kunshang.ji@intel.com>):
- some v1 fixes
- remove useless file
- remove
- add V1 test and set spawn in docker env
- add missing dependency
- fix test
- update api name
- update api
- update default block size for v1
- update memory usage
- fix rebase issues
- fix rebase, spec decode meta set to none
- add xpu v1 config check
- add mem log
- fix init cache
- add xpu profiler for V1
- update rebase issue
- update prepare_inputs for perf
- update
- refine xpu_model_runner
- …one by default. The modification involves adding a check to prevent potential null exceptions. (vllm-project#173)

Co-authored-by: yan <yan.ma@intel.com>
Co-authored-by: mayuyuace <qiming1.zhang@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Submitted a new PR #19560 to simplify.
This PR adds Intel GPU V1 engine support, leveraging the latest chunked_prefill kernel (same API as flash_attn_varlen_func v2) from IPEX 2.6. XPU V1 support is quite smooth; most code can be shared with the GPU side.
a. Introduce the new chunked prefill kernel implemented in IPEX.
b. Refine some XPU-related configs.
c. Add the IPEX_V1 attention backend based on the chunked prefill kernel.
d. Support vLLM V1 by adding xpu_worker/xpu_model_runner to handle the XPU code path.
e. Dockerfile / dependency updates.
f. Add a V1 test in CI.
I can split this into smaller pieces if necessary.
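For anyone who wants to try the feature described above, a minimal smoke-test sketch for exercising the V1 engine path on an XPU build of vLLM (the model name and sampling settings are placeholders chosen for illustration, not taken from this PR):

```python
# Minimal smoke-test sketch for the V1 engine on an XPU build of vLLM.
# The model name and sampling settings below are illustrative placeholders.
import os

os.environ["VLLM_USE_V1"] = "1"  # opt in to the V1 engine

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enforce_eager=True)
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Intel GPUs can run vLLM by"], params)
print(outputs[0].outputs[0].text)
```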
When we started supporting V1 XPU last year, we followed the V0 design: implement almost the same interface as the CUDA path and inherit from the CUDA path as much as possible. However, the CUDA path has some CUDA-specific function calls, like torch.tensor.cuda(), and some features, like CUDA graphs, are not supported on XPU.
The main differences are listed below (see the sketch after this list):
A) Init device, distributed env, and profiler: cannot use hard-coded "cuda" / "nccl".
B) A workaround for the issue where GPU memory metrics are incorrect on client GPUs.
C) XPUModelRunner init method: cannot use the parent's, since it has some CUDA-specific code like multi_processor_count.
D) XPUModelRunner prepare_inputs() method: we use flash attn v2, and some parameters differ from flash attn v3.
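A rough sketch of the inheritance pattern implied by A) and C); GPUModelRunner is the assumed CUDA-path base class, and the class/method bodies here are illustrative assumptions, not the actual code in this PR:

```python
# Hypothetical sketch: reuse the CUDA-path V1 runner and re-implement only the
# device-specific pieces. Bodies are illustrative, not the PR implementation.
import torch
from vllm.v1.worker.gpu_model_runner import GPUModelRunner


class XPUModelRunnerSketch(GPUModelRunner):

    def __init__(self, vllm_config, device: torch.device):
        # Difference C: re-implement __init__ rather than calling
        # super().__init__(), because the parent touches CUDA-only device
        # properties such as multi_processor_count.
        self.vllm_config = vllm_config
        self.device = device  # e.g. torch.device("xpu:0")


def init_xpu_device(device: torch.device) -> None:
    # Difference A: no hard-coded "cuda"/"nccl" strings; use the XPU runtime
    # (the distributed backend would likewise be an XPU-capable one).
    torch.xpu.set_device(device)
    torch.xpu.empty_cache()
```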
command I use: