[Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend. #19560
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Summary of Changes
Hello @jikunshang, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've added support for the vLLM V1 engine on Intel XPU hardware. This involves integrating the IPEX Flash Attention kernel, introducing new worker and model runner classes specifically for XPU under the V1 architecture, and updating various configuration and platform-specific checks. The changes also include necessary dependency and CI updates, and initial benchmarks indicate a significant performance improvement with the V1 engine on supported hardware.
Highlights
- Intel XPU V1 Engine Support: This PR introduces initial support for the vLLM V1 engine on Intel XPU (GPU) hardware, enabling the use of the newer architecture and features (a minimal usage sketch follows this list).
- IPEX Flash Attention Integration: The core of the V1 support involves integrating the `flash_attn_varlen_func` kernel (V2) provided by IPEX (Intel Extension for PyTorch) for optimized attention computation on XPU.
- New XPU Worker and Model Runner: Dedicated `XPUWorker` and `XPUModelRunner` classes are added under the `vllm.v1` namespace to handle the XPU-specific code path and configurations for the V1 engine.
- Configuration and Platform Updates: Updates are made to platform detection, configuration checking (including block size for V1, bf16 support nuances for different Intel GPUs, and distributed backend handling), and environment variable usage (`VLLM_WORKER_MULTIPROC_METHOD`).
- CI and Dependency Updates: A new test case for the V1 engine on XPU is added to the Buildkite CI script, and the `numba` dependency is pinned in `requirements/xpu.txt`.
- Performance Improvement: Initial benchmarks on PVC 1100 show a 1.55x throughput boost for the V1 engine compared to V0.
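For reviewers who want to try the new path locally, a minimal smoke test might look like the sketch below. This is illustrative only: the `VLLM_USE_V1` opt-in and the spawn requirement are taken from this PR's description, and the model name is just an example.

```python
import os

# Opt in to the V1 engine and use the spawn start method, as this PR requires
# for the multiprocessing executor on XPU. Set these before importing vllm.
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

# Any supported model works here; this one is just a small example.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```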
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. commenting on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request adds initial support for Intel GPUs (XPU) using the V1 engine, integrating the FlashAttention backend via IPEX. Key changes include:
- XPU V1 Engine Integration: New `XPUWorker` and `XPUModelRunner` classes handle XPU-specific code paths for the V1 engine.
- FlashAttention for XPU: `flash_attn_varlen_func` and `reshape_and_cache_flash` are now routed through `ipex_ops` for XPU.
- Configuration Updates: XPU-related configurations in `EngineArgs`, `Platform`, and Dockerfiles have been updated to support V1.
- CI: A new V1 test case for XPU has been added to the CI pipeline.
Overall, the changes look promising for enabling V1 features on Intel GPUs. I've identified a few areas for potential improvement:
- Typo: A minor typo in a parameter name in `_ipex_ops.py`.
- Unused Parameters: Some function parameters in `_ipex_ops.py` appear unused.
- Memory Estimation: The `FIXME` in `XPUWorker` regarding non-torch memory allocation for client GPUs is a concern for accurate memory management.
- Clarity and Robustness: Some logic in platform-specific configuration and error handling could be clarified or made more robust.
Please review the detailed comments. The performance improvements cited are significant, making this a valuable addition.
vllm/platforms/xpu.py (outdated)
The warning message "Please use spawn as start method if you want to use mp." is logged if VLLM_WORKER_MULTIPROC_METHOD is not 'spawn' when distributed_executor_backend is 'mp'. However, the code doesn't enforce or change the backend if this condition isn't met. This might lead to runtime issues if 'fork' is used, as hinted by the FIXME comment.
Consider either:
- More strongly enforcing 'spawn' by changing `distributed_executor_backend` if 'mp' is chosen with an incompatible start method (similar to how it was handled before for 'mp' backend issues).
- Raising an error or a more critical warning if an unsupported combination is detected, as sketched below.
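A hypothetical sketch of the stricter handling suggested above (the real check lives in `vllm/platforms/xpu.py` and may be structured differently):

```python
import os


def check_mp_start_method(distributed_executor_backend: str) -> None:
    """Fail fast instead of only warning when 'mp' is combined with 'fork'."""
    if distributed_executor_backend != "mp":
        return
    method = os.environ.get("VLLM_WORKER_MULTIPROC_METHOD", "fork")
    if method != "spawn":
        raise RuntimeError(
            "XPU with the 'mp' executor requires "
            f"VLLM_WORKER_MULTIPROC_METHOD=spawn, but got {method!r}. "
            "Set the environment variable or choose another backend.")
```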
vllm/v1/worker/xpu_worker.py (outdated)
The FIXME comment highlights an important limitation: memory_allocated() doesn't account for non-torch allocations, and a fixed 128MB is assumed. This could lead to inaccuracies in memory estimation, potentially causing OOM errors if non-torch allocations are significantly different.
While a perfect solution might be complex, consider if there are any platform-specific tools or libraries for XPU that could provide a more accurate measure of total used memory, or if this estimation needs to be configurable.
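As one possible direction, here is a sketch of estimating non-torch allocations from device-level memory counters rather than a fixed constant. It assumes `torch.xpu.mem_get_info` is available in the installed PyTorch/IPEX stack; otherwise it keeps the fixed fallback.

```python
import torch


def estimate_non_torch_bytes(device: torch.device,
                             fallback: int = 128 * 1024 * 1024) -> int:
    """Estimate memory used outside of torch's allocator on an XPU device."""
    if hasattr(torch.xpu, "mem_get_info"):
        free_bytes, total_bytes = torch.xpu.mem_get_info(device)
        used_bytes = total_bytes - free_bytes              # all memory the device reports as used
        torch_bytes = torch.xpu.memory_allocated(device)   # memory torch knows about
        return max(used_bytes - torch_bytes, 0)
    return fallback  # fall back to the fixed estimate mentioned in the FIXME
```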
vllm/_ipex_ops.py (outdated)
The TODO comment indicates that FP8 KV cache support is pending. While this is acceptable for now, it's good practice to create a tracking issue (e.g., on GitHub) for this TODO and reference it in the comment. This helps ensure the task isn't forgotten and provides visibility into planned enhancements.
vllm/platforms/xpu.py (outdated)
The logic for device_support_bf16 has changed. Previously, it explicitly checked for "data center gpu" to return True and warned for unknown devices. Now, it defaults to True for devices not matching is_client_gpu_a770().
This implies a broader assumption of bf16 support. Ensure this is intended and well-tested across various Intel GPUs. The log message now encourages users to file an issue, which is good, but the previous more conservative approach might have prevented issues on unsupported/unknown hardware.
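For comparison, a hypothetical sketch of the earlier, more conservative behavior; the device-name matching here is an assumption for illustration, not the code in this PR.

```python
import logging

import torch

logger = logging.getLogger(__name__)


def device_support_bf16() -> bool:
    """Conservative bf16 check based on the reported device name."""
    name = torch.xpu.get_device_name().lower()
    if "a770" in name:             # known client GPU without native bf16
        return False
    if "data center gpu" in name:  # e.g. PVC / Max series
        return True
    logger.warning(
        "Unknown Intel GPU %r: assuming bf16 is supported. "
        "Please file an issue if you hit accuracy or stability problems.",
        name)
    return True
```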
vllm/v1/worker/xpu_model_runner.py (outdated)
vllm/v1/worker/xpu_worker.py (outdated)
The calculation of `total_allocated_bytes` as `self.xpu_get_mem_info()[1] - self.xpu_get_mem_info()[0]` re-queries free and total memory. It would be more efficient and clearer to reuse the values already obtained, or to use `torch.xpu.memory_allocated()` for `torch_allocated_bytes` and derive `non_torch_allocations` from a more direct measure of total device allocations if one is available; otherwise, stick consistently to the estimation method used in `xpu_get_mem_info`.
Force-pushed 765354d to 07d2f02
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
(3 similar comments)
Force-pushed 07d2f02 to 645ee92
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 645ee92 to 3f8d779
@WoosukKwon, could you please review this PR?
WoosukKwon left a comment:
LGTM! Thanks for the PR!
Force-pushed 3f8d779 to 5cfe4b9
Squashed commits (each signed off by Kunshang Ji <kunshang.ji@intel.com>):
- some v1 fixes
- remove useless file
- remove
- add V1 test and set spawn in docker env
- add missing dependency
- fix test
- update api name
- update api
- update default block size for v1
- update memory usage
- fix rebase issues
- fix rebase, spec decode meta set to none
- add xpu v1 config check
- add mem log
- fix init cache
- add xpu profiler for V1
- update rebase issue
- update prepare_inputs for perf
- update
- refine xpu_model_runner
- fix sampler, attn metadata
- fix rebase issues
- fix format issues
- keep rebase
- Fix block_size setting hard code (vllm-project#175)
- Instances created using VllmConfig() typically have model_config as None by default; add a check to prevent potential null exceptions (vllm-project#173)
- sign off
- fix format
- fix
- fix typo
- update xpu path
- address comments
- address comments, refine chunk_prefill op API
- refine APIs (Co-authored-by: yan <yan.ma@intel.com>, Co-authored-by: mayuyuace <qiming1.zhang@intel.com>)
- only A770 will fallback to fp16
- update arg_utils
- minimal code change, while performance is not best
- revert unnecessary code change
@simon-mo @WoosukKwon @robertgshaw2-redhat @mgoin, could you take a final look? I hope we can merge this soon.
from vllm.attention.ops.merge_attn_states import merge_attn_states
from vllm.attention.utils.fa_utils import (flash_attn_supports_fp8,
                                           get_flash_attn_version)
    flash_attn_varlen_func,
This import has to be under some guard rather than unconditional, as the symbol doesn't exist on ROCm.
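One possible shape of such a guard (illustrative; the import path mirrors what other CUDA backends use and is an assumption here):

```python
# Import lazily/optionally so platforms without vllm_flash_attn
# (e.g. ROCm builds) do not fail at import time.
try:
    from vllm.vllm_flash_attn import flash_attn_varlen_func
except ImportError:  # platform without the kernel
    flash_attn_varlen_func = None  # resolved per-platform at runtime (e.g. ipex_ops on XPU)
```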
@jikunshang
lgtm
Essential Elements of an Effective PR Description Checklist
- `supported_models.md` and `examples` for a new model

Purpose
The previous PR is #14612.

This PR adds Intel GPU V1 engine support. It includes:
a. integrating the `flash_attn_varlen_func` kernel (V2) on XPU, which is implemented in IPEX;
b. refining some XPU-related configs;
c. supporting vLLM V1 by adding `xpu_worker`/`xpu_model_runner` to handle the XPU code path;
d. Dockerfile and dependency updates;
e. adding a V1 test in CI.
Test Plan
Add a V1 test in the XPU CI.
The Intel vLLM team is also working on fixing unit tests on XPU; we will upstream those in follow-up PRs.
Test Result
I ran a quick benchmark of both V0 and V1 on PVC 1100. V1 shows a 1.55x throughput boost compared to V0 (3328.59 vs. 2139.68). Note that neither run used tuned parameters.
The command I used:
(Optional) Documentation Update