
Conversation

@jikunshang (Collaborator) commented Jun 12, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

The previous PR is #14612.
This PR adds Intel GPU V1 engine support (a minimal usage sketch follows the list below). It includes:
a. Integrate the flash_attn_varlen_func kernel (V2) on XPU, which is implemented in IPEX.
b. Refine some XPU-related configs.
c. Support vLLM V1 by adding xpu_worker/xpu_model_runner to handle the XPU code path.
d. Dockerfile and dependency updates.
e. Add a V1 test in CI.
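
For readers who want to exercise the new path, here is a minimal, hedged usage sketch. It only restates what the benchmark commands in the Test Result section imply (VLLM_USE_V1=1 plus the standard offline LLM API); running it requires an Intel GPU build of vLLM with IPEX, and the prompt and sampling settings are arbitrary.

    # Minimal sketch: opt in to the V1 engine on an Intel GPU (XPU) build.
    import os

    os.environ["VLLM_USE_V1"] = "1"  # "0" selects the legacy V0 engine

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        enforce_eager=True,  # mirrors the benchmark command in this PR
    )
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate(["Summarize what vLLM's V1 engine changes."], params)
    print(outputs[0].outputs[0].text)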

Test Plan

Add a V1 test to the XPU CI.
The Intel vLLM team is also working on fixing unit tests on XPU; we will upstream those in follow-up PRs.

Test Result

I ran a quick benchmark of both V0 and V1 on a PVC 1100. V1 shows a 1.55x throughput boost over V0 (3328.59 vs. 2139.68). Note that neither run used tuned parameters.
Commands used:

VLLM_USE_V1=0 python3 benchmarks/benchmark_throughput.py --model meta-llama/Meta-Llama-3-8B-Instruct --dataset benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json  --enforce-eager 
VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model meta-llama/Meta-Llama-3-8B-Instruct --dataset benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json  --enforce-eager 

(Optional) Documentation Update

@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of that by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @jikunshang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've added support for the vLLM V1 engine on Intel XPU hardware. This involves integrating the IPEX Flash Attention kernel, introducing new worker and model runner classes specifically for XPU under the V1 architecture, and updating various configuration and platform-specific checks. The changes also include necessary dependency and CI updates, and initial benchmarks indicate a significant performance improvement with the V1 engine on supported hardware.

Highlights

  • Intel XPU V1 Engine Support: This PR introduces initial support for the vLLM V1 engine on Intel XPU (GPU) hardware, enabling the use of the newer architecture and features.
  • IPEX Flash Attention Integration: The core of the V1 support involves integrating the flash_attn_varlen_func kernel (V2) provided by IPEX (Intel Extension for PyTorch) for optimized attention computation on XPU.
  • New XPU Worker and Model Runner: Dedicated XPUWorker and XPUModelRunner classes are added under the vllm.v1 namespace to handle the XPU-specific code path and configurations for the V1 engine.
  • Configuration and Platform Updates: Updates are made to platform detection, configuration checking (including block size for V1, bf16 support nuances for different Intel GPUs, and distributed backend handling), and environment variable usage (VLLM_WORKER_MULTIPROC_METHOD).
  • CI and Dependency Updates: A new test case for the V1 engine on XPU is added to the Buildkite CI script, and the numba dependency is pinned in requirements/xpu.txt.
  • Performance Improvement: Initial benchmarks on PVC 1100 show a 1.55x throughput performance boost for the V1 engine compared to V0.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds initial support for Intel GPUs (XPU) using the V1 engine, integrating the FlashAttention backend via IPEX. Key changes include:

  • XPU V1 Engine Integration: New XPUWorker and XPUModelRunner classes handle XPU-specific code paths for the V1 engine.
  • FlashAttention for XPU: flash_attn_varlen_func and reshape_and_cache_flash are now routed through ipex_ops for XPU.
  • Configuration Updates: XPU-related configurations in EngineArgs, Platform, and Dockerfiles have been updated to support V1.
  • CI: A new V1 test case for XPU has been added to the CI pipeline.
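
To make the routing in the second bullet concrete, here is an illustrative dispatch sketch. It is not the PR's actual code: the wrapper name and the blanket argument forwarding are assumptions, and the real signatures in vllm/_ipex_ops.py and the flash-attn wrapper module may differ.

    # Illustrative sketch only; names and argument handling are simplified.
    from vllm.platforms import current_platform


    def dispatch_flash_attn_varlen(*args, **kwargs):
        """Route variable-length flash attention to the platform kernel."""
        if current_platform.is_xpu():
            # Served by the IPEX-backed wrapper this PR integrates.
            from vllm._ipex_ops import ipex_ops
            return ipex_ops.flash_attn_varlen_func(*args, **kwargs)
        # Default CUDA path (module path assumed for illustration).
        from vllm.vllm_flash_attn import flash_attn_varlen_func
        return flash_attn_varlen_func(*args, **kwargs)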

Overall, the changes look promising for enabling V1 features on Intel GPUs. I've identified a few areas for potential improvement:

  • Typo: A minor typo in a parameter name in _ipex_ops.py.
  • Unused Parameters: Some function parameters in _ipex_ops.py appear unused.
  • Memory Estimation: The FIXME in XPUWorker regarding non-torch memory allocation for client GPUs is a concern for accurate memory management.
  • Clarity and Robustness: Some logic in platform-specific configuration and error handling could be clarified or made more robust.

Please review the detailed comments. The performance improvements cited are significant, making this a valuable addition.

Comment on lines 124 to 127
@gemini-code-assist bot commented (severity: high)

The warning message "Please use spawn as start method if you want to use mp." is logged if VLLM_WORKER_MULTIPROC_METHOD is not 'spawn' when distributed_executor_backend is 'mp'. However, the code doesn't enforce or change the backend if this condition isn't met. This might lead to runtime issues if 'fork' is used, as hinted by the FIXME comment.

Consider either:

  1. More strongly enforcing 'spawn' by changing distributed_executor_backend if 'mp' is chosen with an incompatible start method (similar to how it was handled before for 'mp' backend issues).
  2. Raising an error or a more critical warning if an unsupported combination is detected.
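
A minimal sketch of option 1, under stated assumptions: the helper name is hypothetical, the config field names follow vLLM's ParallelConfig, and whether overriding the environment variable at this point actually takes effect would need verification.

    # Hedged sketch of option 1 (hypothetical helper, not the PR's code):
    # force 'spawn' instead of only warning when the 'mp' backend is used.
    import os

    import vllm.envs as envs


    def _enforce_spawn_for_mp_backend(parallel_config) -> None:
        if parallel_config.distributed_executor_backend != "mp":
            return
        if envs.VLLM_WORKER_MULTIPROC_METHOD != "spawn":
            # Option 1: override the start method up front.
            os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
            # Option 2 (stricter): raise instead of silently overriding.
            # raise RuntimeError(
            #     "XPU requires VLLM_WORKER_MULTIPROC_METHOD=spawn with the "
            #     "'mp' distributed executor backend; 'fork' is unsupported.")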

Comment on lines +61 to +62
@gemini-code-assist bot commented (severity: high)

The FIXME comment highlights an important limitation: memory_allocated() doesn't account for non-torch allocations, and a fixed 128MB is assumed. This could lead to inaccuracies in memory estimation, potentially causing OOM errors if non-torch allocations are significantly different.

While a perfect solution might be complex, consider if there are any platform-specific tools or libraries for XPU that could provide a more accurate measure of total used memory, or if this estimation needs to be configurable.
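
One possible direction, sketched under assumptions: torch.xpu.mem_get_info and torch.xpu.memory_allocated are taken to exist in the PyTorch/XPU build in use, and the VLLM_XPU_NON_TORCH_MEM_MB environment variable is hypothetical, not something this PR defines.

    # Hedged sketch: make the non-torch headroom configurable instead of a
    # hard-coded 128 MiB. The env-var name is hypothetical.
    import os

    import torch


    def estimate_used_memory_bytes(device: int = 0) -> int:
        free_bytes, total_bytes = torch.xpu.mem_get_info(device)
        torch_allocated = torch.xpu.memory_allocated(device)
        # Memory the driver reports as used beyond torch's own allocations
        # (IPEX kernels, SYCL runtime buffers, ...).
        non_torch = max(0, (total_bytes - free_bytes) - torch_allocated)
        # Fall back to a configurable floor when the driver under-reports.
        floor_mb = int(os.getenv("VLLM_XPU_NON_TORCH_MEM_MB", "128"))
        non_torch = max(non_torch, floor_mb * 1024 * 1024)
        return torch_allocated + non_torch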

@gemini-code-assist bot commented (severity: medium)

The TODO comment indicates that FP8 KV cache support is pending. While this is acceptable for now, it's good practice to create a tracking issue (e.g., on GitHub) for this TODO and reference it in the comment. This helps ensure the task isn't forgotten and provides visibility into planned enhancements.

Comment on lines +160 to 164
@gemini-code-assist bot commented (severity: medium)

The logic for device_support_bf16 has changed. Previously, it explicitly checked for "data center gpu" to return True and warned for unknown devices. Now, it defaults to True for devices not matching is_client_gpu_a770().

This implies a broader assumption of bf16 support. Ensure this is intended and well-tested across various Intel GPUs. The log message now encourages users to file an issue, which is good, but the previous more conservative approach might have prevented issues on unsupported/unknown hardware.
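
For reference, a hedged sketch of the shape of the described policy; it assumes torch.xpu.get_device_name is available and matches devices by substring, both of which are illustrative choices rather than the PR's implementation.

    # Illustrative sketch of the described bf16 policy (not the PR's code).
    import torch


    def device_support_bf16(device: int = 0) -> bool:
        name = torch.xpu.get_device_name(device).lower()
        if "a770" in name:
            # Client Arc A770: fall back to fp16.
            return False
        # Broader assumption this PR makes: default to bf16 support and ask
        # users to file an issue if their device misbehaves.
        return True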

Comment on lines +25 to +26
@gemini-code-assist bot commented (severity: medium)

The FIXME comment indicates that cascade_attn_enabled needs verification. It's good to track this. Consider creating a GitHub issue for this verification and referencing it in the comment, e.g., # FIXME: To be verified. See #<issue_number>. This helps ensure the task is addressed.

Comment on lines +109 to +112
@gemini-code-assist bot commented (severity: medium)

The calculation of total_allocated_bytes as self.xpu_get_mem_info()[1] - self.xpu_get_mem_info()[0] queries free and total memory twice. It would be more efficient and clearer to call xpu_get_mem_info once and reuse the returned values, or to take torch_allocated_bytes directly from torch.xpu.memory_allocated() and derive non_torch_allocations from a single, consistent measure of total usage.
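
A minimal sketch of the single-snapshot variant, assuming (as the expression above implies) that xpu_get_mem_info returns a (free_bytes, total_bytes) pair; the function wrapper is purely illustrative.

    # Hedged sketch of the suggestion: one memory snapshot, everything else
    # derived from it.
    import torch


    def snapshot_memory(xpu_get_mem_info):
        free_bytes, total_bytes = xpu_get_mem_info()          # single query
        total_allocated_bytes = total_bytes - free_bytes      # all usage
        torch_allocated_bytes = torch.xpu.memory_allocated()  # torch's share
        non_torch_allocations = max(
            0, total_allocated_bytes - torch_allocated_bytes)
        return (total_allocated_bytes, torch_allocated_bytes,
                non_torch_allocations)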

@gemini-code-assist bot commented

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

3 similar comments

@jikunshang force-pushed the kunshang/xpu_v1_flash_attn branch from 07d2f02 to 645ee92 on June 17, 2025 04:43
@mergify bot commented Jun 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jikunshang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Jun 19, 2025
@jikunshang force-pushed the kunshang/xpu_v1_flash_attn branch from 645ee92 to 3f8d779 on June 19, 2025 07:11
@mergify bot removed the needs-rebase label Jun 19, 2025
@xuechendi (Contributor) commented

@WoosukKwon, could you review this PR?

@WoosukKwon (Collaborator) left a comment

LGTM! Thanks for the PR!

@WoosukKwon added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 25, 2025
@jikunshang force-pushed the kunshang/xpu_v1_flash_attn branch from 3f8d779 to 5cfe4b9 on June 26, 2025 04:11
Commits (each signed off by Kunshang Ji <kunshang.ji@intel.com>):

  • some v1 fixes
  • remove useless file
  • remove
  • add V1 test and set spawn in docker env
  • add missing dependency
  • fix test
  • update api name
  • update api
  • update default block size for v1
  • update memory usage
  • fix rebase issues
  • fix rebase, spec decode meta set to none
  • add xpu v1 config check
  • add mem log
  • fix init cache
  • add xpu profiler for V1
  • update rebase issue
  • update prepare_inputs for perf
  • update
  • refine xpu_model_runner
  • fix sampler, attn metadata
  • fix rebase issues
  • fix format issues
  • keep rebase
  • Fix block_size setting hard code; (vllm-project#175)
  • Instances created using VllmConfig() typically have model_config as None by default. The modification involves adding a check to prevent potential null exceptions. (vllm-project#173)
  • sign off
  • fix format
  • fix
  • fix typo
  • update xpu path
  • address comments
  • address comments, refine chunk_prefill op API
  • refine APIs (Co-authored-by: yan <yan.ma@intel.com>; Co-authored-by: mayuyuace <qiming1.zhang@intel.com>)
  • only A770 will fallback to fp16
  • update arg_utils
  • minimal code change, while performance is not best
  • revert unnecessary code change
@xuechendi (Contributor) commented

@simon-mo @WoosukKwon @robertgshaw2-redhat @mgoin, could you take a final look? We hope to merge this soon.

@WoosukKwon merged commit b69781f into vllm-project:main Jun 26, 2025
74 checks passed
from vllm.attention.ops.merge_attn_states import merge_attn_states
from vllm.attention.utils.fa_utils import (flash_attn_supports_fp8,
                                           flash_attn_varlen_func,
                                           get_flash_attn_version)
@gshtras (Collaborator) left a comment
This has to be under some guard rather than unconditional, as it doesn't exist on ROCm
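
A minimal sketch of the kind of guard being asked for; the module path is taken from the snippet above, and the try/except form is just one option (a current_platform check is another). The actual fix landed in #20143.

    # Illustrative guard only; the real fix is in PR #20143.
    try:
        # Not exported on ROCm builds, so tolerate its absence.
        from vllm.attention.utils.fa_utils import flash_attn_varlen_func
    except ImportError:
        flash_attn_varlen_func = None
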
@jikunshang commented

Hi @gshtras, thanks for pointing out the issue; I added a quick fix in #20143.

@skyloevil (Contributor) commented

lgtm


Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), v1


5 participants