
Conversation

@jikunshang (Collaborator) commented Jun 12, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

The previous PR is #14612.
This PR adds Intel GPU V1 engine support (a minimal usage sketch follows the list below). It includes:
a. Integrate the flash_attn_varlen_func kernel (V2) on XPU, which is implemented in IPEX.
b. Refine some XPU-related configs.
c. Support vLLM V1 by adding xpu_worker/xpu_model_runner to handle the XPU code path.
d. Dockerfile and dependency updates.
e. Add a V1 test in CI.
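
For readers who want to exercise the new path, here is a minimal, hedged usage sketch. It only restates what the benchmark commands in the Test Result section imply (VLLM_USE_V1=1 plus the standard offline LLM API); running it requires an Intel GPU build of vLLM with IPEX, and the prompt and sampling settings are arbitrary.

    # Minimal sketch: opt in to the V1 engine on an Intel GPU (XPU) build.
    import os

    os.environ["VLLM_USE_V1"] = "1"  # "0" selects the legacy V0 engine

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        enforce_eager=True,  # mirrors the benchmark command in this PR
    )
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate(["Summarize what vLLM's V1 engine changes."], params)
    print(outputs[0].outputs[0].text)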

Test Plan

Add a V1 test to the XPU CI.
The Intel vLLM team is also working on fixing unit tests on XPU; we will upstream those in follow-up PRs.

Test Result

I ran a quick benchmark of both V0 and V1 on a PVC 1100. V1 shows a 1.55x throughput boost over V0 (3328.59 vs. 2139.68). Note that neither run used tuned parameters.
Commands used:

VLLM_USE_V1=0 python3 benchmarks/benchmark_throughput.py --model meta-llama/Meta-Llama-3-8B-Instruct --dataset benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json  --enforce-eager 
VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model meta-llama/Meta-Llama-3-8B-Instruct --dataset benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json  --enforce-eager 

(Optional) Documentation Update

@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of that by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @jikunshang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've added support for the vLLM V1 engine on Intel XPU hardware. This involves integrating the IPEX Flash Attention kernel, introducing new worker and model runner classes specifically for XPU under the V1 architecture, and updating various configuration and platform-specific checks. The changes also include necessary dependency and CI updates, and initial benchmarks indicate a significant performance improvement with the V1 engine on supported hardware.

Highlights

  • Intel XPU V1 Engine Support: This PR introduces initial support for the vLLM V1 engine on Intel XPU (GPU) hardware, enabling the use of the newer architecture and features.
  • IPEX Flash Attention Integration: The core of the V1 support involves integrating the flash_attn_varlen_func kernel (V2) provided by IPEX (Intel Extension for PyTorch) for optimized attention computation on XPU.
  • New XPU Worker and Model Runner: Dedicated XPUWorker and XPUModelRunner classes are added under the vllm.v1 namespace to handle the XPU-specific code path and configurations for the V1 engine.
  • Configuration and Platform Updates: Updates are made to platform detection, configuration checking (including block size for V1, bf16 support nuances for different Intel GPUs, and distributed backend handling), and environment variable usage (VLLM_WORKER_MULTIPROC_METHOD).
  • CI and Dependency Updates: A new test case for the V1 engine on XPU is added to the Buildkite CI script, and the numba dependency is pinned in requirements/xpu.txt.
  • Performance Improvement: Initial benchmarks on PVC 1100 show a 1.55x throughput performance boost for the V1 engine compared to V0.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds initial support for Intel GPUs (XPU) using the V1 engine, integrating the FlashAttention backend via IPEX. Key changes include:

  • XPU V1 Engine Integration: New XPUWorker and XPUModelRunner classes handle XPU-specific code paths for the V1 engine.
  • FlashAttention for XPU: flash_attn_varlen_func and reshape_and_cache_flash are now routed through ipex_ops for XPU.
  • Configuration Updates: XPU-related configurations in EngineArgs, Platform, and Dockerfiles have been updated to support V1.
  • CI: A new V1 test case for XPU has been added to the CI pipeline.
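
To make the routing in the second bullet concrete, here is an illustrative dispatch sketch. It is not the PR's actual code: the wrapper name and the blanket argument forwarding are assumptions, and the real signatures in vllm/_ipex_ops.py and the flash-attn wrapper module may differ.

    # Illustrative sketch only; names and argument handling are simplified.
    from vllm.platforms import current_platform


    def dispatch_flash_attn_varlen(*args, **kwargs):
        """Route variable-length flash attention to the platform kernel."""
        if current_platform.is_xpu():
            # Served by the IPEX-backed wrapper this PR integrates.
            from vllm._ipex_ops import ipex_ops
            return ipex_ops.flash_attn_varlen_func(*args, **kwargs)
        # Default CUDA path (module path assumed for illustration).
        from vllm.vllm_flash_attn import flash_attn_varlen_func
        return flash_attn_varlen_func(*args, **kwargs)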

Overall, the changes look promising for enabling V1 features on Intel GPUs. I've identified a few areas for potential improvement:

  • Typo: A minor typo in a parameter name in _ipex_ops.py.
  • Unused Parameters: Some function parameters in _ipex_ops.py appear unused.
  • Memory Estimation: The FIXME in XPUWorker regarding non-torch memory allocation for client GPUs is a concern for accurate memory management.
  • Clarity and Robustness: Some logic in platform-specific configuration and error handling could be clarified or made more robust.

Please review the detailed comments. The performance improvements cited are significant, making this a valuable addition.

Comment on lines 124 to 127
@gemini-code-assist bot commented (severity: high)

The warning message "Please use spawn as start method if you want to use mp." is logged if VLLM_WORKER_MULTIPROC_METHOD is not 'spawn' when distributed_executor_backend is 'mp'. However, the code doesn't enforce or change the backend if this condition isn't met. This might lead to runtime issues if 'fork' is used, as hinted by the FIXME comment.

Consider either:

  1. More strongly enforcing 'spawn' by changing distributed_executor_backend if 'mp' is chosen with an incompatible start method (similar to how it was handled before for 'mp' backend issues).
  2. Raising an error or a more critical warning if an unsupported combination is detected.
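
A minimal sketch of option 1, under stated assumptions: the helper name is hypothetical, the config field names follow vLLM's ParallelConfig, and whether overriding the environment variable at this point actually takes effect would need verification.

    # Hedged sketch of option 1 (hypothetical helper, not the PR's code):
    # force 'spawn' instead of only warning when the 'mp' backend is used.
    import os

    import vllm.envs as envs


    def _enforce_spawn_for_mp_backend(parallel_config) -> None:
        if parallel_config.distributed_executor_backend != "mp":
            return
        if envs.VLLM_WORKER_MULTIPROC_METHOD != "spawn":
            # Option 1: override the start method up front.
            os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
            # Option 2 (stricter): raise instead of silently overriding.
            # raise RuntimeError(
            #     "XPU requires VLLM_WORKER_MULTIPROC_METHOD=spawn with the "
            #     "'mp' distributed executor backend; 'fork' is unsupported.")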

Comment on lines +61 to +62
@gemini-code-assist bot commented (severity: high)

The FIXME comment highlights an important limitation: memory_allocated() doesn't account for non-torch allocations, and a fixed 128MB is assumed. This could lead to inaccuracies in memory estimation, potentially causing OOM errors if non-torch allocations are significantly different.

While a perfect solution might be complex, consider if there are any platform-specific tools or libraries for XPU that could provide a more accurate measure of total used memory, or if this estimation needs to be configurable.
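
One possible direction, sketched under assumptions: torch.xpu.mem_get_info and torch.xpu.memory_allocated are taken to exist in the PyTorch/XPU build in use, and the VLLM_XPU_NON_TORCH_MEM_MB environment variable is hypothetical, not something this PR defines.

    # Hedged sketch: make the non-torch headroom configurable instead of a
    # hard-coded 128 MiB. The env-var name is hypothetical.
    import os

    import torch


    def estimate_used_memory_bytes(device: int = 0) -> int:
        free_bytes, total_bytes = torch.xpu.mem_get_info(device)
        torch_allocated = torch.xpu.memory_allocated(device)
        # Memory the driver reports as used beyond torch's own allocations
        # (IPEX kernels, SYCL runtime buffers, ...).
        non_torch = max(0, (total_bytes - free_bytes) - torch_allocated)
        # Fall back to a configurable floor when the driver under-reports.
        floor_mb = int(os.getenv("VLLM_XPU_NON_TORCH_MEM_MB", "128"))
        non_torch = max(non_torch, floor_mb * 1024 * 1024)
        return torch_allocated + non_torch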

@gemini-code-assist bot commented (severity: medium)

The TODO comment indicates that FP8 KV cache support is pending. While this is acceptable for now, it's good practice to create a tracking issue (e.g., on GitHub) for this TODO and reference it in the comment. This helps ensure the task isn't forgotten and provides visibility into planned enhancements.

Comment on lines +160 to 164
@gemini-code-assist bot commented (severity: medium)

The logic for device_support_bf16 has changed. Previously, it explicitly checked for "data center gpu" to return True and warned for unknown devices. Now, it defaults to True for devices not matching is_client_gpu_a770().

This implies a broader assumption of bf16 support. Ensure this is intended and well-tested across various Intel GPUs. The log message now encourages users to file an issue, which is good, but the previous more conservative approach might have prevented issues on unsupported/unknown hardware.
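
For reference, a hedged sketch of the shape of the described policy; it assumes torch.xpu.get_device_name is available and matches devices by substring, both of which are illustrative choices rather than the PR's implementation.

    # Illustrative sketch of the described bf16 policy (not the PR's code).
    import torch


    def device_support_bf16(device: int = 0) -> bool:
        name = torch.xpu.get_device_name(device).lower()
        if "a770" in name:
            # Client Arc A770: fall back to fp16.
            return False
        # Broader assumption this PR makes: default to bf16 support and ask
        # users to file an issue if their device misbehaves.
        return True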

Comment on lines +25 to +26
@gemini-code-assist bot commented (severity: medium)

The FIXME comment indicates that cascade_attn_enabled needs verification. It's good to track this. Consider creating a GitHub issue for this verification and referencing it in the comment, e.g., # FIXME: To be verified. See #<issue_number>. This helps ensure the task is addressed.

Comment on lines +109 to +112
@gemini-code-assist bot commented (severity: medium)

The calculation of total_allocated_bytes as self.xpu_get_mem_info()[1] - self.xpu_get_mem_info()[0] queries free and total memory twice. It would be more efficient and clearer to call xpu_get_mem_info once and reuse the returned values, or to take torch_allocated_bytes directly from torch.xpu.memory_allocated() and derive non_torch_allocations from a single, consistent measure of total usage.
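
A minimal sketch of the single-snapshot variant, assuming (as the expression above implies) that xpu_get_mem_info returns a (free_bytes, total_bytes) pair; the function wrapper is purely illustrative.

    # Hedged sketch of the suggestion: one memory snapshot, everything else
    # derived from it.
    import torch


    def snapshot_memory(xpu_get_mem_info):
        free_bytes, total_bytes = xpu_get_mem_info()          # single query
        total_allocated_bytes = total_bytes - free_bytes      # all usage
        torch_allocated_bytes = torch.xpu.memory_allocated()  # torch's share
        non_torch_allocations = max(
            0, total_allocated_bytes - torch_allocated_bytes)
        return (total_allocated_bytes, torch_allocated_bytes,
                non_torch_allocations)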

@gemini-code-assist bot commented

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

3 similar comments

@jikunshang force-pushed the kunshang/xpu_v1_flash_attn branch from 07d2f02 to 645ee92 on June 17, 2025 04:43
@mergify bot commented Jun 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jikunshang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Jun 19, 2025
@jikunshang force-pushed the kunshang/xpu_v1_flash_attn branch from 645ee92 to 3f8d779 on June 19, 2025 07:11
@mergify bot removed the needs-rebase label Jun 19, 2025
@xuechendi (Contributor) commented

@WoosukKwon, could you review this PR?

@WoosukKwon (Collaborator) left a comment

LGTM! Thanks for the PR!

@WoosukKwon added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 25, 2025
@jikunshang force-pushed the kunshang/xpu_v1_flash_attn branch from 3f8d779 to 5cfe4b9 on June 26, 2025 04:11
Commits (each signed off by Kunshang Ji <kunshang.ji@intel.com>):

  • some v1 fixes
  • remove useless file
  • remove
  • add V1 test and set spawn in docker env
  • add missing dependency
  • fix test
  • update api name
  • update api
  • update default block size for v1
  • update memory usage
  • fix rebase issues
  • fix rebase, spec decode meta set to none
  • add xpu v1 config check
  • add mem log
  • fix init cache
  • add xpu profiler for V1
  • update rebase issue
  • update prepare_inputs for perf
  • update
  • refine xpu_model_runner
  • fix sampler, attn metadata
  • fix rebase issues
  • fix format issues
  • keep rebase
  • Fix block_size setting hard code; (vllm-project#175)
  • Instances created using VllmConfig() typically have model_config as None by default. The modification involves adding a check to prevent potential null exceptions. (vllm-project#173)
  • sign off
  • fix format
  • fix
  • fix typo
  • update xpu path
  • address comments
  • address comments, refine chunk_prefill op API
  • refine APIs (Co-authored-by: yan <yan.ma@intel.com>; Co-authored-by: mayuyuace <qiming1.zhang@intel.com>)
  • only A770 will fallback to fp16
  • update arg_utils
  • minimal code change, while performance is not best
  • revert unnecessary code change
@xuechendi (Contributor) commented

@simon-mo @WoosukKwon @robertgshaw2-redhat @mgoin, could you take a final look? We hope to merge this soon.

@WoosukKwon merged commit b69781f into vllm-project:main Jun 26, 2025
74 checks passed
from vllm.attention.ops.merge_attn_states import merge_attn_states
from vllm.attention.utils.fa_utils import (flash_attn_supports_fp8,
                                           flash_attn_varlen_func,
                                           get_flash_attn_version)
@gshtras (Collaborator) left a comment
This has to be under some guard rather than unconditional, as it doesn't exist on ROCm
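
A minimal sketch of the kind of guard being asked for; the module path is taken from the snippet above, and the try/except form is just one option (a current_platform check is another). The actual fix landed in #20143.

    # Illustrative guard only; the real fix is in PR #20143.
    try:
        # Not exported on ROCm builds, so tolerate its absence.
        from vllm.attention.utils.fa_utils import flash_attn_varlen_func
    except ImportError:
        flash_attn_varlen_func = None
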
@jikunshang commented

Hi @gshtras, thanks for pointing out the issue; I added a quick fix in #20143.

@skyloevil (Contributor) commented

lgtm


Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), v1


5 participants