Conversation

@yma11 commented Sep 24, 2025

This PR provides release notes for the vLLM v0.10.2 release on Intel Multi-Arc, covering key features, optimizations, and HowTos.

Signed-off-by: Yan Ma <yan.ma@intel.com>
| Component | Version    |
|-----------|------------|
| OneAPI    | 2025.1.3-0 |
| PyTorch   | 2.8        |
| IPEX      | 2.8.10     |
| OneCCL    | 2021.15.4  |

The oneCCL version is likely to change; keep it as a placeholder to update when the BKC release happens.


oneCCL is now 2025.15.6


vLLM supports pooling models such as embedding, classification, and reward models. All of these models are now supported on Intel® GPUs. For detailed usage, refer to the [guide](https://docs.vllm.ai/en/latest/models/pooling_models.html).
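
For reference, a minimal offline sketch of the pooling path; the `task="embed"` argument and `LLM.embed()` call follow the upstream vLLM pooling-models guide linked above and are assumed to behave the same on Intel® GPUs, and the model name is taken from the support table in this PR.

```python
# Minimal sketch, assuming the upstream vLLM pooling API (task="embed",
# LLM.embed()) behaves the same on Intel GPUs as described in the linked guide.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-Embedding-8B", task="embed")

outputs = llm.embed(["vLLM release notes for Intel Multi-Arc"])
for out in outputs:
    # Each result holds one embedding vector for the corresponding prompt.
    print(len(out.outputs.embedding))
```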

* Pipeline Parallelism

This comment was marked as resolved.


* Data Parallelism

vLLM supports [Data Parallel](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html) deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests. This works with both dense and MoE models. However, DP + EP is not yet supported on Intel® GPUs.
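
As an illustration only, a sketch of a data-parallel launch; the `--data-parallel-size` flag is taken from the linked deployment guide, and the model name is an arbitrary example, so adjust both for your setup.

```python
# Illustrative sketch: launch a data-parallel vLLM server as described in the
# linked deployment guide. Flags are assumed from that guide; the model name is
# an arbitrary example.
import subprocess

subprocess.run(
    [
        "vllm", "serve", "Qwen/Qwen2.5-7B-Instruct",
        "--data-parallel-size", "2",    # two replicas, each holding a full copy of the weights
        "--tensor-parallel-size", "1",  # no tensor parallelism inside a replica in this sketch
    ],
    check=True,
)
```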

"This will work with both dense and MoE models. But for Intel® GPUs, we currently don't support DP + EP for now."
-> This will work with both dense and MoE models. Note that expert parallelism is being enabled and will be supported soon.

* **torch.compile**: Can be enabled for fp16/bf16 path.
* **speculative decoding**: Supports methods `n-gram`, `EAGLE` and `EAGLE3` (a sketch follows this list).
* **async scheduling**: Can be enabled by `--async-scheduling`. This may help reduce the CPU overheads, leading to better latency and throughput. However, async scheduling is currently not supported with some features such as structured outputs, speculative decoding, and pipeline parallelism.
* **MoE models**: Models with MoE structure like gpt-oss, Deepseek-v2-lite and Qwen/Qwen3-30B-A3B are now supported.
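
The sketch below shows the n-gram speculative decoding path mentioned in the list; the `speculative_config` keys follow the upstream vLLM API and are assumed to apply unchanged on Intel® GPUs, and the model name is only an example.

```python
# Sketch of n-gram speculative decoding, assuming the upstream vLLM
# speculative_config API applies unchanged on Intel GPUs; the model name is an
# arbitrary example.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    speculative_config={
        "method": "ngram",            # draft tokens come from n-gram lookup in the prompt
        "num_speculative_tokens": 5,  # propose up to 5 draft tokens per step
        "prompt_lookup_max": 4,       # longest n-gram to match in the prompt
    },
)

print(llm.generate(["vLLM on Intel Multi-Arc supports"], SamplingParams(max_tokens=32)))
```
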
@rogerxfeng8 Sep 25, 2025

MoE models are officially supported in this release, not "experimental". They are actually among the key models we optimized, besides multimodality.
Let's move the MoE models to the official feature list. GPT-OSS 20B and 120B in the mxfp4 data type should be highlighted here.


The following are known issues:

* Qwen/Qwen3-30B-A3B needs `--gpu-memory-utilization=0.8` set due to its high memory consumption.
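
For reference, a minimal offline sketch of this workaround; the CLI flag maps to the `gpu_memory_utilization` argument of the Python `LLM` constructor, which is assumed to behave the same on Intel® GPUs.

```python
# Sketch of the documented workaround: cap GPU memory utilization at 0.8 for
# Qwen/Qwen3-30B-A3B. Assumes the upstream LLM constructor argument behaves the
# same on Intel GPUs.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-30B-A3B", gpu_memory_utilization=0.8)
print(llm.generate(["Hello from Intel Multi-Arc"], SamplingParams(max_tokens=16)))
```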

Is this still the case, or is it fp16/bf16 only? For fp8, my understanding is that it can work with `--gpu-memory-utilization=0.9`.


## Optimizations

* FMHA Optimizations: XXXXX.

Attention kernel optimizations for decoding steps.
MoE model optimizations using a persistent MoE GEMM kernel and a fused activation kernel to reduce kernel bubbles.

Signed-off-by: Yan Ma <yan.ma@intel.com>
* GPT-OSS 20B and 120B are supported in MXFP4 with optimized performance.
* Attention kernel optimizations for the decoding phase bring >10% e2e throughput improvement on 10+ models with 1k/512 input/output lengths.
* MoE models are optimized using a persistent MoE GEMM kernel and a fused activation kernel to reduce kernel bubbles. Qwen3-30B-A3B achieved a 2.6x e2e improvement and DeepSeek-V2-Lite a 1.5x e2e improvement.
* vLLM 0.10.2 with new features: P/D disaggregation, DP, tooling, reasoning output, structured output.

This comment was marked as resolved.

| Multi Modality | Qwen/Qwen2.5-VL-72B-Instruct |✅︎|✅︎| |
| Multi Modality | Qwen/Qwen2.5-VL-32B-Instruct |✅︎|✅︎| |
| Embedding Model | Qwen/Qwen3-Embedding-8B |✅︎|✅︎| |
| Reranker Model | Qwen/Qwen3-Reranker-8B |✅︎|✅︎| |

This comment was marked as resolved.

Signed-off-by: Yan Ma <yan.ma@intel.com>
| Component | Version    |
|-----------|------------|
| OneAPI    | 2025.1.3-0 |
| PyTorch   | 2.8        |
| IPEX      | 2.8.10     |
| OneCCL    | 2021.16.2  |

oneCCL: 2021.15.6.2

* More multi-modality models are supported with image/video as input, such as the InternVL series, MiniCPM-V-4, etc.
* vLLM 0.10.2 with new features: P/D disaggregation, DP, tooling, reasoning output, structured output.
* FP16/BF16 GEMM optimizations for batch sizes 1-128, with obvious improvement for small batch sizes.


* GPT-OSS 20B and 120B are supported in MXFP4 weight-only quantization with optimized performance.
* Attention kernel optimizations in the decoding phase for all workloads achieved >10% end-to-end throughput improvement on 10+ models across all input/output sequence lengths.
* MoE models are optimized using a persistent MoE GEMM kernel and a fused activation kernel to reduce kernel bubbles. Qwen3-30B-A3B achieved a 2.6x end-to-end improvement and DeepSeek-V2-Lite a 1.5x end-to-end improvement.
* More multi-modality models are supported with image/video as input, such as the InternVL series, MiniCPM-V-4, etc.
* vLLM 0.10.2 with new features: Prefill/Decoding disaggregation, Data Parallel, tooling, reasoning output, structured output.
* FP16/BF16 GEMM optimizations for batch sizes 1-128, with obvious improvement for small batch sizes.


In addition, following the vLLM V1 design, corresponding optimized kernels and features are implemented for Intel® GPUs.

* chunked_prefill:

  • Chunked prefill


* chunked_prefill:

chunked_prefill is an optimization feature in vLLM that allows large prefill requests to be divided into small chunks and batched together with decode requests. This approach prioritizes decode requests, improving inter-token latency (ITL) and GPU utilization by combining compute-bound (prefill) and memory-bound (decode) requests in the same batch. The vLLM v1 engine is built on this feature, and in this release it is also supported on Intel® GPUs by leveraging the corresponding kernel from Intel® Extension for PyTorch\* for model execution.
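
A minimal offline sketch of enabling this explicitly; `enable_chunked_prefill` and `max_num_batched_tokens` are the upstream vLLM engine arguments and are assumed to apply unchanged on Intel® GPUs (the v1 engine typically enables chunked prefill by default), and the model name is an arbitrary example.

```python
# Sketch of explicitly enabling chunked prefill, assuming the upstream engine
# arguments apply unchanged on Intel GPUs (the v1 engine enables it by default);
# the model name is an arbitrary example.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # per-step token budget shared by prefill chunks and decodes
)
print(llm.generate(["Chunked prefill splits long prompts into"], SamplingParams(max_tokens=16)))
```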

Chunked prefill


* Pooling Models Support

vLLM supports pooling models such as embedding, classification, and reward models. All of these models are now supported on Intel® GPUs. For detailed usage, refer to the [guide](https://docs.vllm.ai/en/latest/models/pooling_models.html).

refer to [guide]

@yma11 force-pushed the 2509 branch 2 times, most recently from ccb27a3 to 311d997 on October 30, 2025 at 11:49
Signed-off-by: Yan Ma <yan.ma@intel.com>
@sharvil10 left a comment

LGTM

@sharvil10 merged commit cf838b4 into intel:main on Oct 31, 2025
7 checks passed