update readme for vLLM 0.10.2 release on Intel GPU #869
Conversation
Signed-off-by: Yan Ma <yan.ma@intel.com>
vllm/0.10.2-xpu.md
Outdated
| OneAPI | 2025.1.3-0 |
| PyTorch | PyTorch 2.8 |
| IPEX | 2.8.10 |
| OneCCL | 2021.15.4 |
The oneCCL version is likely to change; keep it as a placeholder to be updated when the BKC release happens.
oneCCL is now 2025.15.6.
vLLM supports pooling models such as embedding, classification and reward models. All of these models are now supported on Intel® GPUs. For detailed usage, refer [guide](https://docs.vllm.ai/en/latest/models/pooling_models.html).
* Pipeline Parallelism
vllm/0.10.2-xpu.md
Outdated
* Data Parallelism
vLLM supports [Data Parallel](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html) deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests. This will work with both dense and MoE models. But for Intel® GPUs, we currently don't support DP + EP for now.
"This will work with both dense and MoE models. But for Intel® GPUs, we currently don't support DP + EP for now."
-> This will work with both dense and MoE models. Note export parallelism is under enabling that will be supported soon.
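For reference, a minimal sketch of such a data-parallel deployment, written as a Python wrapper around the `vllm serve` CLI; the model and parallel sizes are illustrative placeholders, not values from this PR:

```python
import subprocess

# Two full model replicas (DP=2), each on its own GPU (TP=1), serving
# independent request batches behind one OpenAI-compatible endpoint.
cmd = [
    "vllm", "serve", "Qwen/Qwen3-30B-A3B",   # placeholder model
    "--data-parallel-size", "2",
    "--tensor-parallel-size", "1",
]
subprocess.run(cmd, check=True)  # listens on port 8000 by default
```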
vllm/0.10.2-xpu.md
Outdated
* **torch.compile**: Can be enabled for fp16/bf16 path.
* **speculative decoding**: Supports methods `n-gram`, `EAGLE` and `EAGLE3`.
* **async scheduling**: Can be enabled by `--async-scheduling`. This may help reduce the CPU overheads, leading to better latency and throughput. However, async scheduling is currently not supported with some features such as structured outputs, speculative decoding, and pipeline parallelism.
* **MoE models**: Models with MoE structure like gpt-oss, Deepseek-v2-lite and Qwen/Qwen3-30B-A3B are now supported.
MoE models are officially supported in this release, not "experimental". They are actually among the key models we optimized, besides multimodality.
Let's move the MoE models to the official feature list. GPT-OSS 20B and 120B in the mxfp4 data type should be highlighted here.
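To illustrate the speculative decoding entry above, here is a minimal offline sketch assuming vLLM's `speculative_config` engine argument with the n-gram method; the model name and parameter values are illustrative, not taken from the release notes:

```python
from vllm import LLM, SamplingParams

# n-gram speculative decoding needs no separate draft model: draft tokens
# are proposed by prompt lookup.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 3,
        "prompt_lookup_max": 4,
    },
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```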
vllm/0.10.2-xpu.md
Outdated
The following are known issues:
* Qwen/Qwen3-30B-A3B needs `--gpu-memory-utilization=0.8` due to its high memory consumption.
Is this still the case, or is it fp16/bf16 only? For fp8, my understanding is that it can work with `--gpu-memory-utilization=0.9`.
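For context, the flag under discussion maps to the `gpu_memory_utilization` engine argument; a minimal sketch with the 0.8 value from the draft note (whether 0.9 suffices for fp8 is the open question above):

```python
from vllm import LLM

# Cap the fraction of GPU memory vLLM reserves for weights + KV cache at 0.8
# (default is 0.9), per the known-issue note for this MoE model.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B",
    gpu_memory_utilization=0.8,
)
```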
vllm/0.10.2-xpu.md
Outdated
## Optimizations
* FMHA Optimizations: XXXXX.
- Attention kernel optimizations for decoding steps
- MoE model optimizations using persistent MoE gemm kernel and fused activation kernel to reduce the kernel bubbles.
Signed-off-by: Yan Ma <yan.ma@intel.com>
vllm/0.10.2-xpu.md
Outdated
* Gpt-oss 20B and 120B are supported in MXFP4 with optimized performance.
* Attention kernel optimizations for the decoding phase bring >10% e2e throughput improvement on 10+ models with 1k/512 as input/output len.
* MoE models are optimized using a persistent MoE gemm kernel and a fused activation kernel to reduce the kernel bubbles. Qwen3-30B-A3B achieved a 2.6X e2e improvement and DeepSeek-V2-lite achieved a 1.5X e2e improvement.
* vLLM 0.10.2 with new features: P/D disaggregation, DP, tooling, reasoning output, structured output.
| Multi Modality | Qwen/Qwen2.5-VL-72B-Instruct | ✅︎ | ✅︎ | |
| Multi Modality | Qwen/Qwen2.5-VL-32B-Instruct | ✅︎ | ✅︎ | |
| Embedding Model | Qwen/Qwen3-Embedding-8B | ✅︎ | ✅︎ | |
| Reranker Model | Qwen/Qwen3-Reranker-8B | ✅︎ | ✅︎ | |
Signed-off-by: Yan Ma <yan.ma@intel.com>
vllm/0.10.2-xpu.md
Outdated
| OneAPI | 2025.1.3-0 |
| PyTorch | PyTorch 2.8 |
| IPEX | 2.8.10 |
| OneCCL | 2021.16.2 |
oneCCL: 2021.15.6.2
* More multi-modality models are supported with image/video as input, like InternVL series, MiniCPM-V-4, etc.
* vLLM 0.10.2 with new features: P/D disaggregation, DP, tooling, reasoning output, structured output.
* FP16/BF16 gemm optimizations for batch size 1-128. Obvious improvement for small batch sizes.
- Gpt-oss 20B and 120B are supported in MXFP4 weight-only quantization with optimized performance.
- Attention kernel optimizations in the decoding phase for all workloads achieved >10% end-to-end throughput improvement on 10+ models across all in/out sequence lengths.
- MoE models are optimized using a persistent MoE gemm kernel and a fused activation kernel to reduce the kernel bubbles. Qwen3-30B-A3B achieved a 2.6x end-to-end improvement and DeepSeek-V2-lite achieved a 1.5x end-to-end improvement.
- More multi-modality models are supported with image/video as input, like the InternVL series, MiniCPM-V-4, etc.
- vLLM 0.10.2 with new features: Prefill/Decoding disaggregation, Data Parallel, tooling, reasoning output, structured output.
- FP16/BF16 gemm optimizations for batch sizes 1-128, with obvious improvement for small batch sizes.
vllm/0.10.2-xpu.md
Outdated
Besides, following the vLLM V1 design, corresponding optimized kernels and features are implemented for Intel GPUs.
* chunked_prefill:
- Chunked prefill
vllm/0.10.2-xpu.md
Outdated
* chunked_prefill:
chunked_prefill is an optimization feature in vLLM that allows large prefill requests to be divided into smaller chunks and batched together with decode requests. This approach prioritizes decode requests, improving inter-token latency (ITL) and GPU utilization by combining compute-bound (prefill) and memory-bound (decode) requests in the same batch. The vLLM v1 engine is built on this feature, and in this release it is also supported on Intel GPUs by leveraging the corresponding kernel from Intel® Extension for PyTorch\* for model execution.
Chunked prefill
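A minimal sketch of the related engine arguments, assuming the standard vLLM knobs (chunked prefill is on by default in the V1 engine; the model and token budget are illustrative):

```python
from vllm import LLM

# max_num_batched_tokens bounds the per-step token budget, so long prefills
# are split into chunks and co-scheduled with decode requests.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    enable_chunked_prefill=True,       # default in the V1 engine
    max_num_batched_tokens=2048,
)
```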
vllm/0.10.2-xpu.md
Outdated
* Pooling Models Support
vLLM supports pooling models such as embedding, classification and reward models. All of these models are now supported on Intel® GPUs. For detailed usage, refer [guide](https://docs.vllm.ai/en/latest/models/pooling_models.html).
refer to [guide]
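As an illustration of the pooling-model support discussed above, a minimal embedding sketch assuming vLLM's `task="embed"` / `LLM.embed()` interface (the model is taken from the supported-models table earlier in this thread; treat the exact API surface as an assumption):

```python
from vllm import LLM

# Load an embedding model with the pooling runner instead of the generator.
llm = LLM(model="Qwen/Qwen3-Embedding-8B", task="embed")

outputs = llm.embed(["vLLM on Intel GPUs supports pooling models."])
vector = outputs[0].outputs.embedding  # list[float]
print(len(vector))
```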
ccb27a3 to 311d997
sharvil10 left a comment
LGTM
This PR provides notes for the vLLM v0.10.2 release on Intel Multi-Arc, including some key features, optimizations and HowTos.