This is a living document! We are eager to hear what you want from vLLM Ascend in 2025 Q3; any feedback is welcome.
You are also welcome to join the vLLM Ascend Weekly Meeting.
Release plan
Next release: v0.10.3 (v0.10.3rc1), expected around 2025.09.23: https://github.com/vllm-project/vllm-ascend/milestone/4
As a vital component of vLLM, the vLLM Ascend project is dedicated to providing easy, fast, and cheap LLM serving for everyone on Ascend NPUs, and to actively contributing to the enrichment of vLLM.
In 2025 Q2, we focused on four themes: vLLM Ascend for Production, Performance Optimization, Key Features, and Ecosystem Connect. In 2025 Q3, we will focus on: Default V1 Engine, Quality and Production Ready, User / Developer Experience, and Competitive for Key Workflows.
1. Default V1 Engine
- Stable plugin architecture for hardware platforms: [RFC]: Clear and stable interface for platform in vLLM vllm#22082
- Full V1 Engine support and V0 code path cleanup: [Feature]: Enable V1 by default and cleanup V0 code #1620
- Enable CustomOP registration (see the sketch after this list): [CustomOP][Refactor] Register CustomOP instead of overwrite forward_oot #1647
- V1 feature support enhancement
  - Enc-Dec models
  - V1 PP support: [V1][PP] Support pp with ray backend in V1 #1800
  - xPyD Disaggregate prefill for kv cache register style #950
  - V1 scheduler
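
On the CustomOP item above: the intent of #1647 is to register out-of-tree op implementations under an op name and dispatch to them, rather than overwriting forward_oot on vLLM's ops at import time. Below is a minimal sketch of that registry pattern; the names (_ASCEND_OPS, register_oot, AscendRMSNorm) are hypothetical illustrations, not the actual vLLM or vLLM Ascend API.

```python
# Hypothetical registry sketch; not vLLM / vLLM Ascend code.
_ASCEND_OPS: dict[str, type] = {}

def register_oot(name: str):
    """Class decorator: register an out-of-tree op implementation by name."""
    def wrap(cls: type) -> type:
        _ASCEND_OPS[name] = cls
        return cls
    return wrap

class RMSNorm:
    """Stand-in for a vLLM CustomOp."""
    def forward(self, x):
        impl = _ASCEND_OPS.get("rms_norm")  # dispatch by name instead of monkey-patching
        return impl().forward_oot(x) if impl else self.forward_native(x)

    def forward_native(self, x):
        return x  # reference implementation placeholder

@register_oot("rms_norm")
class AscendRMSNorm:
    def forward_oot(self, x):
        return x  # the NPU kernel would be called here

print(RMSNorm().forward([1.0, 2.0]))  # dispatches to AscendRMSNorm.forward_oot
```

The point of the pattern is that the platform plugin only adds entries to a registry, so the base op class never needs to be patched in place.
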
 
 
2. Quality and Production Ready
- Unit test coverage enhancement: [RFC]: Unit test coverage improvement #1298
  - Coverage report: https://app.codecov.io/gh/vllm-project/vllm-ascend
- E2E test coverage
  - Model support: vLLM Ascend Model Support Priority #1608
  - Benchmark tests
  - Accuracy tests
- Module refactor
  - model arch
  - ops
  - attention
  - torchair
  - quantization
 
 
3. User / Developer Experience
- User docs: [RFC]: Doc enhancement #1248
- Developer design docs: [RFC]: Doc enhancement #1248
- Distributions
- Perf dashboard and accuracy report
- Developer experience
  - vLLM commit hash recording: Record vLLM commit in PR description #1623
 
 
4. Competitive for Key Workflows
- Large Scale Serving
  - EPLB (see the sketch below)
    - Dynamic EPLB: [RFC]: Dynamic Expert Load Balance with Zero-like-overhead vllm#22246
    - Static EPLB: Add static EPLB #1116
  - Qwen series (Qwen3 / Qwen3 MoE) optimization: [Perf] Optimize perf of Qwen3 #1245
  - Qwen series (Qwen3 MoE) optimization: [Bugfix] Support Qwen3-MOE on aclgraph mode #1381
  - Disaggregated prefilling
    - CP/SP: [RFC]: Context Parallelism && Sequence Parallelism vllm#22693
    - AF disaggregation: [RFC]: ATTN-FFN Disaggregation for MoE Models vllm#22799
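
For readers new to EPLB: expert parallel load balance chooses an expert-to-NPU placement so that measured per-expert token load is spread evenly; static EPLB computes the placement once from profiled load, while dynamic EPLB recomputes it during serving. A minimal greedy sketch of the placement step, assuming per-expert load counters are available; the function name and strategy are illustrative, not the vllm-ascend implementation.

```python
# Illustrative greedy expert placement; not vllm-ascend's EPLB code.
import heapq

def place_experts(expert_load: list[float], num_devices: int) -> list[list[int]]:
    """Longest-processing-time greedy: heaviest expert first,
    always onto the currently least-loaded device."""
    heap = [(0.0, d) for d in range(num_devices)]   # (device load, device id)
    heapq.heapify(heap)
    placement = [[] for _ in range(num_devices)]
    for expert in sorted(range(len(expert_load)), key=lambda e: -expert_load[e]):
        load, dev = heapq.heappop(heap)
        placement[dev].append(expert)
        heapq.heappush(heap, (load + expert_load[expert], dev))
    return placement

# 8 experts with skewed token counts, spread across 4 NPUs:
print(place_experts([900, 100, 80, 120, 300, 310, 95, 105], 4))
```

Dynamic EPLB (vllm#22246) would rerun a placement like this from live load counters during serving rather than once at startup.
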
- RLHF
  - Performance improvements
  - Parallel support
- Graph mode
  - Support full graph with multiple attention kernels: [RFC]: Support Full Graph with multiple attention kernels #1649
  - Automatic kernel fusion via graph rewriting (see the sketch below): [RFC]: Automatic Kernel Fusion via torch.fx.graph and graph rewriter for vLLM-Ascend #2386
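
The kernel fusion RFC builds on torch.fx: trace the model into a graph, search for fusible patterns, and rewrite matched subgraphs into a single kernel call. A self-contained sketch using standard torch.fx APIs; fused_add_relu is a stand-in for a real fused Ascend kernel, and the add+relu pattern is chosen only for illustration.

```python
import torch
import torch.fx as fx

def fused_add_relu(x, y):
    # Stand-in for a single fused kernel (e.g. a custom Ascend op).
    return torch.relu(torch.add(x, y))

class Block(torch.nn.Module):
    def forward(self, x, y):
        return torch.relu(torch.add(x, y))

gm = fx.symbolic_trace(Block())

# Graph rewriter: find relu(add(x, y)) and replace it with one fused call.
for node in list(gm.graph.nodes):
    if node.op == "call_function" and node.target is torch.relu:
        src = node.args[0]
        if (isinstance(src, fx.Node) and src.op == "call_function"
                and src.target is torch.add and len(src.users) == 1):
            with gm.graph.inserting_after(node):
                fused = gm.graph.call_function(fused_add_relu, src.args)
            node.replace_all_uses_with(fused)
            gm.graph.erase_node(node)
            gm.graph.erase_node(src)

gm.recompile()
print(gm.code)  # forward now contains a single fused_add_relu call
```

A production pass would match real NPU-fusible patterns and handle multi-user nodes, but the trace/match/rewrite/recompile loop is the same.
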
 
- Model
  - Qwen / DeepSeek / Qwen VL series
  - Gemma3
  - K2
  - New model support: vLLM Ascend Model Support Priority #1608
    - New trending models like minimax / hunyuan / ERNIE
  - Quantization support: w4a16 / w4a8 for dense models (see the sketch below)
  - Quantization support: w4a16 / w4a8 for MoE models
  - Model format support: AWQ, GGUF
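
To make the quantization items concrete: w4a16 means weights stored at 4 bits per value with 16-bit activations, the weights being dequantized per channel at matmul time (w4a8 is the same idea with 8-bit activations). A minimal sketch assuming symmetric per-output-channel scales; this is not the vllm-ascend quantization path.

```python
import torch

def quantize_w4(w: torch.Tensor):
    """Symmetric per-output-channel 4-bit quantization (int4 range [-8, 7])."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)  # int4 values in int8 storage
    return q, scale

def w4a16_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    """16-bit activations @ weights dequantized on the fly."""
    w = q.to(x.dtype) * scale.to(x.dtype)
    return x @ w.t()

w_fp32 = torch.randn(16, 32)                   # [out_features, in_features]
q, s = quantize_w4(w_fp32)
x = torch.randn(4, 32, dtype=torch.bfloat16)   # bf16 so the sketch also runs on CPU; fp16 on NPU
y = w4a16_linear(x, q, s)                      # (4, 16) output
```

A real kernel fuses the dequantize into the matmul instead of materializing the 16-bit weight, which is where the memory and bandwidth savings come from.
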
 
- Others
  - Atlas 300I series experimental support and perf enhancement: [Performance] Disable JIT and nd2nz to improve performance for Altlas 300I series #1591
  - LoRA performance enhancement: Add Custom Kernels For LoRA Performance #1884
 
 
If anything you want is missing from this roadmap, your suggestions and contributions are strongly welcomed! Please feel free to comment in this thread, open a feature request, or create an RFC.
Historical Roadmap: