Overview
We added basic V1 engine support to the main and 0.7.3-dev branches. You can try it now; any feedback is welcome.
How to use V1
Installation
You can use the main branches of vLLM and vllm-ascend to try it out:

```bash
# Install vLLM (latest)
git clone --depth 1 https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install . --extra-index-url https://download.pytorch.org/whl/cpu/
```
```bash
# Install vLLM Ascend (latest)
git clone --depth 1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e . --extra-index-url https://download.pytorch.org/whl/cpu/
```

Find more details here.
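After both installs, a quick way to confirm the packages are importable is to probe for them on the current path (a minimal sketch; the module name `vllm_ascend` is assumed from the repository name):

```python
# Post-install sanity check: report whether each package can be found
# without actually importing it (importing vllm pulls in heavy deps).
import importlib.util

def installed(module_name):
    """Return True if the module can be located on the current path."""
    return importlib.util.find_spec(module_name) is not None

for name in ("vllm", "vllm_ascend"):
    print(f"{name}: {'installed' if installed(name) else 'missing'}")
```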
Usage
Before using V1, you need to set the environment variables VLLM_USE_V1=1 and VLLM_WORKER_MULTIPROC_METHOD=spawn.
If you are using vLLM for offline inference, you also need to wrap your code in a `__main__` guard:

```python
if __name__ == '__main__':
    llm = vllm.LLM(...)
```

Find more details here.
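Putting the two requirements together, a minimal offline-inference script might look like the sketch below (the model name is a hypothetical example; actually running it requires vllm and vllm-ascend installed on an Ascend host):

```python
import os

# V1 requires these environment variables; set them before vllm is
# imported so the engine picks them up (values from the guide above).
os.environ.setdefault("VLLM_USE_V1", "1")
os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

def main():
    try:
        import vllm
    except ImportError:
        print("vllm is not installed; see the installation steps above")
        return
    # Hypothetical model name, for illustration only.
    llm = vllm.LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
    for output in llm.generate(["Hello, my name is"]):
        print(output.outputs[0].text)

# The spawn multiprocessing method re-imports this module in worker
# processes, which is why the entry point must be guarded.
if __name__ == "__main__":
    main()
```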
Test
Currently, the V1 engine E2E test is enabled in #389.
Run the command below to test V1 on vllm-ascend:

```bash
VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -sv tests
```

RoadMap
We are now working on full V1 Engine support. Here is the detailed status:
| Feature | vLLM Status | vllm-ascend Status | Next Step |
|---|---|---|---|
| Prefix Caching | 🚀 Optimized | No | Relies on CANN 8.1; needs more testing |
| Chunked Prefill | 🚀 Optimized | MLA not supported | Relies on the V1 MLAAttention backend and V0 MLAAttention chunked-prefill support |
| Logprobs Calculation | 🟢 Functional | 🟢 Functional | |
| LoRA | 🟢 Functional | 🟢 Functional | |
| Multimodal Models | 🟢 Functional | 🟢 Functional | |
| FP8 KV Cache | 🟢 Functional on Hopper devices | Not applicable | |
| Spec Decode | 🟢 Functional | 🟢 Functional | |
| Prompt Logprobs with Prefix Caching | 🟢 Functional | No | Rely on Prefix Caching feature |
| Structured Output Alternative Backends | 🟡 Planned | No | #177 |
| Embedding Models | 🟡 Planned | | |
| Mamba Models | 🟡 Planned | | |
| Encoder-Decoder Models | 🟡 Planned | | |
| Async Output | 🟢 Functional | 🟢 Functional | |
| Multi Step Scheduler | 🟢 Functional | 🟢 Functional | |
| Beam Search | 🟢 Functional | 🟢 Functional | |
| Guided Decoding | 🟢 Functional | 🟢 Functional | #177 |
| TP | 🟢 Functional | 🟢 Functional | |
| PP | 🟢 Functional | 🟢 Functional | |
| EP | 🟢 Functional | Needs testing | Needs performance improvement |
| DP | 🟢 Functional | No | DP support needs to be added |
| MTP | 🟢 Functional | Needs testing | Needs more functional testing |
| Model Support | 🟢 Functional | Only Qwen2/2.5 supported | |
| Quantization | 🟢 Functional | No | Working on W8A8 support |
| Ops | 🟢 Functional | 🟢 Functional | |
| Request-level Structured Output Backend | 🔴 Deprecated | | |
| best_of | 🔴 Deprecated | | |
| Per-Request Logits Processors | 🔴 Deprecated | | |
| GPU <> CPU KV Cache Swapping | 🔴 Deprecated | | |