
[Guide] V1 Engine #414

@shen-shanshan


Overview

We have added basic V1 engine support to the main and 0.7.3-dev branches. You can try it out now. Any feedback is welcome.

How to use V1

Installation

You can use the main branches of vLLM and vllm-ascend to try it out:

# Install vLLM (latest)
git clone --depth 1 https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install . --extra-index-url https://download.pytorch.org/whl/cpu/

# Install vLLM Ascend (latest)
git clone --depth 1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e . --extra-index-url https://download.pytorch.org/whl/cpu/

Find more details here.

Usage

Before using V1, you need to set the environment variables VLLM_USE_V1=1 and VLLM_WORKER_MULTIPROC_METHOD=spawn.
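For online serving, you can export these variables in the shell before launching the server. A minimal sketch (the model name below is only an example, not from this guide):

```shell
# Enable the V1 engine and the spawn worker start method for this session.
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn

# Then start the OpenAI-compatible server as usual, e.g.:
# vllm serve Qwen/Qwen2.5-7B-Instruct
```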

If you are using vLLM for offline inference, you also need to add a __main__ guard:

import vllm

if __name__ == '__main__':
    llm = vllm.LLM(...)

Find more details here.

Test

Currently, the V1 engine E2E test is enabled in #389.

Run the command shown below to test V1 on vllm-ascend:

VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -sv tests

RoadMap

We're now working on full V1 engine support. Here is the detailed status:

| Feature | vLLM Status | vllm-ascend Status | Next Step |
|---|---|---|---|
| Prefix Caching | 🚀 Optimized | No | Relies on CANN 8.1; needs more testing |
| Chunked Prefill | 🚀 Optimized | Doesn't support MLA | Relies on the V1 MLAAttention backend and V0 MLAAttention Chunked Prefill support |
| Logprobs Calculation | 🟢 Functional | 🟢 Functional | |
| LoRA | 🟢 Functional | 🟢 Functional | |
| Multimodal Models | 🟢 Functional | 🟢 Functional | |
| FP8 KV Cache | 🟢 Functional on Hopper devices | Unrelated | |
| Spec Decode | 🟢 Functional | 🟢 Functional | |
| Prompt Logprobs with Prefix Caching | 🟢 Functional | No | Relies on the Prefix Caching feature |
| Structured Output Alternative Backends | 🟡 Planned | No | #177 |
| Embedding Models | 🟡 Planned | | |
| Mamba Models | 🟡 Planned | | |
| Encoder-Decoder Models | 🟡 Planned | | |
| Async Output | 🟢 Functional | 🟢 Functional | |
| Multi Step Scheduler | 🟢 Functional | 🟢 Functional | |
| Beam Search | 🟢 Functional | 🟢 Functional | |
| Guided Decoding | 🟢 Functional | 🟢 Functional | #177 |
| TP | 🟢 Functional | 🟢 Functional | |
| PP | 🟢 Functional | 🟢 Functional | |
| EP | 🟢 Functional | Needs testing | Needs performance improvements |
| DP | 🟢 Functional | No | Needs DP support added |
| MTP | 🟢 Functional | Needs testing | Needs more functional testing |
| Model Support | 🟢 Functional | Only supports Qwen-2/2.5 | |
| Quantization | 🟢 Functional | No | Working on W8A8 support |
| Ops | 🟢 Functional | 🟢 Functional | |
| Request-level Structured Output Backend | 🔴 Deprecated | | |
| best_of | 🔴 Deprecated | | |
| Per-Request Logits Processors | 🔴 Deprecated | | |
| GPU <> CPU KV Cache Swapping | 🔴 Deprecated | | |
