Overview
We added basic V1 engine support to the main and 0.7.3-dev branches. You can try it now; any feedback is welcome.
How to use V1
Installation
You can use the main branches of vLLM and vllm-ascend to try it out:

```bash
# Install vLLM (latest)
git clone --depth 1 https://github.com/vllm-project/vllm
cd vllm
VLLM_TARGET_DEVICE=empty pip install . --extra-index-url https://download.pytorch.org/whl/cpu/
```
```bash
# Install vLLM Ascend (latest)
git clone --depth 1 https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e . --extra-index-url https://download.pytorch.org/whl/cpu/
```

Find more details here.
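After both installs, a quick way to confirm the packages are importable is to probe for them on the current path (a minimal sketch; the module name `vllm_ascend` is assumed from the repository name):

```python
# Post-install sanity check: report whether each package can be found
# without actually importing it (importing vllm pulls in heavy deps).
import importlib.util

def installed(module_name):
    """Return True if the module can be located on the current path."""
    return importlib.util.find_spec(module_name) is not None

for name in ("vllm", "vllm_ascend"):
    print(f"{name}: {'installed' if installed(name) else 'missing'}")
```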
Usage
Before using V1, you need to set the environment variables VLLM_USE_V1=1 and VLLM_WORKER_MULTIPROC_METHOD=spawn.
If you are using vLLM for offline inference, you also need to wrap your code in a `__main__` guard:

```python
if __name__ == '__main__':
    llm = vllm.LLM(...)
```

Find more details here.
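Putting the two requirements together, a minimal offline-inference script might look like the sketch below (the model name is a hypothetical example; actually running it requires vllm and vllm-ascend installed on an Ascend host):

```python
import os

# V1 requires these environment variables; set them before vllm is
# imported so the engine picks them up (values from the guide above).
os.environ.setdefault("VLLM_USE_V1", "1")
os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

def main():
    try:
        import vllm
    except ImportError:
        print("vllm is not installed; see the installation steps above")
        return
    # Hypothetical model name, for illustration only.
    llm = vllm.LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
    for output in llm.generate(["Hello, my name is"]):
        print(output.outputs[0].text)

# The spawn multiprocessing method re-imports this module in worker
# processes, which is why the entry point must be guarded.
if __name__ == "__main__":
    main()
```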
Test
Currently, the V1 engine E2E test is enabled in #389.
Run the command below to test V1 on vllm-ascend:

```bash
VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -sv tests
```

RoadMap
We are now working on full V1 Engine support. Here is the detailed status:
| Feature | vLLM Status | vllm-ascend Status | Next Step |
|---|---|---|---|
| Prefix Caching | 🚀 Optimized | No | Relies on CANN 8.1; needs more testing |
| Chunked Prefill | 🚀 Optimized | MLA not supported | Relies on the V1 MLAAttention backend and V0 MLAAttention chunked-prefill support |
| Logprobs Calculation | 🟢 Functional | 🟢 Functional | |
| LoRA | 🟢 Functional | 🟢 Functional | |
| Multimodal Models | 🟢 Functional | 🟢 Functional | |
| FP8 KV Cache | 🟢 Functional on Hopper devices | Not applicable | |
| Spec Decode | 🟢 Functional | 🟢 Functional | |
| Prompt Logprobs with Prefix Caching | 🟢 Functional | No | Rely on Prefix Caching feature |
| Structured Output Alternative Backends | 🟡 Planned | No | #177 |
| Embedding Models | 🟡 Planned | | |
| Mamba Models | 🟡 Planned | | |
| Encoder-Decoder Models | 🟡 Planned | | |
| Async Output | 🟢 Functional | 🟢 Functional | |
| Multi Step Scheduler | 🟢 Functional | 🟢 Functional | |
| Beam Search | 🟢 Functional | 🟢 Functional | |
| Guided Decoding | 🟢 Functional | 🟢 Functional | #177 |
| TP | 🟢 Functional | 🟢 Functional | |
| PP | 🟢 Functional | 🟢 Functional | |
| EP | 🟢 Functional | Needs testing | Needs performance improvement |
| DP | 🟢 Functional | No | DP support needs to be added |
| MTP | 🟢 Functional | Needs testing | Needs more functional testing |
| Model Support | 🟢 Functional | Only Qwen2/2.5 supported | |
| Quantization | 🟢 Functional | No | Working on W8A8 support |
| Ops | 🟢 Functional | 🟢 Functional | |
| Request-level Structured Output Backend | 🔴 Deprecated | | |
| best_of | 🔴 Deprecated | | |
| Per-Request Logits Processors | 🔴 Deprecated | | |
| GPU <> CPU KV Cache Swapping | 🔴 Deprecated | | |