
Commit cbebd7b

update
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
1 parent c403ec5 commit cbebd7b

File tree

2 files changed: +214 -1 lines changed

docs/source/installation.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ Before installing the package, you need to make sure firmware/driver and CANN i

### Install firmwares and drivers

-To verify that the Ascend NPU firmware and driver were correctly installed, run `npu-smi` info
+To verify that the Ascend NPU firmware and driver were correctly installed, run `npu-smi info`.

> Tips: Refer to [Ascend Environment Setup Guide](https://ascend.github.io/docs/sources/ascend/quick_install.html) for more details.

docs/source/tutorials.md

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
# Tutorials

## Run vllm-ascend on a Single NPU

### Offline Inference on a Single NPU

Run the docker container:

```bash
docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/models:/root/models \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:latest bash
```

Use the ModelScope mirror to speed up model download:

```bash
pip install modelscope
export VLLM_USE_MODELSCOPE=True
export MODELSCOPE_CACHE=/root/models/
```
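
Optionally, you can pre-download the model into the cache before running inference. A minimal sketch, assuming the `modelscope` package installed above exposes `snapshot_download` and using the same model ID and cache path as the rest of this tutorial:

```python
from modelscope import snapshot_download

# Download Qwen2.5-7B-Instruct into the cache directory configured above,
# so that later runs can reuse the local copy instead of downloading again.
model_dir = snapshot_download("Qwen/Qwen2.5-7B-Instruct", cache_dir="/root/models/")
print(f"Model downloaded to: {model_dir}")
```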

To avoid NPU out-of-memory errors, set `max_split_size_mb` to a value smaller than the allocations your workload needs:

```bash
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```

> [!NOTE]
> `max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).
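
If you are unsure how much headroom you have, you can check the NPU's current memory usage with the same `npu-smi info` command used to verify the installation, before and after loading the model:

```bash
npu-smi info
```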

Run the following script to execute offline inference on a single NPU:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Cap max_model_len so the KV cache fits on a single NPU.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=26240)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

If the script runs successfully, you should see output similar to the following:

```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```

### Online Serving on a Single NPU

Run the docker container:

```bash
docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/models:/root/models \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:latest bash
```

Use the ModelScope mirror to speed up model download:

```bash
pip install modelscope
export VLLM_USE_MODELSCOPE=True
export MODELSCOPE_CACHE=/root/models/
```

Set `max_split_size_mb` to a value smaller than the allocations your workload needs:

```bash
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```

Start the vLLM server on a single NPU:

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```

> [!NOTE]
> Add the `--max_model_len` option to avoid a ValueError stating that the Qwen2.5-7B model's max seq len (32768) is larger than the maximum number of tokens that can be stored in the KV cache (26240).
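
Once the server is up, you can optionally confirm it is reachable by listing the served models; the `/v1/models` endpoint is part of vLLM's OpenAI-compatible API:

```bash
curl http://localhost:8000/v1/models
```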

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```

If the query succeeds, you should see a response like the following on the client:

```bash
{"id":"cmpl-b25a59a2f985459781ce7098aeddfda7","object":"text_completion","created":1739523925,"model":"Qwen/Qwen2.5-7B-Instruct","choices":[{"index":0,"text":" here. It’s not just a","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}
```

Logs of the vLLM server:

```bash
INFO: 172.17.0.1:49518 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-13 08:34:35 logger.py:39] Received request cmpl-574f00e342904692a73fb6c1c986c521-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
```
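
Besides `curl`, you can also query the server from Python through its OpenAI-compatible API. A minimal sketch, assuming the `openai` client package is available in your environment (install it with `pip install openai` if needed):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server accepts any API key unless one was configured at startup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    prompt="The future of AI is",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)
```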

## Run vllm-ascend on Multi-NPU

### Distributed Inference on Multi-NPU

Run the docker container:

```bash
docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/models:/root/models \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:latest bash
```

Use the ModelScope mirror to speed up model download:

```bash
pip install modelscope
export VLLM_USE_MODELSCOPE=True
export MODELSCOPE_CACHE=/root/models/
```

Set `max_split_size_mb` to a value smaller than the allocations your workload needs:

```bash
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```

Run the following script to execute offline inference on multiple NPUs:

```python
import gc

import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    # Tear down the distributed workers and release cached NPU memory
    # so the process exits cleanly after inference.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Shard the model across 2 NPUs with tensor parallelism, using the
# multiprocessing executor backend.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",
          tensor_parallel_size=2,
          distributed_executor_backend="mp",
          max_model_len=26240)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()
```

If the script runs successfully, you should see output similar to the following:

```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```
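
This tutorial only shows offline inference on multiple NPUs, but the same tensor-parallel configuration can also be combined with the online serving flow from the single-NPU section. A minimal sketch, assuming vLLM's standard `--tensor-parallel-size` and `--distributed-executor-backend` options behave on NPU as they do for the offline case above:

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 26240 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend mp
```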
