
Commit 47433ad

shen-shanshan authored and Yikun committed
[Doc] Update tutorials (#79)
### What this PR does / why we need it?
Update tutorials.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
no.

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
1 parent 36d1349 commit 47433ad

File tree

1 file changed (+23, -3 lines)


docs/source/tutorials.md

Lines changed: 23 additions & 3 deletions
@@ -28,7 +28,6 @@ Setup environment variables:
 ```bash
 # Use Modelscope mirror to speed up model download
 export VLLM_USE_MODELSCOPE=True
-export MODELSCOPE_CACHE=/root/.cache/
 
 # To avoid NPU out of memory, set `max_split_size_mb` to any value lower than you need to allocate for Qwen2.5-7B-Instruct
 export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
@@ -82,7 +81,6 @@ docker run \
 -v /root/.cache:/root/.cache \
 -p 8000:8000 \
 -e VLLM_USE_MODELSCOPE=True \
--e MODELSCOPE_CACHE=/root/.cache/ \
 -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
 -it quay.io/ascend/vllm-ascend:latest \
 vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
@@ -91,6 +89,14 @@ vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
 > [!NOTE]
 > Add `--max_model_len` option to avoid ValueError that the Qwen2.5-7B model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (26240).
 
+If your service starts successfully, you can see the info shown below:
+
+```bash
+INFO: Started server process [6873]
+INFO: Waiting for application startup.
+INFO: Application startup complete.
+```
+
 Once your server is started, you can query the model with input prompts:
 
 ```bash
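The body of that query example lies outside this hunk. As a rough sketch of what a request against the server started above could look like (assuming the OpenAI-compatible `/v1/completions` endpoint on port 8000; the tutorial's own example may differ):

```python
# Minimal sketch, not taken from this commit: query the `vllm serve` endpoint
# started above. Assumes it listens on localhost:8000 and serves
# Qwen/Qwen2.5-7B-Instruct via the OpenAI-compatible completions API.
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "The future of AI is",
    "max_tokens": 64,
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["choices"][0]["text"])
```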
@@ -146,7 +152,6 @@ Setup environment variables:
 ```bash
 # Use Modelscope mirror to speed up model download
 export VLLM_USE_MODELSCOPE=True
-export MODELSCOPE_CACHE=/root/.cache/
 
 # To avoid NPU out of memory, set `max_split_size_mb` to any value lower than you need to allocate for Qwen2.5-7B-Instruct
 export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
@@ -155,7 +160,19 @@ export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
 Run the following script to execute offline inference on multi-NPU:
 
 ```python
+import gc
+
+import torch
+
 from vllm import LLM, SamplingParams
+from vllm.distributed.parallel_state import (destroy_distributed_environment,
+                                             destroy_model_parallel)
+
+def clean_up():
+    destroy_model_parallel()
+    destroy_distributed_environment()
+    gc.collect()
+    torch.npu.empty_cache()
 
 prompts = [
     "Hello, my name is",
@@ -172,6 +189,9 @@ for output in outputs:
     prompt = output.prompt
     generated_text = output.outputs[0].text
     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+del llm
+clean_up()
 ```
 
 If you run this script successfully, you can see the info shown below:
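The success log referenced above sits outside this hunk. For context, here is a rough, self-contained sketch of how the offline multi-NPU script fits together after this change; the `tensor_parallel_size` value and sampling settings are illustrative assumptions rather than lines taken from this commit:

```python
import gc

import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)


def clean_up():
    # Tear down vLLM's distributed state so NPU memory is actually released
    # when the script finishes.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# tensor_parallel_size=4 is an assumed multi-NPU setting for illustration.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=4)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

# Drop the engine explicitly, then release distributed state and NPU cache.
del llm
clean_up()
```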
