# Tutorials

## Run vllm-ascend on a Single NPU

### Offline Inference on a Single NPU

Run docker container:

```bash
docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/models:/root/models \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:latest bash
```
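
Optionally, verify that the NPU is visible from inside the container before continuing. The `npu-smi` tool is mounted into the container by the command above; the exact output depends on your driver version:

```bash
npu-smi info
```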

Use the ModelScope mirror to speed up the model download:

```bash
pip install modelscope
export VLLM_USE_MODELSCOPE=True
export MODELSCOPE_CACHE=/root/models/
```
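
Optionally, you can pre-download the model into the mounted cache directory so the first inference run does not block on the download. A minimal sketch using the ModelScope Python API installed above (the cache location is taken from `MODELSCOPE_CACHE`):

```bash
python -c "from modelscope import snapshot_download; snapshot_download('Qwen/Qwen2.5-7B-Instruct')"
```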

To avoid NPU out-of-memory errors, set `max_split_size_mb` to a value smaller than the size of the allocations that would otherwise fail:

```bash
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```

> [!NOTE]
> `max_split_size_mb` prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details [<u>here</u>](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/apiref/envref/envref_07_0061.html).

Run the following script to execute offline inference on a single NPU:

```python
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM; max_model_len is capped to fit the available KV cache.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=26240)

# Generate text from the prompts and print the results.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
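
To run it, save the snippet to a file (the filename below is just an example) and execute it with the Python interpreter inside the container:

```bash
python offline_inference.py
```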

If the script runs successfully, you should see output similar to the following:

```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```

### Online Serving on a Single NPU

Run docker container:

```bash
docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/models:/root/models \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:latest bash
```

Use the ModelScope mirror to speed up the model download:

```bash
pip install modelscope
export VLLM_USE_MODELSCOPE=True
export MODELSCOPE_CACHE=/root/models/
```

To avoid NPU out-of-memory errors, set `max_split_size_mb` to a value smaller than the size of the allocations that would otherwise fail:

```bash
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```

Start the vLLM server on a single NPU:

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```

> [!NOTE]
> Add the `--max_model_len` option to avoid a ValueError: the Qwen2.5-7B model's maximum sequence length (32768) is larger than the maximum number of tokens that can be stored in the KV cache (26240).
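
Once the server is up, you can optionally confirm which model it is serving; the endpoint below is part of the standard OpenAI-compatible API exposed by `vllm serve`:

```bash
curl http://localhost:8000/v1/models
```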

You can then query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "The future of AI is",
    "max_tokens": 7,
    "temperature": 0
  }'
```

If the query succeeds, the client receives a response like the following:

```bash
{"id":"cmpl-b25a59a2f985459781ce7098aeddfda7","object":"text_completion","created":1739523925,"model":"Qwen/Qwen2.5-7B-Instruct","choices":[{"index":0,"text":" here. It’s not just a","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}
```

Logs of the vLLM server:

```bash
INFO: 172.17.0.1:49518 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-13 08:34:35 logger.py:39] Received request cmpl-574f00e342904692a73fb6c1c986c521-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
```
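
Since Qwen2.5-7B-Instruct is a chat model, you can also call the OpenAI-compatible chat endpoint served at `/v1/chat/completions`. A minimal example (the message content is just an illustration):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 64,
    "temperature": 0
  }'
```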

## Run vllm-ascend on Multi-NPU

### Distributed Inference on Multi-NPU

Run docker container (this time mapping two NPU devices, `davinci0` and `davinci1`, for tensor parallelism):

```bash
docker run \
--name vllm-ascend \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/models:/root/models \
-p 8000:8000 \
-it quay.io/ascend/vllm-ascend:latest bash
```

Use the ModelScope mirror to speed up the model download:

```bash
pip install modelscope
export VLLM_USE_MODELSCOPE=True
export MODELSCOPE_CACHE=/root/models/
```

To avoid NPU out-of-memory errors, set `max_split_size_mb` to a value smaller than the size of the allocations that would otherwise fail:

```bash
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```

Run the following script to execute offline inference on multiple NPUs:

```python
import gc

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    # Tear down the distributed environment and release NPU memory after inference.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Shard the model across two NPUs with tensor parallelism, using the multiprocessing backend.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",
          tensor_parallel_size=2,
          distributed_executor_backend="mp",
          max_model_len=26240)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

del llm
clean_up()
```

If the script runs successfully, you should see output similar to the following:

```bash
Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
Prompt: 'The future of AI is', Generated text: ' following you. As the technology advances, a new report from the Institute for the'
```
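
If you want online serving on multiple NPUs instead, the same tensor-parallel setting can be passed to `vllm serve`. A minimal sketch, assuming the two-NPU container setup above:

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 26240
```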