
Commit fe826e0

add performance tuning doc to main
Signed-off-by: shen-shanshan <467638484@qq.com>
1 parent 9cbce42 commit fe826e0

3 files changed: +165 -0 lines changed
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
# Performance

:::{toctree}
:caption: Optimization
:maxdepth: 1
optimization_and_tuning
:::
Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
# Optimization and Tuning

This guide aims to help users improve vllm-ascend performance at the system level. It covers OS configuration, library optimization, deployment guidance, and so on. Any feedback is welcome.

## Preparation

Run the container:

```bash
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the CANN base image
export IMAGE=m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
docker run --rm \
    --name performance-test \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
```
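
Optionally, as a quick sanity check, confirm that the NPU is visible from inside the container (`npu-smi` is mounted from the host in the command above):

```bash
# List the NPU devices visible inside the container
npu-smi info
```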

Configure your environment:

```bash
# Configure the mirror
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" > /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list

# Install OS packages
apt update && apt install -y wget gcc g++ libnuma-dev git vim
```

## Optimizations

### 1. Compilation Optimization

#### 1.1. Install optimized `python`

Python has supported **LTO** and **PGO** optimization since version `3.6`; both can be enabled at compile time. For convenience, we offer compilation-optimized `python` packages directly to users. You can also reproduce the `python` build by following this [tutorial](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0063.html) according to your specific scenario.

```bash
mkdir -p /workspace/tmp
cd /workspace/tmp

# Download prebuilt libs and packages
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libcrypto.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libomp.so
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libssl.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/python/py311_bisheng.tar.gz

# Configure python and pip
cp ./*.so* /usr/local/lib
tar -zxvf ./py311_bisheng.* -C /usr/local/
mv /usr/local/py311_bisheng/ /usr/local/python
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3.11
ln -sf /usr/local/python/bin/python3 /usr/bin/python
ln -sf /usr/local/python/bin/python3 /usr/bin/python3
ln -sf /usr/local/python/bin/python3.11 /usr/bin/python3.11
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip3
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip

export PATH=/usr/bin:/usr/local/python/bin:$PATH
```
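
Optionally, verify that the optimized interpreter is the one now resolving on `PATH`. This is only a sanity check; the exact configure flags recorded in the prebuilt package depend on how it was built:

```bash
# Confirm the interpreter resolving on PATH is the newly installed build
which python3 && python3 --version

# Inspect the configure options recorded at build time; an LTO/PGO build
# typically reports flags such as --with-lto and --enable-optimizations
python3 -c "import sysconfig; print(sysconfig.get_config_var('CONFIG_ARGS'))"
```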

### 2. OS Optimization

#### 2.1. jemalloc

**jemalloc** is a memory allocator that improves performance in multi-threaded scenarios and reduces memory fragmentation. jemalloc uses a thread-local memory manager to allocate variables, which avoids lock contention between threads and can significantly improve performance.

```bash
# Install jemalloc
sudo apt update
sudo apt install libjemalloc2

# Configure jemalloc
export LD_PRELOAD="/usr/lib/$(uname -i)-linux-gnu/libjemalloc.so.2 $LD_PRELOAD"
```
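
To confirm that jemalloc is actually preloaded, check the loader output for the interpreter, mirroring the tcmalloc verification below:

```bash
# libjemalloc.so.2 should appear in the output when LD_PRELOAD is set correctly
ldd "$(which python)" | grep jemalloc
```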

#### 2.2. Tcmalloc

**Tcmalloc (Thread-Caching Malloc)** is a general-purpose memory allocator that improves overall performance while keeping latency low by introducing a multi-level cache structure, reducing mutex contention, and optimizing the handling of large objects. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/700/ptmoddevg/trainingmigrguide/performance_tuning_0068.html).

```bash
# Install tcmalloc
sudo apt update
sudo apt install libgoogle-perftools4 libgoogle-perftools-dev

# Get the location of libtcmalloc.so*
find /usr -name "libtcmalloc.so*"

# Raise the priority of tcmalloc
# <path> is the location of libtcmalloc.so found by the command above
# Example: "$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
export LD_PRELOAD="$LD_PRELOAD:<path>"

# Verify your configuration
# The path of libtcmalloc.so will appear in the output if the configuration is valid
ldd "$(which python)"
```

### 3. `torch_npu` Optimization

Some performance tuning features in `torch_npu` are controlled by environment variables; the most relevant ones are shown below.

Memory optimization:

```bash
# Upper limit (in MB) on the size of memory blocks that may be split.
# Setting this prevents large memory blocks from being split.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"

# When operators on the communication stream have dependencies, all of them must finish
# before their memory can be released for reuse. Multi-stream reuse releases memory on the
# communication stream in advance so that the computing stream can reuse it.
# Note: each export overwrites the previous value of PYTORCH_NPU_ALLOC_CONF,
# so pick the option that fits your workload.
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
```

Schedule optimization:

```bash
# Optimize the operator dispatch queue. This affects peak memory usage and may
# degrade performance if memory is tight.
export TASK_QUEUE_ENABLE=2

# This greatly improves performance for CPU-bound models and keeps performance
# unchanged for NPU-bound models.
export CPU_AFFINITY_CONF=1
```
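
These variables take effect only if they are set in the shell that launches the server, so that worker processes inherit them. Below is a minimal launch sketch; the model name and port are placeholders rather than recommendations from this guide:

```bash
# Export the tuning variables, then start the OpenAI-compatible server.
# Model name and port are placeholders -- substitute your own deployment.
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
export TASK_QUEUE_ENABLE=2
export CPU_AFFINITY_CONF=1

vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
```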

### 4. CANN Optimization

#### 4.1. HCCL Optimization

There are some performance tuning features in HCCL, which are controlled by environment variables.

You can configure HCCL to use "AIV" mode to optimize performance by setting the environment variable shown below. In "AIV" mode, communication is scheduled directly by the AI Vector cores over RoCE instead of by the AI CPU.

```bash
export HCCL_OP_EXPANSION_MODE="AIV"
```

In addition, there are more features for performance optimization in specific scenarios, listed below; an illustrative sketch follows the list.

- `HCCL_INTRA_ROCE_ENABLE`: Use the RDMA link instead of the SDMA link between two 8Ps as the mesh interconnect link; find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0044.html).
- `HCCL_RDMA_TC`: Configure the traffic class of the RDMA network card; find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Configure the service level of the RDMA network card; find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Control the buffer size used for sharing data between two NPUs; find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
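
For illustration, these variables are exported like any other HCCL setting. The values below are placeholders; choose real values from the linked documentation according to your network and workload:

```bash
# Illustrative values only -- consult the linked documents before changing them.
export HCCL_INTRA_ROCE_ENABLE=1   # use the RDMA link as the mesh interconnect
export HCCL_RDMA_TC=132           # traffic class of the RDMA NIC
export HCCL_RDMA_SL=4             # service level of the RDMA NIC
export HCCL_BUFFSIZE=512          # HCCL data-sharing buffer size, in MB
```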

docs/source/index.md

Lines changed: 1 addition & 0 deletions
@@ -58,6 +58,7 @@ user_guide/release_notes
 :maxdepth: 1
 developer_guide/contributing
 developer_guide/versioning_policy
+developer_guide/performance/index
 developer_guide/evaluation/index
 :::
