
Commit 88616b5

reidliu41 authored and amitm02 committed

[doc] improve readability (vllm-project#18675)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: amit <amit.man@gmail.com>
1 parent 4e2c5c0 commit 88616b5

20 files changed: +206 -59 lines changed

docs/contributing/dockerfile/dockerfile.md

Lines changed: 6 additions & 1 deletion
@@ -26,7 +26,12 @@ The edges of the build graph represent:
 > Commands to regenerate the build graph (make sure to run it **from the `root` directory of the vLLM repository** where the dockerfile is present):
 >
 > ```bash
-> dockerfilegraph -o png --legend --dpi 200 --max-label-length 50 --filename docker/Dockerfile
+> dockerfilegraph \
+>   -o png \
+>   --legend \
+>   --dpi 200 \
+>   --max-label-length 50 \
+>   --filename docker/Dockerfile
 > ```
 >
 > or in case you want to run it directly with the docker image:

docs/contributing/model/registration.md

Lines changed: 4 additions & 1 deletion
@@ -41,7 +41,10 @@ If your model imports modules that initialize CUDA, consider lazy-importing it t
 ```python
 from vllm import ModelRegistry
 
-ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
+ModelRegistry.register_model(
+    "YourModelForCausalLM",
+    "your_code:YourModelForCausalLM"
+)
 ```
 
 !!! warning
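
For context, a minimal sketch of how the lazily registered architecture above might be exercised end to end; the checkpoint path and the generate call are assumptions for illustration, not part of this diff:

```python
from vllm import LLM, ModelRegistry

# Register by string path so `your_code` (which may initialize CUDA on import)
# is only imported when the model is actually instantiated.
ModelRegistry.register_model(
    "YourModelForCausalLM",
    "your_code:YourModelForCausalLM"
)

# Hypothetical usage: the checkpoint's config must list "YourModelForCausalLM"
# in its `architectures` field for the registration to take effect.
llm = LLM(model="/path/to/your-model-checkpoint")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```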

docs/deployment/docker.md

Lines changed: 9 additions & 6 deletions
@@ -11,7 +11,7 @@ vLLM offers an official Docker image for deployment.
 The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
 
 ```console
-$ docker run --runtime nvidia --gpus all \
+docker run --runtime nvidia --gpus all \
     -v ~/.cache/huggingface:/root/.cache/huggingface \
     --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
     -p 8000:8000 \
@@ -23,7 +23,7 @@ $ docker run --runtime nvidia --gpus all \
 This image can also be used with other container engines such as [Podman](https://podman.io/).
 
 ```console
-$ podman run --gpus all \
+podman run --gpus all \
     -v ~/.cache/huggingface:/root/.cache/huggingface \
     --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
     -p 8000:8000 \
@@ -73,7 +73,10 @@ You can build and run vLLM from source via the provided <gh-file:docker/Dockerfi
 
 ```console
 # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
-DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai --file docker/Dockerfile
+DOCKER_BUILDKIT=1 docker build . \
+    --target vllm-openai \
+    --tag vllm/vllm-openai \
+    --file docker/Dockerfile
 ```
 
 !!! note
@@ -96,8 +99,8 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
 
 ```console
 # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
-$ python3 use_existing_torch.py
-$ DOCKER_BUILDKIT=1 docker build . \
+python3 use_existing_torch.py
+DOCKER_BUILDKIT=1 docker build . \
     --file docker/Dockerfile \
     --target vllm-openai \
     --platform "linux/arm64" \
@@ -113,7 +116,7 @@ $ DOCKER_BUILDKIT=1 docker build . \
 To run vLLM with the custom-built Docker image:
 
 ```console
-$ docker run --runtime nvidia --gpus all \
+docker run --runtime nvidia --gpus all \
     -v ~/.cache/huggingface:/root/.cache/huggingface \
     -p 8000:8000 \
     --env "HUGGING_FACE_HUB_TOKEN=<secret>" \

docs/deployment/frameworks/skypilot.md

Lines changed: 11 additions & 3 deletions
@@ -82,7 +82,11 @@ Check the output of the command. There will be a shareable gradio link (like the
 **Optional**: Serve the 70B model instead of the default 8B and use more GPU:
 
 ```console
-HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
+HF_TOKEN="your-huggingface-token" \
+    sky launch serving.yaml \
+    --gpus A100:8 \
+    --env HF_TOKEN \
+    --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
 ```
 
 ## Scale up to multiple replicas
@@ -155,7 +159,9 @@ run: |
 Start the serving the Llama-3 8B model on multiple replicas:
 
 ```console
-HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
+HF_TOKEN="your-huggingface-token" \
+    sky serve up -n vllm serving.yaml \
+    --env HF_TOKEN
 ```
 
 Wait until the service is ready:
@@ -318,7 +324,9 @@ run: |
 1. Start the chat web UI:
 
 ```console
-sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
+sky launch \
+    -c gui ./gui.yaml \
+    --env ENDPOINT=$(sky serve status --endpoint vllm)
 ```
 
 2. Then, we can access the GUI at the returned gradio link:

docs/deployment/frameworks/streamlit.md

Lines changed: 2 additions & 1 deletion
@@ -33,7 +33,8 @@ pip install streamlit openai
 streamlit run streamlit_openai_chatbot_webserver.py
 
 # or specify the VLLM_API_BASE or VLLM_API_KEY
-VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" streamlit run streamlit_openai_chatbot_webserver.py
+VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" \
+    streamlit run streamlit_openai_chatbot_webserver.py
 
 # start with debug mode to view more details
 streamlit run streamlit_openai_chatbot_webserver.py --logger.level=debug

docs/deployment/nginx.md

Lines changed: 31 additions & 4 deletions
@@ -77,7 +77,11 @@ If you are behind proxy, you can pass the proxy settings to the docker build com
 
 ```console
 cd $vllm_root
-docker build -f docker/Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy
+docker build \
+    -f docker/Dockerfile . \
+    --tag vllm \
+    --build-arg http_proxy=$http_proxy \
+    --build-arg https_proxy=$https_proxy
 ```
 
 [](){ #nginxloadbalancer-nginx-docker-network }
@@ -102,8 +106,26 @@ Notes:
 ```console
 mkdir -p ~/.cache/huggingface/hub/
 hf_cache_dir=~/.cache/huggingface/
-docker run -itd --ipc host --network vllm_nginx --gpus device=0 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
-docker run -itd --ipc host --network vllm_nginx --gpus device=1 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
+docker run \
+    -itd \
+    --ipc host \
+    --network vllm_nginx \
+    --gpus device=0 \
+    --shm-size=10.24gb \
+    -v $hf_cache_dir:/root/.cache/huggingface/ \
+    -p 8081:8000 \
+    --name vllm0 vllm \
+    --model meta-llama/Llama-2-7b-chat-hf
+docker run \
+    -itd \
+    --ipc host \
+    --network vllm_nginx \
+    --gpus device=1 \
+    --shm-size=10.24gb \
+    -v $hf_cache_dir:/root/.cache/huggingface/ \
+    -p 8082:8000 \
+    --name vllm1 vllm \
+    --model meta-llama/Llama-2-7b-chat-hf
 ```
 
 !!! note
@@ -114,7 +136,12 @@ docker run -itd --ipc host --network vllm_nginx --gpus device=1 --shm-size=10.24
 ## Launch Nginx
 
 ```console
-docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest
+docker run \
+    -itd \
+    -p 8000:80 \
+    --network vllm_nginx \
+    -v ./nginx_conf/:/etc/nginx/conf.d/ \
+    --name nginx-lb nginx-lb:latest
 ```
 
 [](){ #nginxloadbalancer-nginx-verify-nginx }
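
For context, a minimal sketch of exercising the load balancer launched above; it assumes the `requests` package and that nginx is listening on localhost:8000 in front of the two vLLM containers:

```python
import requests

# Model name matches what both vllm0 and vllm1 serve in the commands above.
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "San Francisco is a",
    "max_tokens": 16,
}

# Repeated requests to the nginx port should be spread across the two backends.
for _ in range(4):
    resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["text"])
```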

docs/features/quantization/auto_awq.md

Lines changed: 3 additions & 1 deletion
@@ -42,7 +42,9 @@ print(f'Model is quantized and saved at "{quant_path}"')
 To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
 
 ```console
-python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
+python examples/offline_inference/llm_engine_example.py \
+    --model TheBloke/Llama-2-7b-Chat-AWQ \
+    --quantization awq
 ```
 
 AWQ models are also supported directly through the LLM entrypoint:
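
A minimal sketch of that LLM-entrypoint usage with the same AWQ checkpoint; the prompt and sampling settings are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

# Load the pre-quantized AWQ checkpoint referenced in the command above.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="awq")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)
```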

docs/features/quantization/bitblas.md

Lines changed: 13 additions & 2 deletions
@@ -33,7 +33,12 @@ import torch
 
 # "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint.
 model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas"
-llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, quantization="bitblas")
+llm = LLM(
+    model=model_id,
+    dtype=torch.bfloat16,
+    trust_remote_code=True,
+    quantization="bitblas"
+)
 ```
 
 ## Read gptq format checkpoint
@@ -44,5 +49,11 @@ import torch
 
 # "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
 model_id = "hxbgsyxh/llama-13b-4bit-g-1"
-llm = LLM(model=model_id, dtype=torch.float16, trust_remote_code=True, quantization="bitblas", max_model_len=1024)
+llm = LLM(
+    model=model_id,
+    dtype=torch.float16,
+    trust_remote_code=True,
+    quantization="bitblas",
+    max_model_len=1024
+)
 ```
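
For context, a minimal sketch of generating text with the BitBLAS-quantized `LLM` constructed above; the prompt and sampling settings are illustrative assumptions:

```python
import torch
from vllm import LLM, SamplingParams

# Same construction as the reformatted call above (pre-quantized BitBLAS checkpoint).
llm = LLM(
    model="hxbgsyxh/llama-13b-4bit-g-1-bitblas",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitblas"
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=32)
outputs = llm.generate(["Large language models are"], sampling_params)
print(outputs[0].outputs[0].text)
```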

docs/features/quantization/bnb.md

Lines changed: 11 additions & 3 deletions
@@ -27,7 +27,11 @@ from vllm import LLM
 import torch
 # unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
 model_id = "unsloth/tinyllama-bnb-4bit"
-llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True)
+llm = LLM(
+    model=model_id,
+    dtype=torch.bfloat16,
+    trust_remote_code=True
+)
 ```
 
 ## Inflight quantization: load as 4bit quantization
@@ -38,8 +42,12 @@ For inflight 4bit quantization with BitsAndBytes, you need to explicitly specify
 from vllm import LLM
 import torch
 model_id = "huggyllama/llama-7b"
-llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
-quantization="bitsandbytes")
+llm = LLM(
+    model=model_id,
+    dtype=torch.bfloat16,
+    trust_remote_code=True,
+    quantization="bitsandbytes"
+)
 ```

docs/features/quantization/gguf.md

Lines changed: 8 additions & 3 deletions
@@ -14,14 +14,17 @@ To run a GGUF model with vLLM, you can download and use the local GGUF model fro
 ```console
 wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
 # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
-vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
+vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
+    --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
 ```
 
 You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
 
 ```console
 # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
-vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
+vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
+    --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
+    --tensor-parallel-size 2
 ```
 
 !!! warning
@@ -31,7 +34,9 @@ GGUF assumes that huggingface can convert the metadata to a config file. In case
 
 ```console
 # If you model is not supported by huggingface you can manually provide a huggingface compatible config path
-vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --hf-config-path Tinyllama/TInyLlama-1.1B-Chat-v1.0
+vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
+    --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
+    --hf-config-path Tinyllama/TInyLlama-1.1B-Chat-v1.0
 ```
 
 You can also use the GGUF model directly through the LLM entrypoint:
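
A minimal sketch of that LLM-entrypoint usage with the GGUF file downloaded above; reusing the base model's tokenizer mirrors the serve commands, and the prompt is an illustrative assumption:

```python
from vllm import LLM, SamplingParams

# Point vLLM at the local GGUF file and reuse the base model's tokenizer,
# as recommended in the comments above.
llm = LLM(
    model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Explain GGUF in one sentence:"], sampling_params)
print(outputs[0].outputs[0].text)
```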
