Commit 7297941

[Doc] Update LWS docs (#15163)
Signed-off-by: Edwinhr716 <Edandres249@gmail.com>
1 parent f8a08cb commit 7297941

1 file changed

docs/source/deployment/frameworks/lws.md

Lines changed: 189 additions & 2 deletions
A major use case is for multi-host/multi-node distributed inference.

vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kubernetes for distributed model serving.

## Prerequisites

* At least two Kubernetes nodes, each with 8 GPUs, are required.
* Install LWS by following the instructions found [here](https://lws.sigs.k8s.io/docs/installation/); a quick way to confirm the installation is sketched below.
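Once LWS is installed, you can sanity-check that the CRD is registered and that the controller is running. This is a minimal sketch; it assumes the controller runs in the default `lws-system` namespace, so adjust the namespace if your installation differs.

```bash
# The CRD name follows from the apiVersion used in the manifest below (leaderworkerset.x-k8s.io).
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io

# Assumed default namespace for the LWS controller; check the installation docs if it is not found.
kubectl get pods -n lws-system
```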
## Deploy and Serve

Deploy the following YAML file as `lws.yaml`:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            image: docker.io/vllm/vllm-openai:latest
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: docker.io/vllm/vllm-openai:latest
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```
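The manifest above embeds the Hugging Face token as a plain `value` for simplicity. A common alternative, shown here only as a sketch with a hypothetical secret name `hf-token-secret`, is to store the token in a Kubernetes Secret and reference it from the `env` entries with `valueFrom.secretKeyRef` instead of a literal `value`:

```bash
# Create the secret once in the target namespace (hypothetical name and key).
kubectl create secret generic hf-token-secret \
  --from-literal=token=<your-hf-token>
```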
Apply the manifest:

```bash
kubectl apply -f lws.yaml
```
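If you prefer to block until the whole group is ready, one option is `kubectl wait`, using the same `leaderworkerset.sigs.k8s.io/name=vllm` label that the Service above selects on. The timeout is only an illustrative value; a 405B model can take a long time to load.

```bash
kubectl wait pod \
  -l leaderworkerset.sigs.k8s.io/name=vllm \
  --for=condition=Ready \
  --timeout=30m
```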
Verify the status of the pods:

```bash
kubectl get pods
```

You should get output similar to this:

```bash
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s
vllm-1     1/1     Running   0          2s
vllm-1-1   1/1     Running   0          2s
```
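You can also inspect the `LeaderWorkerSet` resource itself, and describe a pod if something is stuck in `Pending` (for example, when not enough GPU nodes are schedulable):

```bash
# Uses the fully qualified resource name derived from the apiVersion above.
kubectl get leaderworkersets.leaderworkerset.x-k8s.io vllm

# Shows scheduling events for the leader pod of the first group.
kubectl describe pod vllm-0
```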
Verify that the distributed tensor-parallel inference works:

```bash
kubectl logs vllm-0 | grep -i "Loading model weights took"
```

You should see output similar to this:

```text
INFO 05-08 03:20:24 model_runner.py:173] Loading model weights took 0.1189 GB
(RayWorkerWrapper pid=169, ip=10.20.0.197) INFO 05-08 03:20:28 model_runner.py:173] Loading model weights took 0.1189 GB
```
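To check that both nodes joined the Ray cluster started by `multi-node-serving.sh`, you can run `ray status` on the leader. This assumes the `ray` CLI is available inside the container, which is the case for the `vllm/vllm-openai` image used here:

```bash
# Expect 2 nodes and 16 GPUs in total for this example.
kubectl exec vllm-0 -- ray status
```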
## Access the ClusterIP service

```bash
# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
kubectl port-forward svc/vllm-leader 8080:8080
```

The output should be similar to the following:

```text
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
```
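Once forwarding is up, you can confirm the server responds before sending a completion request. `/v1/models` is part of the OpenAI-compatible API served by vLLM:

```bash
# Should return a JSON list containing meta-llama/Meta-Llama-3.1-405B-Instruct.
curl http://localhost:8080/v1/models
```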
## Serve the model

Open another terminal and send a request:

```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```

The output should be similar to the following:

```text
{
  "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
  "object": "text_completion",
  "created": 1715138766,
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " top destination for foodies, with",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
```
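Because Meta-Llama-3.1-405B-Instruct is an instruction-tuned model, you can also use the chat endpoint of the same OpenAI-compatible API. A sketch of the equivalent request:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "messages": [{"role": "user", "content": "Name three landmarks in San Francisco."}],
    "max_tokens": 64,
    "temperature": 0
  }'
```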
