A major use case is for multi-host/multi-node distributed inference.

vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kubernetes for distributed model serving.
## Prerequisites

* At least two Kubernetes nodes, each with 8 GPUs, are required.
* Install LWS by following the instructions found [here](https://lws.sigs.k8s.io/docs/installation/). The quick checks below can be used to verify both prerequisites.
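
A minimal sketch for verifying these prerequisites, assuming the NVIDIA device plugin advertises GPUs as the `nvidia.com/gpu` resource and that LWS registers the `leaderworkersets.leaderworkerset.x-k8s.io` CRD:

```bash
# Show the GPU resource reported by each node (expect 8 on at least two nodes).
kubectl describe nodes | grep "nvidia.com/gpu"

# Confirm that the LeaderWorkerSet CRD from the LWS installation is present.
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
```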
## Deploy and Serve

Deploy the following yaml file `lws.yaml`:
```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            image: docker.io/vllm/vllm-openai:latest
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE);
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline_parallel_size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: docker.io/vllm/vllm-openai:latest
            command:
              - sh
              - -c
              - "bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
                memory: 1124Gi
                ephemeral-storage: 800Gi
              requests:
                ephemeral-storage: 800Gi
                cpu: 125
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP
```
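Rather than hard-coding the Hugging Face token in the manifest, you may prefer to store it in a Kubernetes Secret and reference it from the `HUGGING_FACE_HUB_TOKEN` environment variable via `valueFrom.secretKeyRef`. A minimal sketch, using a hypothetical secret name `hf-token-secret`:

```bash
# Hypothetical secret name; create it once per namespace so the token
# does not have to live inside lws.yaml.
kubectl create secret generic hf-token-secret \
  --from-literal=token=<your-hf-token>
```

The `env` entries in `lws.yaml` would then point at `secretKeyRef: {name: hf-token-secret, key: token}` instead of an inline `value`.
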
```bash
kubectl apply -f lws.yaml
```
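
Optionally, inspect the LeaderWorkerSet object itself; a quick check, assuming the `leaderworkerset` resource name registered by the CRD:

```bash
# Show the LeaderWorkerSet created from lws.yaml.
kubectl get leaderworkerset vllm
```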

Verify the status of the pods:

```bash
kubectl get pods
```

You should get output similar to this:

```bash
NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s
vllm-1     1/1     Running   0          2s
vllm-1-1   1/1     Running   0          2s
```
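
Pulling the image and loading Meta-Llama-3.1-405B-Instruct can take a while, so it may help to block until every pod in the group is ready. A sketch, assuming LWS labels the pods with `leaderworkerset.sigs.k8s.io/name` (the same label used by the Service selector above):

```bash
# Wait for all pods of the vllm LeaderWorkerSet to become Ready
# (generous timeout to allow for model download and loading).
kubectl wait pod \
  --selector leaderworkerset.sigs.k8s.io/name=vllm \
  --for=condition=Ready \
  --timeout=30m
```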

Verify that the distributed tensor-parallel inference works:

```bash
kubectl logs vllm-0 | grep -i "Loading model weights took"
```

You should get something similar to this:

```text
INFO 05-08 03:20:24 model_runner.py:173] Loading model weights took 0.1189 GB
(RayWorkerWrapper pid=169, ip=10.20.0.197) INFO 05-08 03:20:28 model_runner.py:173] Loading model weights took 0.1189 GB
```
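
Each replica of the LeaderWorkerSet serves its own copy of the model, so the same check can be repeated against the second group's leader:

```bash
# Repeat the check for the other replica's leader pod.
kubectl logs vllm-1 | grep -i "Loading model weights took"
```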

## Access the ClusterIP service

```bash
# Listen on port 8080 locally, forwarding to the targetPort of the service's port 8080 in a pod selected by the service
kubectl port-forward svc/vllm-leader 8080:8080
```

The output should be similar to the following:

```text
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
```
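
Before sending a completion request, you can confirm the server is reachable through the forwarded port; the OpenAI-compatible server exposes a model listing endpoint:

```bash
# List the models served behind the forwarded port.
curl http://localhost:8080/v1/models
```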

## Serve the model

Open another terminal and send a request:

```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```

The output should be similar to the following:

```text
{
  "id": "cmpl-1bb34faba88b43f9862cfbfb2200949d",
  "object": "text_completion",
  "created": 1715138766,
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " top destination for foodies, with",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
```
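
When you are done experimenting, the LeaderWorkerSet and its Service can be removed with the same manifest:

```bash
# Tear down the vllm LeaderWorkerSet and the vllm-leader Service.
kubectl delete -f lws.yaml
```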