
Commit 0daae74

KuntaiDu authored and Akshat-Tripathi committed
[Documentation] Add more deployment guide for Kubernetes deployment (vllm-project#13841)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
1 parent 39a2024 commit 0daae74

File tree

3 files changed: +166 -7 lines changed


docs/source/deployment/integrations/index.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -7,4 +7,5 @@ kserve
 kubeai
 llamastack
 llmaz
+production-stack
 :::
```
docs/source/deployment/integrations/production-stack.md

Lines changed: 154 additions & 0 deletions (new file)
(deployment-production-stack)=

# Production stack

Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, the [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:

* **Upstream vLLM compatibility** – It wraps around upstream vLLM without modifying its code.
* **Ease of use** – Simplified deployment via Helm charts and observability through Grafana dashboards.
* **High performance** – Optimized for LLM workloads with features like multi-model support, model-aware and prefix-aware routing, fast vLLM bootstrapping, and KV cache offloading with [LMCache](https://github.com/LMCache/LMCache), among others.

If you are new to Kubernetes, don't worry: the vLLM production stack [repo](https://github.com/vllm-project/production-stack) provides a step-by-step [guide](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) and a [short video](https://www.youtube.com/watch?v=EsTJbQtzj0g) to set up everything and get started in **4 minutes**!
## Prerequisites

Ensure that you have a running Kubernetes environment with GPUs (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).
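As a quick sanity check (assuming the NVIDIA device plugin is already installed, as covered in the tutorial above), you can verify that your nodes advertise GPU resources to Kubernetes:

```bash
# List GPU resources registered on each node; a non-empty match means
# the NVIDIA device plugin has exposed the GPUs to the scheduler
sudo kubectl describe nodes | grep -i "nvidia.com/gpu"
```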
## Deployment using vLLM production stack

The standard vLLM production stack installation uses a Helm chart. You can run this [bash script](https://github.com/vllm-project/production-stack/blob/main/tutorials/install-helm.sh) to install Helm on your GPU server.
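If you prefer to install Helm by hand instead, the official installer script achieves the same result (a sketch of the standard flow from the Helm documentation; the linked `install-helm.sh` may differ in details):

```bash
# Download and run the official Helm 3 installer script
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

# Verify that Helm is on the PATH
helm version
```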
To install the vLLM production stack, run the following commands on your desktop:

```bash
sudo helm repo add vllm https://vllm-project.github.io/production-stack
sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
```
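Once the install command returns, you can confirm that the Helm release was created:

```bash
# The release named "vllm" should appear with STATUS "deployed"
sudo helm list
```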
This will instantiate a vLLM-production-stack-based deployment named `vllm` that runs a small LLM (the facebook/opt-125m model).
### Validate Installation

Monitor the deployment status using:

```bash
sudo kubectl get pods
```
You will see the pods for the `vllm` deployment transition to the `Running` state:

```text
NAME                                           READY   STATUS    RESTARTS   AGE
vllm-deployment-router-859d8fb668-2x2b7        1/1     Running   0          2m38s
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs   1/1     Running   0          2m38s
```

**NOTE**: It may take some time for the containers to download the Docker images and LLM weights.
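While waiting, you can inspect a pod to watch the image pull and model download progress (the pod name below is illustrative; substitute one from your own `kubectl get pods` output):

```bash
# Show scheduling and image-pull events for the serving engine pod
sudo kubectl describe pod vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs

# Stream its logs to watch vLLM start up and load the model weights
sudo kubectl logs -f vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs
```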
### Send a Query to the Stack

Forward the `vllm-router-service` port to the host machine:

```bash
sudo kubectl port-forward svc/vllm-router-service 30080:80
```
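Note that `kubectl port-forward` runs in the foreground and keeps the tunnel open only while it is running, so leave it running in a separate terminal while you issue the queries below.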
You can then send a query to the OpenAI-compatible API to check the available models:

```bash
curl -o- http://localhost:30080/models
```
Expected output:

```json
{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1737428424,
      "owned_by": "vllm",
      "root": null
    }
  ]
}
```
To send an actual text-completion request, you can issue a curl request to the OpenAI-compatible `/completions` endpoint:

```bash
curl -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'
```
Expected output:

```json
{
  "id": "completion-id",
  "object": "text_completion",
  "created": 1737428424,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "text": " there was a brave knight who...",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
```
### Uninstall

To remove the deployment, run:

```bash
sudo helm uninstall vllm
```
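You can then confirm that the pods are gone, and check for any leftover persistent volume claims that you may want to remove manually:

```bash
# The vLLM pods should terminate and disappear
sudo kubectl get pods

# List any persistent volume claims that may remain after uninstall
sudo kubectl get pvc
```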
------

### (Advanced) Configuring vLLM production stack

The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:
```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"

    replicaCount: 1

    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1

    pvcStorage: "10Gi"
```
140+
141+
In this YAML configuration:
142+
* **`modelSpec`** includes:
143+
* `name`: A nickname that you prefer to call the model.
144+
* `repository`: Docker repository of vLLM.
145+
* `tag`: Docker image tag.
146+
* `modelURL`: The LLM model that you want to use.
147+
* **`replicaCount`**: Number of replicas.
148+
* **`requestCPU` and `requestMemory`**: Specifies the CPU and memory resource requests for the pod.
149+
* **`requestGPU`**: Specifies the number of GPUs required.
150+
* **`pvcStorage`**: Allocates persistent storage for the model.
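For illustration, here is a hypothetical variant of the same file that serves a larger model with two replicas. The model ID and resource numbers are made-up examples, and only the keys described above are used:

```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama3"                                # example nickname (hypothetical)
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"  # example Hugging Face model ID

    replicaCount: 2        # two serving-engine replicas behind the router

    requestCPU: 10
    requestMemory: "32Gi"
    requestGPU: 1

    pvcStorage: "50Gi"     # more persistent storage for larger weights
```

Note that gated models would additionally require credentials (e.g., a Hugging Face token), which the minimal keys shown here do not cover.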
**NOTE:** If you intend to set up two pods, please refer to this [YAML file](https://github.com/vllm-project/production-stack/blob/main/tutorials/assets/values-01-2pods-minimal-example.yaml).

**NOTE:** vLLM production stack offers many more features (*e.g.*, CPU offloading and a wide range of routing algorithms). Please check out these [examples and tutorials](https://github.com/vllm-project/production-stack/tree/main/tutorials) and our [repo](https://github.com/vllm-project/production-stack) for more details!

docs/source/deployment/k8s.md

Lines changed: 11 additions & 7 deletions
```diff
@@ -2,17 +2,21 @@
 
 # Using Kubernetes
 
-Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
+Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
 
-## Prerequisites
+--------
 
-Before you begin, ensure that you have the following:
+Alternatively, you can also deploy vLLM using the [Helm chart](https://docs.vllm.ai/en/latest/deployment/frameworks/helm.html). There are also open-source projects available to make your deployment even smoother.
 
-- A running Kubernetes cluster
-- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at `https://github.com/NVIDIA/k8s-device-plugin/`
-- Available GPU resources in your cluster
+* [vLLM production-stack](https://github.com/vllm-project/production-stack): Born out of a Berkeley-UChicago collaboration, vLLM production stack is a project that contains the latest research and community effort, while still delivering production-level stability and performance. Check out the [documentation page](https://docs.vllm.ai/en/latest/deployment/integrations/production-stack.html) for more details and examples.
 
-## Deployment Steps
+--------
+
+## Prerequisites
+
+Ensure that you have a running Kubernetes environment with GPUs (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).
+
+## Deployment using native K8s
 
 1. Create a PVC, Secret and Deployment for vLLM
 
```