|
(deployment-production-stack)=

# Production stack

Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, the [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:

* **Upstream vLLM compatibility** – It wraps around upstream vLLM without modifying its code.
* **Ease of use** – Simplified deployment via Helm charts and observability through Grafana dashboards.
* **High performance** – Optimized for LLM workloads with features such as multi-model support, model-aware and prefix-aware routing, fast vLLM bootstrapping, and KV cache offloading with [LMCache](https://github.com/LMCache/LMCache).

If you are new to Kubernetes, don't worry: the vLLM production stack [repo](https://github.com/vllm-project/production-stack) provides a step-by-step [guide](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) and a [short video](https://www.youtube.com/watch?v=EsTJbQtzj0g) to set everything up and get started in **4 minutes**!

## Prerequisites

Ensure that you have a running Kubernetes environment with GPU support (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).
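
Before installing, it can help to confirm that Kubernetes actually sees your GPUs. This is a minimal sanity check, assuming the NVIDIA device plugin is already installed (the tutorial linked above covers that step):

```bash
# Confirm the nodes are up and Ready
sudo kubectl get nodes

# Check that the nodes advertise GPU resources (requires the NVIDIA device plugin)
sudo kubectl describe nodes | grep -i "nvidia.com/gpu"
```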

## Deployment using vLLM production stack

The standard vLLM production stack installation uses a Helm chart. You can run this [bash script](https://github.com/vllm-project/production-stack/blob/main/tutorials/install-helm.sh) to install Helm on your GPU server.
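
If Helm is not installed yet, one way to get it is to clone the repository and run that script; this is a sketch based on the repository layout referenced above, and cloning also provides the `tutorials/` values files used in the commands below:

```bash
# Clone the production stack repo (contains tutorials/ and the example values files)
git clone https://github.com/vllm-project/production-stack.git
cd production-stack

# Install Helm via the helper script, then verify it is available
bash tutorials/install-helm.sh
sudo helm version
```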

To install the vLLM production stack, run the following commands on your desktop:

```bash
sudo helm repo add vllm https://vllm-project.github.io/production-stack
sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
```

This will instantiate a vLLM-production-stack-based deployment named `vllm` that runs a small LLM (the `facebook/opt-125m` model).
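
Before checking the pods, you can confirm that the Helm release itself was created. A quick sketch using standard Helm commands (the release name `vllm` matches the install command above):

```bash
# The "vllm" release should appear with STATUS "deployed"
sudo helm list

# Show more detail about the release
sudo helm status vllm
```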

### Validate Installation

Monitor the deployment status using:

```bash
sudo kubectl get pods
```

You should see the pods for the `vllm` deployment transition to the `Running` state:

```text
NAME                                           READY   STATUS    RESTARTS   AGE
vllm-deployment-router-859d8fb668-2x2b7        1/1     Running   0          2m38s
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs   1/1     Running   0          2m38s
```

**NOTE**: It may take some time for the containers to download the Docker images and LLM weights.
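
If a pod stays in a non-`Running` state for a while, you can follow its logs to watch the image pull and model download progress. This is a sketch; substitute the actual pod name reported by `kubectl get pods`:

```bash
# Replace with the actual pod name from `sudo kubectl get pods`
POD=vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs

# Stream the container logs to watch vLLM start up and fetch the model weights
sudo kubectl logs -f "$POD"

# Inspect scheduling or image-pull issues if the pod is stuck (e.g. Pending, ImagePullBackOff)
sudo kubectl describe pod "$POD"
```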

### Send a Query to the Stack

Forward the `vllm-router-service` port to the host machine:

```bash
sudo kubectl port-forward svc/vllm-router-service 30080:80
```

You can then send a query to the OpenAI-compatible API to check the available models:

```bash
curl -o- http://localhost:30080/models
```

Expected output:

```json
{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1737428424,
      "owned_by": "vllm",
      "root": null
    }
  ]
}
```

To send an actual completion request, issue a curl request to the OpenAI-compatible `/completions` endpoint:

```bash
curl -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'
```

Expected output:

```json
{
  "id": "completion-id",
  "object": "text_completion",
  "created": 1737428424,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "text": " there was a brave knight who...",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
```
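
For quick testing it can be convenient to extract just the generated text from the response. A small sketch, assuming `jq` is installed on your machine:

```bash
# Same request as above, printing only the generated completion text
curl -s -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }' | jq -r '.choices[0].text'
```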

### Uninstall

To remove the deployment, run:

```bash
sudo helm uninstall vllm
```
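
You can then verify that the resources are gone. Note that, depending on the chart's settings, persistent volume claims used for model storage may survive the uninstall; this check is a hedged sketch rather than part of the official uninstall flow:

```bash
# The vLLM pods and services should no longer be listed
sudo kubectl get pods
sudo kubectl get svc

# Persistent volume claims may remain and can be deleted separately if desired
sudo kubectl get pvc
```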

------

### (Advanced) Configuring vLLM production stack

The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:

```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"

    replicaCount: 1

    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1

    pvcStorage: "10Gi"
```

In this YAML configuration:
* **`modelSpec`** includes:
  * `name`: A nickname used to identify the model deployment.
  * `repository`: The Docker repository hosting the vLLM image.
  * `tag`: The Docker image tag.
  * `modelURL`: The LLM model to serve (here, a Hugging Face model ID).
* **`replicaCount`**: The number of replicas to deploy.
* **`requestCPU` and `requestMemory`**: The CPU and memory resource requests for the pod.
* **`requestGPU`**: The number of GPUs required.
* **`pvcStorage`**: The amount of persistent storage allocated for the model.

**NOTE:** If you intend to set up two pods, please refer to this [YAML file](https://github.com/vllm-project/production-stack/blob/main/tutorials/assets/values-01-2pods-minimal-example.yaml).

**NOTE:** vLLM production stack offers many more features (*e.g.*, CPU offloading and a wide range of routing algorithms). Please check out these [examples and tutorials](https://github.com/vllm-project/production-stack/tree/main/tutorials) and our [repo](https://github.com/vllm-project/production-stack) for more details!
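
To apply a modified configuration to an already-running deployment, you can pass your edited values file to `helm upgrade`. A minimal sketch; `my-values.yaml` is a placeholder for your customized file:

```bash
# Re-deploy the stack with an updated configuration file
sudo helm upgrade vllm vllm/vllm-stack -f my-values.yaml
```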