# Tutorial: Setting Up a Kubernetes Environment with GPUs on Your GPU Server

## Introduction

This tutorial guides you through setting up a Kubernetes environment on a GPU-enabled server. We will install and configure `kubectl`, `helm`, and `minikube`, ensuring GPU compatibility for workloads that require accelerated computing. By the end of this tutorial, you will have a fully functional Kubernetes environment ready to deploy the vLLM Production Stack.

## Table of Contents

- [Introduction](#introduction)
- [Table of Contents](#table-of-contents)
- [Prerequisites](#prerequisites)
- [Steps](#steps)
  - [Step 1: Installing kubectl](#step-1-installing-kubectl)
  - [Step 2: Installing Helm](#step-2-installing-helm)
  - [Step 3: Installing Minikube with GPU Support](#step-3-installing-minikube-with-gpu-support)
  - [Step 4: Verifying GPU Configuration](#step-4-verifying-gpu-configuration)

## Prerequisites

Before you begin, ensure the following:

1. **GPU Server Requirements:**
   - A server with a GPU and drivers properly installed (e.g., NVIDIA drivers).
   - [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) installed for GPU workloads.

2. **Access and Permissions:**
   - Root or administrative access to the server.
   - Internet connectivity to download required packages and tools.

3. **Environment Setup:**
   - A Linux-based operating system (e.g., Ubuntu 20.04 or later).
   - Basic understanding of Linux shell commands.

## Steps

### Step 1: Installing kubectl

1. Clone the repository and navigate to the [`utils/`](../utils/) folder:

   ```bash
   git clone https://github.com/vllm-project/production-stack.git
   cd production-stack/utils
   ```

2. Execute the script [`install-kubectl.sh`](../utils/install-kubectl.sh):

   ```bash
   bash install-kubectl.sh
   ```

3. **Explanation:**
   This script downloads the latest version of [`kubectl`](https://kubernetes.io/docs/reference/kubectl), the Kubernetes command-line tool, and places it in your PATH for easy execution.

4. **Expected Output:**
   - Confirmation that `kubectl` was downloaded and installed.
   - Verify the installation with:

     ```bash
     kubectl version --client
     ```

     Example output:

     ```plaintext
     Client Version: v1.32.1
     ```

### Step 2: Installing Helm

1. Execute the script [`install-helm.sh`](../utils/install-helm.sh):

   ```bash
   bash install-helm.sh
   ```

2. **Explanation:**
   - Downloads and installs Helm, a package manager for Kubernetes.
   - Places the Helm binary in your PATH.

3. **Expected Output:**
   - Successful installation of Helm.
   - Verify the installation with:

     ```bash
     helm version
     ```

     Example output:

     ```plaintext
     version.BuildInfo{Version:"v3.17.0", GitCommit:"301108edc7ac2a8ba79e4ebf5701b0b6ce6a31e4", GitTreeState:"clean", GoVersion:"go1.23.4"}
     ```

### Step 3: Installing Minikube with GPU Support

Before proceeding, ensure Docker runs without requiring sudo. To add your user to the docker group, run:

```bash
sudo usermod -aG docker $USER && newgrp docker
```

If Minikube is already installed on your system, we recommend uninstalling the existing version before proceeding. You may use one of the following commands based on your operating system and package manager:

```bash
# Ubuntu / Debian
sudo apt remove minikube

# RHEL / CentOS / Fedora
sudo yum remove minikube
# or
sudo dnf remove minikube

# macOS (installed via Homebrew)
brew uninstall minikube

# Arch Linux
sudo pacman -Rs minikube

# Windows (via Chocolatey)
choco uninstall minikube

# Windows (via Scoop)
scoop uninstall minikube
```

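Before reinstalling, you can confirm that no stale binary remains on your PATH; this small check only reads the environment and changes nothing:

```bash
# Report whether a minikube binary is still present on the PATH
if command -v minikube >/dev/null; then
  echo "minikube still installed at $(command -v minikube)"
else
  echo "minikube removed"
fi
```
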
After removing the previous installation, execute the script below to install the latest version.

1. Execute the script `install-minikube-cluster.sh`:

   ```bash
   bash install-minikube-cluster.sh
   ```

2. **Explanation:**
   - Installs Minikube if not already installed.
   - Configures the system to support GPU workloads by enabling the NVIDIA Container Toolkit and starting Minikube with GPU support.
   - Installs the NVIDIA `gpu-operator` chart to manage GPU resources within the cluster.

3. **Expected Output:**
   If everything goes smoothly, you should see output similar to the following:

   ```plaintext
   😄 minikube v1.35.0 on Ubuntu 22.04 (kvm/amd64)
   ❗ minikube skips various validations when --force is supplied; this may lead to unexpected behavior
   ✨ Using the docker driver based on user configuration
   ......
   ......
   🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
   "nvidia" has been added to your repositories
   Hang tight while we grab the latest from your chart repositories...
   ......
   ......
   NAME: gpu-operator-1737507918
   LAST DEPLOYED: Wed Jan 22 01:05:21 2025
   NAMESPACE: gpu-operator
   STATUS: deployed
   REVISION: 1
   TEST SUITE: None
   ```

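   To watch the operator components come up, you can list the pods in the `gpu-operator` namespace (pod names vary with the chart version):

   ```bash
   # All gpu-operator pods should eventually reach Running or Completed status
   kubectl get pods -n gpu-operator
   ```
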
4. Troubleshooting tips for installing gpu-operator:

   If gpu-operator fails to start because of the commonly seen "too many open files" issue for minikube (and [kind](https://kind.sigs.k8s.io/)), the quick fix below may help.

   The issue shows up as one or more gpu-operator pods stuck in `CrashLoopBackOff` status, and can be confirmed by checking their logs. For example:

   ```console
   $ kubectl -n gpu-operator logs daemonset/nvidia-device-plugin-daemonset -c nvidia-device-plugin
   IS_HOST_DRIVER=true
   NVIDIA_DRIVER_ROOT=/
   DRIVER_ROOT_CTR_PATH=/host
   NVIDIA_DEV_ROOT=/
   DEV_ROOT_CTR_PATH=/host
   Starting nvidia-device-plugin
   I0131 19:35:42.895845       1 main.go:235] "Starting NVIDIA Device Plugin" version=<
           d475b2cf
           commit: d475b2cfcf12b983a4975d4fc59d91af432cf28e
   >
   I0131 19:35:42.895917       1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
   E0131 19:35:42.895933       1 main.go:173] failed to create FS watcher for /var/lib/kubelet/device-plugins/: too many open files
   ```

   The fix is [well documented](https://kind.sigs.k8s.io/docs/user/known-issues#pod-errors-due-to-too-many-open-files) by kind, and it also works for minikube.

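   In short, the documented fix raises the host's inotify limits; the values below are the ones suggested by the kind known-issues page, and you can add the same settings to `/etc/sysctl.conf` to persist them across reboots:

   ```bash
   # Raise inotify limits; "too many open files" comes from hitting these
   sudo sysctl fs.inotify.max_user_watches=524288
   sudo sysctl fs.inotify.max_user_instances=512
   ```
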
### Step 4: Verifying GPU Configuration

1. Ensure Minikube is running:

   ```bash
   minikube status
   ```

   Expected output:

   ```plaintext
   minikube
   type: Control Plane
   host: Running
   kubelet: Running
   apiserver: Running
   kubeconfig: Configured
   ```

2. Verify GPU access within Kubernetes:

   ```bash
   kubectl describe nodes | grep -i gpu
   ```

   Expected output:

   ```plaintext
   nvidia.com/gpu: 1
   ... (plus many more GPU-related lines)
   ```

3. Deploy a test GPU workload:

   ```bash
   kubectl run gpu-test --image=nvidia/cuda:12.2.0-runtime-ubuntu22.04 --restart=Never -- nvidia-smi
   ```

   Wait for Kubernetes to pull the image and create the pod, then check the logs to confirm GPU usage:

   ```bash
   kubectl logs gpu-test
   ```

   You should see the `nvidia-smi` output in the terminal.

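   Once you have confirmed the output, the one-off test pod can be removed:

   ```bash
   # Clean up the test pod created above
   kubectl delete pod gpu-test
   ```
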
## Conclusion

By following this tutorial, you have successfully set up a Kubernetes environment with GPU support on your server. You are now ready to deploy and test the vLLM Production Stack on Kubernetes. For further configuration and workload-specific setups, consult the official documentation for `kubectl`, `helm`, and `minikube`.

What's next:

- [01-minimal-helm-installation](https://github.com/vllm-project/production-stack/blob/main/tutorials/01-minimal-helm-installation.md)