Commit d41dbb9

[Doc] Elaborated basic pipeline parallelism tutorial document.
Signed-off-by: insukim1994 <insu.kim@moreh.io>
1 parent f93aff3 commit d41dbb9

File tree: 3 files changed, +353 -49 lines changed

Lines changed: 237 additions & 0 deletions

# Tutorial: Setting Up a Kubernetes Environment with GPUs on Your GPU Server

## Introduction

This tutorial guides you through the process of setting up a Kubernetes environment on a GPU-enabled server. We will install and configure `kubectl`, `helm`, and `minikube`, ensuring GPU compatibility for workloads that require accelerated computing. By the end of this tutorial, you will have a fully functional Kubernetes environment ready to deploy the vLLM Production Stack.

## Table of Contents

- [Introduction](#introduction)
- [Table of Contents](#table-of-contents)
- [Prerequisites](#prerequisites)
- [Steps](#steps)
  - [Step 1: Installing kubectl](#step-1-installing-kubectl)
  - [Step 2: Installing Helm](#step-2-installing-helm)
  - [Step 3: Installing Minikube with GPU Support](#step-3-installing-minikube-with-gpu-support)
  - [Step 4: Verifying GPU Configuration](#step-4-verifying-gpu-configuration)

## Prerequisites

Before you begin, ensure the following:

1. **GPU Server Requirements:**
   - A server with a GPU and drivers properly installed (e.g., NVIDIA drivers).
   - [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) installed for GPU workloads (a quick verification snippet follows this list).

2. **Access and Permissions:**
   - Root or administrative access to the server.
   - Internet connectivity to download required packages and tools.

3. **Environment Setup:**
   - A Linux-based operating system (e.g., Ubuntu 20.04 or later).
   - Basic understanding of Linux shell commands.
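
Before moving on, you can quickly sanity-check these prerequisites. This is an optional verification sketch; the CUDA image tag is just an example:

```bash
# Check that the NVIDIA driver is installed and sees the GPU(s)
nvidia-smi

# Check that Docker can expose GPUs to containers via the NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.2.0-runtime-ubuntu22.04 nvidia-smi
```
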
## Steps

### Step 1: Installing kubectl

1. Clone the repository and navigate to the [`utils/`](../utils/) folder:

   ```bash
   git clone https://github.com/vllm-project/production-stack.git
   cd production-stack/utils
   ```

2. Execute the script [`install-kubectl.sh`](../utils/install-kubectl.sh):

   ```bash
   bash install-kubectl.sh
   ```

3. **Explanation:**
   This script downloads the latest version of [`kubectl`](https://kubernetes.io/docs/reference/kubectl), the Kubernetes command-line tool, and places it in your PATH for easy execution (a manual equivalent is sketched after this list).

4. **Expected Output:**
   - Confirmation that `kubectl` was downloaded and installed.
   - Verify the installation with:

   ```bash
   kubectl version --client
   ```

   Example output:

   ```plaintext
   Client Version: v1.32.1
   ```
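
If you prefer not to use the helper script, or want to see roughly what it automates, the following is a sketch based on the standard upstream installation method for Linux x86_64 (not the literal contents of `install-kubectl.sh`):

```bash
# Download the latest stable kubectl release for Linux x86_64
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"

# Install it into a directory on your PATH
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Confirm the client is available
kubectl version --client
```
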
### Step 2: Installing Helm

1. Execute the script [`install-helm.sh`](../utils/install-helm.sh):

   ```bash
   bash install-helm.sh
   ```

2. **Explanation:**
   - Downloads and installs Helm, a package manager for Kubernetes (a manual equivalent is sketched after this list).
   - Places the Helm binary in your PATH.

3. **Expected Output:**
   - Successful installation of Helm.
   - Verify the installation with:

   ```bash
   helm version
   ```

   Example output:

   ```plaintext
   version.BuildInfo{Version:"v3.17.0", GitCommit:"301108edc7ac2a8ba79e4ebf5701b0b6ce6a31e4", GitTreeState:"clean", GoVersion:"go1.23.4"}
   ```
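
As with kubectl, the helper script can be approximated manually. The snippet below follows Helm's official installer-script route and is only a sketch, not the contents of `install-helm.sh`:

```bash
# Download and run Helm's official install script
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Confirm the installation
helm version
```
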
### Step 3: Installing Minikube with GPU Support

Before proceeding, ensure Docker runs without requiring sudo. To add your user to the docker group, run:

```bash
sudo usermod -aG docker $USER && newgrp docker
```
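
To confirm the group change took effect, any Docker command should now work without sudo; for example:

```bash
# Should list containers (possibly none) without a permission-denied error
docker ps
```
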
If Minikube is already installed on your system, we recommend uninstalling the existing version before proceeding.
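
While the old binary is still present, you may also want to clear any existing local clusters and cached state. `minikube delete --all --purge` is the documented way to do that:

```bash
# Remove all local Minikube clusters and purge the ~/.minikube cache
minikube delete --all --purge
```
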
You may use one of the following commands to remove the existing binary, depending on your operating system and package manager:

```bash
# Ubuntu / Debian
sudo apt remove minikube

# RHEL / CentOS / Fedora
sudo yum remove minikube
# or
sudo dnf remove minikube

# macOS (installed via Homebrew)
brew uninstall minikube

# Arch Linux
sudo pacman -Rs minikube

# Windows (via Chocolatey)
choco uninstall minikube

# Windows (via Scoop)
scoop uninstall minikube
```

After removing the previous installation, run the installation script described below to install the latest version.
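
For context, the script broadly automates a sequence like the following. This is a hedged sketch, not the literal contents of `install-minikube-cluster.sh`; the exact flags and chart versions may differ:

```bash
# 1. Install the Minikube binary (Linux x86_64)
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

# 2. Start a single-node cluster with GPU passthrough using the docker driver
minikube start --driver=docker --container-runtime=docker --gpus=all

# 3. Install the NVIDIA gpu-operator Helm chart to manage GPU resources in the cluster
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
```
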
1. Execute the script `install-minikube-cluster.sh`:

   ```bash
   bash install-minikube-cluster.sh
   ```

2. **Explanation:**
   - Installs Minikube if it is not already installed.
   - Configures the system to support GPU workloads by enabling the NVIDIA Container Toolkit and starting Minikube with GPU support.
   - Installs the NVIDIA `gpu-operator` chart to manage GPU resources within the cluster.

3. **Expected Output:**
   If everything goes smoothly, you should see example output like the following:

   ```plaintext
   😄 minikube v1.35.0 on Ubuntu 22.04 (kvm/amd64)
   ❗ minikube skips various validations when --force is supplied; this may lead to unexpected behavior
   ✨ Using the docker driver based on user configuration
   ......
   ......
   🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
   "nvidia" has been added to your repositories
   Hang tight while we grab the latest from your chart repositories...
   ......
   ......
   NAME: gpu-operator-1737507918
   LAST DEPLOYED: Wed Jan 22 01:05:21 2025
   NAMESPACE: gpu-operator
   STATUS: deployed
   REVISION: 1
   TEST SUITE: None
   ```

4. Some troubleshooting tips for installing gpu-operator:

   If gpu-operator fails to start because of the commonly seen “too many open files” issue for minikube (and [kind](https://kind.sigs.k8s.io/)), the quick fix below may help.

   The issue shows up as one or more gpu-operator pods stuck in `CrashLoopBackOff` status, and can be confirmed by checking their logs. For example:

   ```console
   $ kubectl -n gpu-operator logs daemonset/nvidia-device-plugin-daemonset -c nvidia-device-plugin
   IS_HOST_DRIVER=true
   NVIDIA_DRIVER_ROOT=/
   DRIVER_ROOT_CTR_PATH=/host
   NVIDIA_DEV_ROOT=/
   DEV_ROOT_CTR_PATH=/host
   Starting nvidia-device-plugin
   I0131 19:35:42.895845 1 main.go:235] "Starting NVIDIA Device Plugin" version=<
   d475b2cf
   commit: d475b2cfcf12b983a4975d4fc59d91af432cf28e
   >
   I0131 19:35:42.895917 1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
   E0131 19:35:42.895933 1 main.go:173] failed to create FS watcher for /var/lib/kubelet/device-plugins/: too many open files
   ```

   The fix is [well documented](https://kind.sigs.k8s.io/docs/user/known-issues#pod-errors-due-to-too-many-open-files) by kind, and it also works for minikube; see the sketch below.
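
For reference, the documented workaround is to raise the host's inotify limits. The values below are the ones suggested on the kind known-issues page; they take effect immediately and can also be persisted in `/etc/sysctl.conf`:

```bash
# Increase inotify limits so the device plugin's FS watcher can start
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
```
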
### Step 4: Verifying GPU Configuration

1. Ensure Minikube is running:

   ```bash
   minikube status
   ```

   Expected output:

   ```plaintext
   minikube
   type: Control Plane
   host: Running
   kubelet: Running
   apiserver: Running
   kubeconfig: Configured
   ```

2. Verify GPU access within Kubernetes:

   ```bash
   kubectl describe nodes | grep -i gpu
   ```

   Expected output:

   ```plaintext
   nvidia.com/gpu: 1
   ... (plus many lines related to gpu information)
   ```

3. Deploy a test GPU workload:

   ```bash
   kubectl run gpu-test --image=nvidia/cuda:12.2.0-runtime-ubuntu22.04 --restart=Never -- nvidia-smi
   ```

   Wait for Kubernetes to pull the image and start the pod, then check the logs to confirm GPU access:

   ```bash
   kubectl logs gpu-test
   ```

   You should see the `nvidia-smi` output in the terminal. (An explicit-manifest variant of this check is sketched after this list.)
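
The `kubectl run` one-liner above relies on the cluster's runtime defaults to expose the GPU. If you prefer to request the GPU explicitly through the device plugin, a hypothetical manifest like the following (pod name and image tag are illustrative) performs the same check:

```bash
# Apply a pod that explicitly requests one GPU via resources.limits
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-explicit
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Inspect the result once the pod has completed
kubectl logs gpu-test-explicit
```
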
## Conclusion

By following this tutorial, you have successfully set up a Kubernetes environment with GPU support on your server. You are now ready to deploy and test the vLLM Production Stack on Kubernetes. For further configuration and workload-specific setups, consult the official documentation for `kubectl`, `helm`, and `minikube`.

What's next:

- [01-minimal-helm-installation](https://github.com/vllm-project/production-stack/blob/main/tutorials/01-minimal-helm-installation.md)
