Commit 358e908 (parent 7fbd43a), authored by saurabh-nvidia and Ubuntu:
docs: Adding document for running Dynamo on Azure Kubernetes Services (#2080)

# Dynamo on AKS

This document covers deploying Dynamo Cloud and running inference with the vLLM distributed runtime on an Azure Kubernetes Service (AKS) cluster, all the way from setup to testing inference.

### Task 1. Infrastructure Deployment

1. Open **Azure Cloud Shell** or a terminal on an Azure VM and install the prerequisites:
```bash
az login

az extension add --name aks-preview
az extension update --name aks-preview
```

Generate an RSA SSH key to use with the AKS cluster:
```bash
ssh-keygen -t rsa -b 4096 -C "<email@id.com>"
```

2. Create the AKS cluster:
```bash
export REGION=<region>
export RESOURCE_GROUP=<rg_name>
export ZONE=<zone>
export CLUSTER_NAME=<aks_cluster_name>
export CPU_COUNT=1

az aks create -g $RESOURCE_GROUP -n $CLUSTER_NAME --location $REGION --zones $ZONE --node-count $CPU_COUNT --enable-node-public-ip --ssh-key-value /home/user/.ssh/id_rsa.pub
```
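
The `az aks create` call above silently misbehaves if any of the exported variables is empty. As an optional guard (not part of the original doc; `require_vars` and the example values are hypothetical), you can fail fast before invoking `az`:

```shell
# Hypothetical guard: verify required environment variables are set
# before running `az aks create`.
require_vars() {
  for v in "$@"; do
    # indirect lookup of the variable named by $v (portable, no bash-isms)
    val=$(eval "printf '%s' \"\${$v}\"")
    if [ -z "$val" ]; then
      echo "missing: $v"
      return 1
    fi
  done
  echo "all set"
}

# Example with hypothetical values:
export REGION=eastus RESOURCE_GROUP=dynamo-rg CLUSTER_NAME=dynamo-aks
require_vars REGION RESOURCE_GROUP CLUSTER_NAME   # prints: all set
```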

3. Check that the cluster was created correctly:
```bash
# Get credentials
az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME

kubectl config get-contexts

# You should see output like this:
CURRENT   NAME         CLUSTER      AUTHINFO                                   NAMESPACE
*         dynamo-aks   dynamo-aks   clusterUser_<rg_name>_<aks_cluster_name>
```

4. Create a GPU node pool. You can use any number of nodes and any GPU SKU; here we use 4 nodes of `standard_nc24ads_a100_v4`, each of which has one A100 GPU.
```bash
az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME --name gpupool --node-count 4 --skip-gpu-driver-install --node-vm-size standard_nc24ads_a100_v4 --node-osdisk-size 2048 --max-pods 110
```
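
Once the drivers are installed (Task 2 below), you can sanity-check the pool's total GPU capacity by summing the `nvidia.com/gpu` capacity across nodes. The sketch below does the arithmetic on sample output with hypothetical node names; against a live cluster you would feed it the kubectl query shown in the comment:

```shell
# Real query (works only after the GPU Operator is installed):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name} {.status.capacity.nvidia\.com/gpu}{"\n"}{end}'
# Sample output for the 4-node pool created above (hypothetical node names):
total=$(cat <<'EOF' | awk '{sum += $2} END {print sum+0}'
aks-gpupool-00000001 1
aks-gpupool-00000002 1
aks-gpupool-00000003 1
aks-gpupool-00000004 1
EOF
)
echo "total GPUs: $total"   # 4 nodes x 1 A100 each
```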

### Task 2. Install NVIDIA GPU Operator

Once your AKS cluster is configured with a GPU-enabled node pool, you can set up the NVIDIA GPU Operator. This operator automates the deployment and lifecycle of all NVIDIA software components required to provision GPUs in the Kubernetes cluster, enabling the infrastructure to support GPU workloads like LLM inference and embedding generation.

1. Add the NVIDIA Helm repository:
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --pass-credentials && helm repo update
```

2. Install the GPU Operator:
```bash
helm install --create-namespace --namespace gpu-operator nvidia/gpu-operator --wait --generate-name
```

3. Validate the install (takes about 5 minutes to complete):
```bash
kubectl get pods -A -o wide
```

You should see output similar to the example below. Note that this is not the complete output; there will be additional pods running. The most important thing is to verify that the GPU Operator pods are in a `Running` state.

```
NAMESPACE      NAME                                                        READY   STATUS    RESTARTS   AGE   IP             NODE
gpu-operator   gpu-operator-xxxx-node-feature-discovery-gc-xxxxxxxxx       1/1     Running   0          40s   10.244.0.194   aks-nodepool1-xxxx
gpu-operator   gpu-operator-xxxx-node-feature-discovery-master-xxxxxxxxx   1/1     Running   0          40s   10.244.0.200   aks-nodepool1-xxxx
gpu-operator   gpu-operator-xxxx-node-feature-discovery-worker-xxxxxxxxx   1/1     Running   0          40s   10.244.0.190   aks-nodepool1-xxxx
gpu-operator   gpu-operator-xxxxxxxxxxxxxx                                 1/1     Running   0          40s   10.244.0.128   aks-nodepool1-xxxx
```
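
The `Running`-state check can also be scripted. The sketch below (a hypothetical helper, not from the original doc) counts not-Running pods from sample `kubectl get pods` output; against a live cluster, replace the heredoc with the kubectl line in the comment:

```shell
# Real input: kubectl get pods -n gpu-operator --no-headers
# Columns are NAME READY STATUS RESTARTS AGE, so STATUS is field 3.
not_running=$(cat <<'EOF' | awk '$3 != "Running" {n++} END {print n+0}'
gpu-operator-xxxx-node-feature-discovery-gc-xxxxxxxxx      1/1  Running  0  40s
gpu-operator-xxxx-node-feature-discovery-master-xxxxxxxxx  1/1  Running  0  40s
gpu-operator-xxxx-node-feature-discovery-worker-xxxxxxxxx  1/1  Running  0  40s
gpu-operator-xxxxxxxxxxxxxx                                1/1  Running  0  40s
EOF
)
echo "pods not Running: $not_running"   # prints: pods not Running: 0
```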

For additional guidance on setting up GPU node pools in AKS, refer to the [Microsoft Docs](https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool).

### Task 3. Configure Dynamo

1. Pull the Dynamo repo.
The Dynamo GitHub repository will be leveraged extensively throughout this walkthrough. Pull the repository using:
```bash
# Clone the Dynamo GitHub repo
git clone https://github.com/ai-dynamo/dynamo.git

# Go to the root of the Dynamo repo; the latest commit at the time of writing this document was 22e6c96f715177c776421c90e9415a7dbc4f661a
cd dynamo
```

2. Install Dynamo from published artifacts on NGC (refer: https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/quickstart.md):
```bash
export NAMESPACE=dynamo-cloud
export RELEASE_VERSION=0.3.2

# The linked document says to authenticate using NGC_API_KEY; this is not necessary, since this is an openly available container

# Fetch the CRDs helm chart
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz

# Fetch the platform helm chart
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz

# Step 1: Install Custom Resource Definitions (CRDs)
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz \
  --namespace default \
  --wait \
  --atomic

# Step 2: Install Dynamo Platform
kubectl create namespace ${NAMESPACE}
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE}

# Check pod status:
kubectl get pods -n $NAMESPACE

# Output should be similar to:
NAME                                                              READY   STATUS    RESTARTS   AGE
dynamo-platform-dynamo-operator-controller-manager-549b5d5xf7rv   2/2     Running   0          2m50s
dynamo-platform-etcd-0                                            1/1     Running   0          2m50s
dynamo-platform-nats-0                                            2/2     Running   0          2m50s
dynamo-platform-nats-box-5dbf45c748-kln82                         1/1     Running   0          2m51s
```

There are other ways to install Dynamo; you can find them [here](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/quickstart.md).
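
If you want to wait on the platform pods programmatically rather than eyeballing the output, a small helper can reduce the pod table to a ready/not-ready answer. This is an illustration, not part of Dynamo; `pods_ready` is a hypothetical name, and in a live cluster you would pipe in `kubectl get pods -n $NAMESPACE --no-headers`:

```shell
# Succeeds only when every listed pod is Running or Completed
# (field 3 is STATUS in `kubectl get pods --no-headers` output).
pods_ready() {
  awk '$3 != "Running" && $3 != "Completed" {bad++} END {exit bad+0}'
}

# Sample output from the install above:
if cat <<'EOF' | pods_ready
dynamo-platform-dynamo-operator-controller-manager-549b5d5xf7rv  2/2  Running  0  2m50s
dynamo-platform-etcd-0                                           1/1  Running  0  2m50s
dynamo-platform-nats-0                                           2/2  Running  0  2m50s
dynamo-platform-nats-box-5dbf45c748-kln82                        1/1  Running  0  2m51s
EOF
then
  echo "platform ready"
else
  echo "still starting"
fi
```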

### Task 4. Deploy a model

We're going to deploy Microsoft's Phi-3.5-vision-instruct. You can alter this flow to deploy whatever model you need.

Refer to [dynamo/docs/examples/README.md at main · ai-dynamo/dynamo](https://github.com/ai-dynamo/dynamo/blob/main/docs/examples/README.md).

```bash
# Set your dynamo root directory
cd <root-dynamo-folder>
export PROJECT_ROOT=$(pwd)

# Create a Kubernetes secret containing your sensitive values:
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN} -n ${NAMESPACE}

# Deploy an example (time taken depends on the model; phi3v took ~5 mins)
# You can edit the number of replicas of the encoder/decoder independently here to suit your deployment needs
kubectl apply -f examples/multimodal/deploy/k8s/agg-phi3v.yaml -n ${NAMESPACE}

# Get the status of the deployment
kubectl get dynamoGraphDeployment -n ${NAMESPACE}

# You can use any of the following commands to see logs for debugging
kubectl get pods -n ${NAMESPACE} -o wide
kubectl logs <pod-name> -n ${NAMESPACE}
kubectl exec -it <pod-name> -n ${NAMESPACE} -- nvidia-smi

# Enable port forwarding to be able to hit a curl request
kubectl get svc -n ${NAMESPACE}

# Look for the service that ends in -frontend and use it for the port forward
SERVICE_NAME=$(kubectl get svc -n ${NAMESPACE} -o name | grep frontend | sed 's|.*/||' | sed 's|-frontend||' | head -n1)
kubectl port-forward svc/${SERVICE_NAME}-frontend 8000:8000 -n ${NAMESPACE} &
```
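
The `SERVICE_NAME` pipeline above can be sanity-checked in isolation. Given a hypothetical frontend service name as printed by `kubectl get svc -o name`, the two `sed` passes strip the `service/` prefix and the `-frontend` suffix:

```shell
# Hypothetical service name; a real one comes from `kubectl get svc -o name`
svc="service/phi3v-agg-frontend"
base=$(echo "$svc" | sed 's|.*/||' | sed 's|-frontend||')
echo "$base"                  # prints: phi3v-agg
echo "svc/${base}-frontend"   # port-forward target: svc/phi3v-agg-frontend
```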

### Task 5. Testing

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3.5-vision-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What is in this image?" },
          { "type": "image_url", "image_url": { "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }
        ]
      }
    ],
    "stream": false
  }'

# Output should be something like:
{"id": "a200785a-a4dd-4208-8ced-2d0ea30351a4", "object": "chat.completion", "created": 1753223375, "model": "microsoft/Phi-3.5-vision-instruct", "choices": [{"index": 0, "message": {"role": "assistant", "content": " The image features a wooden boardwalk extending into a grassy area surrounded by a wetland. There are water lilies in the water, and the sky is clear with a few clouds. The sun is shining, casting light on the scene, and there are trees visible in the background."}, "finish_reason": "stop"}]}
```
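
If you only want the assistant's reply rather than the whole JSON body, a quick `sed` extraction works for simple responses. This is an optional sketch, not from the original doc; the sample `response` string is abbreviated, and the pattern assumes the reply contains no embedded double quotes (for robust parsing, use a real JSON parser instead):

```shell
# Abbreviated sample response; in practice, capture the curl output here
response='{"object": "chat.completion", "choices": [{"index": 0, "message": {"role": "assistant", "content": "A wooden boardwalk over a wetland."}, "finish_reason": "stop"}]}'
# Pull out the text between `"content": "` and the next double quote
reply=$(printf '%s' "$response" | sed -n 's/.*"content": "\([^"]*\)".*/\1/p')
echo "$reply"   # prints: A wooden boardwalk over a wetland.
```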

## Clean Up Resources

To clean up any Dynamo-related resources, run the following commands from the shell you launched the deployment from:

```bash
# Delete the deployment
kubectl delete dynamoGraphDeployment <your-dep-name> -n ${NAMESPACE}

# Delete the AKS cluster
az aks delete --name $CLUSTER_NAME --resource-group $RESOURCE_GROUP --yes
```

This will spin down the Dynamo deployment we configured, along with all the resources that were leveraged for it.
