24.3.0 Release
angudadevops committed Mar 25, 2024
1 parent 376979a commit f16d9bf
Showing 10 changed files with 101 additions and 101 deletions.
22 changes: 11 additions & 11 deletions install-guides/DGX-6.0_Server_v10.3.md
@@ -6,11 +6,11 @@ NVIDIA Cloud Native Stack for DGX is focused to provide the Docker based experience
NVIDIA Cloud Native Stack v10.3 includes:
- Ubuntu 22.04 LTS
- Containerd 1.7.7
-- Kubernetes version 1.27.7
+- Kubernetes version 1.27.6
- Helm 3.13.1
-- NVIDIA GPU Driver: 535.104.12
+- NVIDIA GPU Driver: 535.129.03
- NVIDIA Container Toolkit: 1.14.3
-- NVIDIA GPU Operator 23.9.0
+- NVIDIA GPU Operator 23.9.1
- NVIDIA K8S Device Plugin: 0.14.2
- NVIDIA DCGM-Exporter: 3.2.6-3.1.9
- NVIDIA DCGM: 3.2.6-1
@@ -276,7 +276,7 @@ Now execute the below to install kubelet, kubeadm, and kubectl:
sudo apt-get update
```
```
-sudo apt-get install -y -q kubelet=1.27.7-00 kubectl=1.27.7-00 kubeadm=1.27.7-00
+sudo apt-get install -y -q kubelet=1.27.6-00 kubectl=1.27.6-00 kubeadm=1.27.6-00
```
```
sudo apt-mark hold kubelet kubeadm kubectl
```
@@ -329,13 +329,13 @@ UUID=DCD4-535C /boot/efi vfat defaults 0 0
Execute the following command for `Containerd` systems:

```
-sudo kubeadm init --pod-network-cidr=192.168.32.0/22 --cri-socket=/run/containerd/containerd.sock --kubernetes-version="v1.27.7"
+sudo kubeadm init --pod-network-cidr=192.168.32.0/22 --cri-socket=/run/containerd/containerd.sock --kubernetes-version="v1.27.6"
```

Execute the following command for `CRI-O` systems:

```
-sudo kubeadm init --pod-network-cidr=192.168.32.0/22 --cri-socket=unix:/run/crio/crio.sock --kubernetes-version="v1.27.7"
+sudo kubeadm init --pod-network-cidr=192.168.32.0/22 --cri-socket=unix:/run/crio/crio.sock --kubernetes-version="v1.27.6"
```

Output:
@@ -414,7 +414,7 @@ Output:

```
NAME STATUS ROLES AGE VERSION
-#yourhost Ready control-plane,master 10m v1.27.7
+#yourhost Ready control-plane,master 10m v1.27.6
```
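
If `kubectl get nodes` fails with a connection error at this point, the admin kubeconfig is usually not in place yet; a minimal sketch of the copy step, following the path `kubeadm init` prints in its output:

```
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
# own the file so kubectl can read it without sudo
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```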

Since we are using a single-node Kubernetes cluster, the cluster will not schedule pods on the control plane node by default. To schedule pods on the control plane node, we have to remove the taint by executing the following command:
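
For Kubernetes 1.27 this is typically the following (clusters upgraded from older releases may also carry the legacy `master` taint):

```
# allow workloads to schedule on the control plane node
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```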
@@ -475,8 +475,8 @@ Output:

```
NAME STATUS ROLES AGE VERSION
-#yourhost Ready control-plane,master 10m v1.27.7
-#yourhost-worker Ready 10m v1.27.7
+#yourhost Ready control-plane,master 10m v1.27.6
+#yourhost-worker Ready 10m v1.27.6
```

### Installing GPU Operator
@@ -498,7 +498,7 @@ Install GPU Operator:
`NOTE:` Since the NVIDIA Driver and the NVIDIA Container Toolkit are preinstalled on DGX systems, set `driver.enabled` and `toolkit.enabled` to `false` when installing the GPU Operator.

```
-helm install --version 23.9.0 --create-namespace --namespace nvidia-gpu-operator --devel nvidia/gpu-operator --set driver.enabled=false,toolkit.enabled=false --wait --generate-name
+helm install --version 23.9.1 --create-namespace --namespace nvidia-gpu-operator --devel nvidia/gpu-operator --set driver.enabled=false,toolkit.enabled=false --wait --generate-name
```
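
Before the fuller validation below, a quick smoke test is to confirm the operator pods come up and the GPUs are advertised to the scheduler; a sketch:

```
# all operator pods should reach Running or Completed
kubectl get pods -n nvidia-gpu-operator
# each GPU node should now report nvidia.com/gpu as an allocatable resource
kubectl describe nodes | grep -A 2 'nvidia.com/gpu'
```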

#### Validating the State of the GPU Operator:
@@ -715,7 +715,7 @@ Execute the below commands to uninstall the GPU Operator:
```
$ helm ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
-gpu-operator-1606173805 nvidia-gpu-operator 1 2023-04-14 20:23:28.063421701 +0000 UTC deployed gpu-operator-23.9.0 v23.9.0
+gpu-operator-1606173805 nvidia-gpu-operator 1 2023-04-14 20:23:28.063421701 +0000 UTC deployed gpu-operator-23.9.1 v23.9.1
$ helm del gpu-operator-1606173805 -n nvidia-gpu-operator
```
4 changes: 2 additions & 2 deletions install-guides/Jetson_Xavier_v10.3.md
@@ -7,7 +7,7 @@ This document describes how to setup the NVIDIA Cloud Native Stack collection on
The final environment will include:

- JetPack 5.1
-- Kubernetes version 1.27.7
+- Kubernetes version 1.27.6
- Helm 3.13.1
- Containerd 1.7.7

@@ -217,7 +217,7 @@ Now execute the below to install kubelet, kubeadm, and kubectl:

```
sudo apt-get update
-sudo apt-get install -y -q kubelet=1.27.7-00 kubectl=1.27.7-00 kubeadm=1.27.7-00
+sudo apt-get install -y -q kubelet=1.27.6-00 kubectl=1.27.6-00 kubeadm=1.27.6-00
sudo apt-mark hold kubelet kubeadm kubectl
```
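
To confirm the pin took effect and the packages are held against upgrades, something like the following can be run afterwards:

```
# both should report 1.27.6
kubeadm version -o short
kubectl version --client
# all three packages should be listed as held
apt-mark showhold
```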

4 changes: 2 additions & 2 deletions install-guides/Jetson_Xavier_v11.0.md
@@ -7,7 +7,7 @@ This document describes how to setup the NVIDIA Cloud Native Stack collection on
The final environment will include:

- JetPack 5.1
-- Kubernetes version 1.28.3
+- Kubernetes version 1.28.2
- Helm 3.13.1
- Containerd 1.7.7

@@ -217,7 +217,7 @@ Now execute the below to install kubelet, kubeadm, and kubectl:

```
sudo apt-get update
-sudo apt-get install -y -q kubelet=1.28.3-00 kubectl=1.28.3-00 kubeadm=1.28.3-00
+sudo apt-get install -y -q kubelet=1.28.2-00 kubectl=1.28.2-00 kubeadm=1.28.2-00
sudo apt-mark hold kubelet kubeadm kubectl
```

34 changes: 17 additions & 17 deletions install-guides/RHEL-8-7_Server_x86-arm64_v10.3.md
@@ -3,13 +3,13 @@

This document describes how to setup the NVIDIA Cloud Native Stack collection on a single or multiple NVIDIA Certified Systems. NVIDIA Cloud Native Stack can be configured to create a single node Kubernetes cluster or to create/add additional worker nodes to join an existing cluster.

-NVIDIA Cloud Native Stack v11.0 includes:
+NVIDIA Cloud Native Stack v10.3 includes:
- RHEL 8.7/RHEL 8.8
- Containerd 1.7.7
-- Kubernetes version 1.27.7
+- Kubernetes version 1.27.6
- Helm 3.13.1
-- NVIDIA GPU Operator 23.9.0
-- NVIDIA GPU Driver: 535.104.12
+- NVIDIA GPU Operator 23.9.1
+- NVIDIA GPU Driver: 535.129.03
- NVIDIA Container Toolkit: 1.14.3
- NVIDIA K8S Device Plugin: 0.14.2
- NVIDIA DCGM-Exporter: 3.2.6-3.1.9
@@ -270,7 +270,7 @@ EOF
Now execute the below to install kubelet, kubeadm, and kubectl:

```
-sudo dnf install -y kubelet-1.27.7 kubeadm-1.27.7 kubectl-1.27.7
+sudo dnf install -y kubelet-1.27.6 kubeadm-1.27.6 kubectl-1.27.6
```
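
Unlike the Ubuntu guides, there is no `apt-mark hold` step here; for the same protection against accidental upgrades, dnf's versionlock plugin can pin the packages. A sketch, assuming the RHEL 8 plugin package name:

```
# install the versionlock plugin, then pin the Kubernetes packages
sudo dnf install -y python3-dnf-plugin-versionlock
sudo dnf versionlock add kubelet-1.27.6 kubeadm-1.27.6 kubectl-1.27.6
```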

Create a kubelet default with your container runtime:
@@ -319,13 +319,13 @@ UUID=DCD4-535C /boot/efi vfat defaults 0 0
Execute the following command for `Containerd` systems:

```
-sudo kubeadm init --pod-network-cidr=192.168.32.0/22 --cri-socket=/run/containerd/containerd.sock --kubernetes-version="v1.27.7"
+sudo kubeadm init --pod-network-cidr=192.168.32.0/22 --cri-socket=/run/containerd/containerd.sock --kubernetes-version="v1.27.6"
```

Execute the following command for `CRI-O` systems:

```
-sudo kubeadm init --pod-network-cidr=192.168.32.0/22 --cri-socket=unix:/run/crio/crio.sock --kubernetes-version="v1.27.7"
+sudo kubeadm init --pod-network-cidr=192.168.32.0/22 --cri-socket=unix:/run/crio/crio.sock --kubernetes-version="v1.27.6"
```

Output:
@@ -410,7 +410,7 @@ Output:

```
NAME STATUS ROLES AGE VERSION
-#yourhost Ready control-plane 10m v1.27.7
+#yourhost Ready control-plane 10m v1.27.6
```

Since we are using a single-node Kubernetes cluster, the cluster will not schedule pods on the control plane node by default. To schedule pods on the control plane node, we have to remove the taint by executing the following command:
@@ -497,8 +497,8 @@ Output:

```
NAME STATUS ROLES AGE VERSION
-#yourhost Ready control-plane 10m v1.27.7
-#yourhost-worker Ready 10m v1.27.7
+#yourhost Ready control-plane 10m v1.27.6
+#yourhost-worker Ready 10m v1.27.6
```

### Adding an Additional Node to NVIDIA Cloud Native Stack
@@ -535,8 +535,8 @@ Output:

```
NAME STATUS ROLES AGE VERSION
-#yourhost Ready control-plane 10m v1.27.7
-#yourhost-worker Ready 10m v1.27.7
+#yourhost Ready control-plane 10m v1.27.6
+#yourhost-worker Ready 10m v1.27.6
```

### Installing NVIDIA Network Operator
@@ -641,7 +641,7 @@ Install GPU Operator:
`NOTE:` If you installed Network Operator, please skip the below command and follow the [GPU Operator with RDMA](#GPU-Operator-with-RDMA)

```
-helm install --version 23.9.0 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --wait --generate-name
+helm install --version 23.9.1 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --wait --generate-name
```

#### GPU Operator with RDMA
@@ -652,15 +652,15 @@ Install GPU Operator:
After Network Operator installation is completed, execute the below command to install the GPU Operator to load nv_peer_mem modules:

```
-helm install --version 23.9.0 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --set driver.rdma.enabled=true --wait --generate-name
+helm install --version 23.9.1 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --set driver.rdma.enabled=true --wait --generate-name
```

#### GPU Operator with Host MOFED Driver and RDMA

If the host already has the MOFED driver installed without the Network Operator, execute the below command to install the GPU Operator and load the nv_peer_mem module:

```
-helm install --version 23.9.0 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --set driver.rdma.enabled=true,driver.rdma.useHostMofed=true --wait --generate-name
+helm install --version 23.9.1 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --set driver.rdma.enabled=true,driver.rdma.useHostMofed=true --wait --generate-name
```
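
Once the driver pod is ready, one way to confirm the peer-memory module actually loaded on the host (newer driver branches ship it as `nvidia_peermem`):

```
# either module name indicates GPUDirect RDMA support is active
lsmod | grep -E 'nv_peer_mem|nvidia_peermem'
```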

@@ -669,7 +669,7 @@ If the host already has the MOFED driver installed without the Network Operator, execute
Execute the below command to enable the GPU Direct Storage driver in the GPU Operator:

```
-helm install --version 23.9.0 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --set gds.enabled=true
+helm install --version 23.9.1 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --set gds.enabled=true
```
For more information, refer to [GPU Direct Storage](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-operator-rdma.html).
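
A reasonable smoke test for GDS is to check that the `nvidia_fs` kernel module loaded and that the operator rolled out its GDS pods; the pod name filter below is an assumption:

```
# the GPU Direct Storage driver loads the nvidia_fs module
lsmod | grep nvidia_fs
# look for the gds driver pods deployed by the operator
kubectl get pods -n nvidia-gpu-operator | grep -i gds
```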

@@ -1160,7 +1160,7 @@ Execute the below commands to uninstall the GPU Operator:
```
$ helm ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
-gpu-operator-1606173805 nvidia-gpu-operator 1 2023-03-14 20:23:28.063421701 +0000 UTC deployed gpu-operator-23.9.0 v23.3.2
+gpu-operator-1606173805 nvidia-gpu-operator 1 2023-03-14 20:23:28.063421701 +0000 UTC deployed gpu-operator-23.9.1 v23.3.2
$ helm del gpu-operator-1606173805 -n nvidia-gpu-operator
```
4 changes: 2 additions & 2 deletions install-guides/RHEL-8-7_Server_x86-arm64_v10.4.md
@@ -1,9 +1,9 @@
-<h1>NVIDIA Cloud Native Stack v11.0 - Install Guide for RHEL Server</h1>
+<h1>NVIDIA Cloud Native Stack v10.4 - Install Guide for RHEL Server</h1>
<h2>Introduction</h2>

This document describes how to setup the NVIDIA Cloud Native Stack collection on a single or multiple NVIDIA Certified Systems. NVIDIA Cloud Native Stack can be configured to create a single node Kubernetes cluster or to create/add additional worker nodes to join an existing cluster.

-NVIDIA Cloud Native Stack v11.0 includes:
+NVIDIA Cloud Native Stack v10.4 includes:
- RHEL 8.7/RHEL 8.8
- Containerd 1.7.13
- Kubernetes version 1.27.10
36 changes: 18 additions & 18 deletions install-guides/RHEL-8-7_Server_x86-arm64_v11.0.md
@@ -6,9 +6,9 @@ This document describes how to setup the NVIDIA Cloud Native Stack collection on
NVIDIA Cloud Native Stack v11.0 includes:
- RHEL 8.7/RHEL 8.8
- Containerd 1.7.7
-- Kubernetes version 1.28.3
+- Kubernetes version 1.28.2
- Helm 3.13.1
-- NVIDIA GPU Operator 23.9.0
+- NVIDIA GPU Operator 23.9.1
- NVIDIA GPU Driver: 535.104.12
- NVIDIA Container Toolkit: 1.14.3
- NVIDIA K8S Device Plugin: 0.14.2
@@ -22,7 +22,7 @@ NVIDIA Cloud Native Stack v11.0 includes:
- NVIDIA GDS Driver: 2.16.1
- NVIDIA Kata Manager for Kubernetes: 0.1.2
- NVIDIA Confidential Computing Manager for Kubernetes: 0.1.1
-- NVIDIA Network Operator 23.7.0
+- NVIDIA Network Operator 23.10.0
- Mellanox MOFED Driver 23.10-0.4.1.0
- Mellanox NV Peer Memory Driver 1.1-0
- RDMA Shared Device Plugin 1.3.2
@@ -270,7 +270,7 @@ EOF
Now execute the below to install kubelet, kubeadm, and kubectl:

```
-sudo dnf install -y kubelet-1.28.3 kubeadm-1.28.3 kubectl-1.28.3
+sudo dnf install -y kubelet-1.28.2 kubeadm-1.28.2 kubectl-1.28.2
```

Create a kubelet default with your container runtime:
@@ -319,13 +319,13 @@ UUID=DCD4-535C /boot/efi vfat defaults 0 0
Execute the following command for `Containerd` systems:

```
-sudo kubeadm init --pod-network-cidr=192.168.32.0/22 --cri-socket=/run/containerd/containerd.sock --kubernetes-version="v1.28.3"
+sudo kubeadm init --pod-network-cidr=192.168.32.0/22 --cri-socket=/run/containerd/containerd.sock --kubernetes-version="v1.28.2"
```

Execute the following command for `CRI-O` systems:

```
-sudo kubeadm init --pod-network-cidr=192.168.32.0/22 --cri-socket=unix:/run/crio/crio.sock --kubernetes-version="v1.28.3"
+sudo kubeadm init --pod-network-cidr=192.168.32.0/22 --cri-socket=unix:/run/crio/crio.sock --kubernetes-version="v1.28.2"
```

Output:
@@ -410,7 +410,7 @@ Output:

```
NAME STATUS ROLES AGE VERSION
-#yourhost Ready control-plane 10m v1.28.3
+#yourhost Ready control-plane 10m v1.28.2
```
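
If `kubeadm init` instead fails with a runtime connection error, confirming that the CRI socket answers helps isolate the problem; a sketch using `crictl`, assuming it is installed alongside the runtime:

```
# containerd systems
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock info
# CRI-O systems
sudo crictl --runtime-endpoint unix:///run/crio/crio.sock info
```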

Since we are using a single-node Kubernetes cluster, the cluster will not schedule pods on the control plane node by default. To schedule pods on the control plane node, we have to remove the taint by executing the following command:
@@ -497,8 +497,8 @@ Output:

```
NAME STATUS ROLES AGE VERSION
-#yourhost Ready control-plane 10m v1.28.3
-#yourhost-worker Ready 10m v1.28.3
+#yourhost Ready control-plane 10m v1.28.2
+#yourhost-worker Ready 10m v1.28.2
```

### Adding an Additional Node to NVIDIA Cloud Native Stack
@@ -535,8 +535,8 @@ Output:

```
NAME STATUS ROLES AGE VERSION
-#yourhost Ready control-plane 10m v1.28.3
-#yourhost-worker Ready 10m v1.28.3
+#yourhost Ready control-plane 10m v1.28.2
+#yourhost-worker Ready 10m v1.28.2
```
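
Rather than re-running `kubectl get nodes` until both rows show Ready, the wait can be scripted; a small convenience sketch:

```
# block until every node reports Ready, or fail after five minutes
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```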

### Installing NVIDIA Network Operator
@@ -597,7 +597,7 @@ Update the Helm repo:
Install Network Operator:
```
kubectl label nodes --all node-role.kubernetes.io/master- --overwrite
-helm install --version 23.5.0 -f ./network-operator-values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator
+helm install --version 23.10.0 -f ./network-operator-values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator
```
#### Validating the State of the Network Operator

@@ -641,7 +641,7 @@ Install GPU Operator:
`NOTE:` If you installed Network Operator, please skip the below command and follow the [GPU Operator with RDMA](#GPU-Operator-with-RDMA)

```
-helm install --version 23.9.0 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --wait --generate-name
+helm install --version 23.9.1 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --wait --generate-name
```

#### GPU Operator with RDMA
@@ -652,15 +652,15 @@ Install GPU Operator:
After Network Operator installation is completed, execute the below command to install the GPU Operator to load nv_peer_mem modules:

```
-helm install --version 23.9.0 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --set driver.rdma.enabled=true --wait --generate-name
+helm install --version 23.9.1 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --set driver.rdma.enabled=true --wait --generate-name
```

#### GPU Operator with Host MOFED Driver and RDMA

If the host already has the MOFED driver installed without the Network Operator, execute the below command to install the GPU Operator and load the nv_peer_mem module:

```
-helm install --version 23.9.0 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --set driver.rdma.enabled=true,driver.rdma.useHostMofed=true --wait --generate-name
+helm install --version 23.9.1 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --set driver.rdma.enabled=true,driver.rdma.useHostMofed=true --wait --generate-name
```

@@ -669,7 +669,7 @@ If the host already has the MOFED driver installed without the Network Operator, execute
Execute the below command to enable the GPU Direct Storage driver in the GPU Operator:

```
-helm install --version 23.9.0 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --set gds.enabled=true
+helm install --version 23.9.1 --create-namespace --namespace nvidia-gpu-operator nvidia/gpu-operator --set gds.enabled=true
```
For more information, refer to [GPU Direct Storage](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-operator-rdma.html).

@@ -1160,7 +1160,7 @@ Execute the below commands to uninstall the GPU Operator:
```
$ helm ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
-gpu-operator-1606173805 nvidia-gpu-operator 1 2023-03-14 20:23:28.063421701 +0000 UTC deployed gpu-operator-23.9.0 v23.3.2
+gpu-operator-1606173805 nvidia-gpu-operator 1 2023-03-14 20:23:28.063421701 +0000 UTC deployed gpu-operator-23.9.1 v23.3.2
$ helm del gpu-operator-1606173805 -n nvidia-gpu-operator
```
@@ -1173,7 +1173,7 @@ Execute the below commands to uninstall the Network Operator:
```
$ helm ls -n network-operator
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
-network-operator network-operator 1 2023-04-03 17:09:04.665593336 +0000 UTC deployed network-operator-23.5.0 v23.5.0
+network-operator network-operator 1 2023-04-03 17:09:04.665593336 +0000 UTC deployed network-operator-23.10.0 v23.10.0
$ helm del network-operator -n network-operator
```
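
Since the chart was installed with `--create-namespace`, the namespace itself survives the delete; if a fully clean teardown is wanted, it can be removed as well:

```
# optional: drop the now-empty namespace left behind by the chart
kubectl delete namespace network-operator
```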