Commit 744d2a1: Update MIG docs
supertetelman committed Feb 15, 2022 · 1 parent fefd906

Showing 1 changed file with 14 additions and 15 deletions: docs/k8s-cluster/nvidia-mig.md

Multi-Instance GPU (MIG) is a feature introduced in the NVIDIA A100 GPUs that allows a single GPU to be partitioned into several smaller GPUs. For more information see the [NVIDIA MIG page](https://www.nvidia.com/en-us/technologies/multi-instance-gpu/).

There are two methods that can be used to administer MIG. This guide details the K8s native method that relies on the NVIDIA MIG Manager service included with the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) and installed by default if `deepops_gpu_operator_enabled` is set to `true`. The alternative method is a bare-metal solution using the mig-parted systemd service, which can be installed using the [nvidia-mig.yml](../../playbooks/nvidia-software/nvidia-mig.yml) playbook and configured following the [official documentation](https://github.com/NVIDIA/mig-parted).

Supporting MIG requires several administrative steps and open source projects.

*Projects, included in GPU Operator v1.9.0+:*
* [GPU Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
* [GPU Feature Discovery](https://github.com/NVIDIA/gpu-feature-discovery)
* [NVIDIA K8s MIG Manager](https://github.com/NVIDIA/mig-parted/tree/master/deployments/gpu-operator)

*Admin Steps:*
* Enable MIG
* Configure MIG mode ('single' or 'mixed')
* Configure MIG (Kubernetes configmap)
* Update Application/YAML to support MIG


## Enabling MIG

The K8s MIG Manager will handle enabling and disabling MIG on all devices, as necessary.

There are some caveats depending on the state of your cluster, and a node reboot may be necessary.
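To confirm the resulting MIG mode, the standard `nvidia-smi` query fields can be used on the node, and the MIG Manager's node labels can be inspected from the cluster. A minimal sketch (the node name `gpu01` is illustrative):

```sh
# On the GPU node: report current and pending MIG mode for each GPU
nvidia-smi --query-gpu=index,name,mig.mode.current,mig.mode.pending --format=csv

# From the cluster: list the nvidia.com/mig.* labels the MIG Manager reconciles
kubectl get node gpu01 --show-labels | tr ',' '\n' | grep 'nvidia.com/mig'
```

If `mig.mode.pending` differs from `mig.mode.current`, a GPU reset (or node reboot) is still outstanding.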


## Configuring MIG Mode in Kubernetes

By default, MIG support for Kubernetes is enabled in DeepOps, and the default MIG strategy is `mixed`. This can be controlled by the `k8s_gpu_mig_strategy` variable in `config/group_vars/k8s-cluster.yml`. The `mixed` strategy is recommended for new deployments. For more information about strategies see the GPU Device Plugin [README](https://github.com/NVIDIA/k8s-device-plugin#deployment-via-helm).
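For example, to pin the strategy explicitly (a minimal sketch; the variable name comes from this guide, the chosen value is illustrative):

```yaml
# config/group_vars/k8s-cluster.yml
# Device plugin MIG strategies are "none", "single", or "mixed";
# "mixed" exposes each MIG profile as its own extended resource
# (e.g. nvidia.com/mig-1g.5gb) alongside full GPUs.
k8s_gpu_mig_strategy: mixed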

If DeepOps is being used to manage a Kubernetes cluster that was deployed using another method, MIG can be enabled by running:

```sh
ansible-playbook playbooks/k8s-cluster/nvidia-gpu-operator.yml
```
> Note: the same command can be used to apply a new strategy.

## Configuring MIG Devices

MIG devices are configured on a per-node or cluster-wide basis depending on the MIG configmap and the node labels applied to each node. When in production, it is recommended to do a rolling upgrade node-by-node following the below steps on each GPU node.
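The set of named MIG layouts lives in a configmap in the mig-parted format consumed by the MIG Manager. A hedged sketch of what such a configmap can look like (the configmap name and `gpu-operator` namespace are assumptions; adapt them to your deployment):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-parted-config   # illustrative name
  namespace: gpu-operator          # assumed GPU Operator namespace
data:
  config.yaml: |
    version: v1
    mig-configs:
      # MIG disabled on every GPU
      all-disabled:
        - devices: all
          mig-enabled: false
      # Seven 1g.5gb slices per A100
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7
```

A node then selects one of these named configurations via its `nvidia.com/mig.config` label, as shown in the steps below.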

Configuration and reconfiguration require that you:

1. Taint your node
2. Evacuate all GPU pods
3. Configure MIG
4. Restart the GPU Device Plugin Pod
5. Wait for GPU Feature Discovery to re-label the node
6. Remove the taint

```sh
kubectl taint node gpu01 mig=maintenance:NoSchedule
kubectl taint node gpu01 mig=maintenance:NoExecute # Optional: without NoExecute, running Deep Learning jobs and notebooks are left to "time out" on their own

<Manual configuration steps>

kubectl exec <GPU Device Plugin Pod on gpu01> -- kill -SIGTERM 1
kubectl label node gpu01 nvidia.com/mig.config=all-1g.5gb
sleep 60 # 60 seconds is the default polling period of GPU Feature Discovery
kubectl describe node gpu01 # Manual verification of MIG resources
kubectl taint node gpu01 mig=maintenance:NoSchedule-
kubectl taint node gpu01 mig=maintenance:NoExecute-
```
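Once the node is relabeled, applications request a specific MIG profile as an extended resource (this is the "Update Application/YAML to support MIG" step under the `mixed` strategy). A minimal sketch; the pod name, image, and tag are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:11.4.2-base-ubuntu20.04  # illustrative image
      command: ["nvidia-smi", "-L"]  # lists the single MIG device visible to the pod
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one 1g.5gb slice, matching the all-1g.5gb config
```

Under the `single` strategy the resource name would instead remain `nvidia.com/gpu`, with every GPU on the node partitioned into identical MIG devices.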