Should your Akash Provider encounter issues during installation or with post-install hosting of GPU resources, follow the troubleshooting steps in this guide to isolate the issue.
NOTE - these steps should be conducted on each Akash Provider/Kubernetes worker node that hosts GPU resources unless stated otherwise within the step
- Basic GPU Resource Verifications
- Examine Linux Kernel Logs for GPU Resource Errors and Mismatches
- Ensure Correct Version/Presence of NVIDIA Device Plugin
- CUDA Drivers Fabric Manager
- Conduct the steps in this section for basic verification and to ensure the host has access to GPU resources
apt update && apt -y install python3-venv
python3 -m venv /venv
source /venv/bin/activate
pip install torch numpy
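As an optional sanity check before running the GPU verifications, confirm that torch installed and imports correctly by printing its version (the exact version number will vary by install date):
python3 -c "import torch; print(torch.__version__)"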
NOTE - example verification steps were conducted on a host with a single NVIDIA T4 GPU resource. Your output will be different based on the type and number of GPU resources on the host.
nvidia-smi -L
# nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-faa48437-7587-4bc1-c772-8bd099dba462)
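For a fuller view than the -L listing (driver version, maximum supported CUDA version, GPU utilization, and memory use), optionally run nvidia-smi with no arguments:
nvidia-smi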
python3 -c "import torch;print(torch.version.cuda)"
# python3 -c "import torch;print(torch.version.cuda)"
11.7
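Note that torch.version.cuda reports the CUDA version PyTorch was built against, not the version of your installed driver. The driver's maximum supported CUDA version appears in the header of plain nvidia-smi output and should be greater than or equal to the Torch build version. As an illustrative check (the number of header lines may vary by nvidia-smi version):
nvidia-smi | head -n 4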
python3 -c "import torch; print(torch.cuda.is_available())"
# python3 -c "import torch; print(torch.cuda.is_available())"
True
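For a consolidated check, the following snippet (a sketch assuming the venv and torch install from earlier in this section) verifies CUDA availability and enumerates each GPU visible to torch:
python3 - <<'EOF'
import torch
# Confirm CUDA is usable, then list every GPU torch can see
print("cuda available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
EOF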
dmesg -T | grep -Ei 'nvidia|nvml|cuda|mismatch'
NOTE - example output is from a healthy host which loaded NVIDIA drivers successfully and has no version mismatches. Your output may look very different if there are issues within the host.
# dmesg -T | grep -Ei 'nvidia|nvml|cuda|mismatch'
[Thu Sep 28 19:29:02 2023] nvidia: loading out-of-tree module taints kernel.
[Thu Sep 28 19:29:02 2023] nvidia: module license 'NVIDIA' taints kernel.
[Thu Sep 28 19:29:02 2023] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[Thu Sep 28 19:29:02 2023] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.104.05 Sat Aug 19 01:15:15 UTC 2023
[Thu Sep 28 19:29:02 2023] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.104.05 Sat Aug 19 00:59:57 UTC 2023
[Thu Sep 28 19:29:02 2023] [drm] [nvidia-drm] [GPU ID 0x00000004] Loading driver
[Thu Sep 28 19:29:03 2023] audit: type=1400 audit(1695929343.571:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=300 comm="apparmor_parser"
[Thu Sep 28 19:29:03 2023] audit: type=1400 audit(1695929343.571:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=300 comm="apparmor_parser"
[Thu Sep 28 19:29:04 2023] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:04.0 on minor 0
[Thu Sep 28 19:29:05 2023] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[Thu Sep 28 19:29:05 2023] nvidia-uvm: Loaded the UVM driver, major device number 235.
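A common cause of the mismatch errors the grep above looks for is a loaded kernel module that no longer matches the installed userspace libraries (e.g. after a driver upgrade without a reboot). As a quick check, the driver versions reported by the following two commands should be identical; if the userspace libraries and kernel module disagree, nvidia-smi will typically fail with a "Driver/library version mismatch" error:
cat /proc/driver/nvidia/version
nvidia-smi --query-gpu=driver_version --format=csv,noheader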
NOTE - conduct this verification step on the Kubernetes control plane node on which Helm was installed during your Akash Provider build
helm -n nvidia-device-plugin list
# helm -n nvidia-device-plugin list
NAME    NAMESPACE               REVISION    UPDATED                                     STATUS      CHART                           APP VERSION
nvdp    nvidia-device-plugin    1           2023-09-23 14:30:34.18183027 +0200 CEST    deployed    nvidia-device-plugin-0.14.1     0.14.1
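To confirm the device plugin is running and advertising GPU resources to Kubernetes, you can additionally inspect the plugin pods and node capacity from the control plane node (illustrative commands; pod names and GPU counts will differ per cluster):
kubectl -n nvidia-device-plugin get pods
kubectl describe nodes | grep 'nvidia.com/gpu'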
- In some circumstances the CUDA Drivers Fabric Manager needs to be installed on worker nodes hosting GPU resources (e.g. non-PCIe GPU configurations such as those using SXM form factors)
- If the output of the torch.cuda.is_available() command (covered in a prior section of this doc) is an error condition, consider installing the CUDA Drivers Fabric Manager to resolve the issue. The following error message is frequently encountered when this issue exists:
torch.cuda.is_available() function: Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
- Further details on the CUDA Drivers Fabric Manager are available in NVIDIA's Fabric Manager documentation
NOTE - replace 525 in the following command with the NVIDIA driver version used on your host
apt-get install cuda-drivers-fabricmanager-525
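To determine the driver version to substitute, and to start the Fabric Manager after installation, run the following (the systemd unit name below follows NVIDIA's packaging conventions; verify it on your host). The service should report active before re-running the torch.cuda.is_available() check:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
systemctl status nvidia-fabricmanager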