pytorch could not detect Nvidia driver on bottlerocket #3916
If it needs to be reported to https://github.com/awslabs/amazon-eks-ami/issues, then please let me know.
Hello @chulkilee, thanks for cutting this issue! I don't believe this would be related to GSP on g4dn.xlarge instances, but you could follow #3817 (comment) just to confirm that isn't the problem. The difference in the output between Bottlerocket and Amazon Linux for the module config is:
EnableGpuFirmware is the GSP change, and ModifyDeviceFiles will disable dynamic device file management when set to 0. What is strange is that pytorch reports that CUDA is not available when it really should be, since the other things you called out are there. Can you also confirm what your pod spec looks like, just to make sure all the right settings are being passed from that perspective?
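A minimal sketch (not from the original thread) for pulling just the two module parameters named above out of /proc/driver/nvidia/params, so the values can be compared between the two AMIs:

```python
# Print only the NVIDIA module parameters discussed above.
# Assumes the standard procfs path exposed by the NVIDIA kernel driver.
from pathlib import Path

INTERESTING = {"EnableGpuFirmware", "ModifyDeviceFiles"}

for line in Path("/proc/driver/nvidia/params").read_text().splitlines():
    name = line.split(":", 1)[0].strip()
    if name in INTERESTING:
        print(line.strip())
```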
Hello @chulkilee, I just tried using an image from NVIDIA to confirm that pytorch can see the devices on a g4dn.xlarge node with the latest Bottlerocket, and I don't get the same issue:
Can you confirm which base container you are using and which CUDA version is included? I'm not able to replicate with the image I got.
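For anyone reproducing this, a quick check to run inside the container (hedged: it assumes torch is installed in the image) that mirrors the relevant parts of what torch.utils.collect_env reports:

```python
# Quick CUDA visibility check inside the container.
import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
```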
@chulkilee, do your container images contain the following environment variables?
If not, I would suggest adding them.
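The exact variables referenced in the comment above weren't preserved in this thread; NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES are the ones the NVIDIA container toolkit commonly consumes, so a small sketch to verify them from inside the container might look like:

```python
# Print the NVIDIA-related environment variables inside the container.
# The variable names are an assumption; the ones the maintainer listed
# were not preserved in this thread.
import os

for var in ("NVIDIA_VISIBLE_DEVICES", "NVIDIA_DRIVER_CAPABILITIES"):
    print(f"{var}={os.environ.get(var, '<unset>')}")
```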
Those were set. I'm using

Update: even after I unset them, I also tested the same image with
@chulkilee, are you requesting GPUs in your pod specs? Or do you need to oversubscribe your GPUs, and thus you use
I tested the GPU resource request. According to the Kubernetes documentation, GPUs can be utilized by requesting the custom GPU resource. However, the documentation does not clarify the expected behavior when the GPU resource is not specified, even if GPUs are available on the node.
It's important to note that this behavior differs from what is observed when using the Amazon Linux image.

Pod yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g5g.xlarge
  containers:
  - name: shell
    image: nvcr.io/nvidia/pytorch:24.03-py3
    command: [sleep, "3600"]
  - name: shell2
    image: nvcr.io/nvidia/pytorch:24.03-py3
    command: [sleep, "3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
```
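As a follow-up check (not from the original thread), something like the following could be run via kubectl exec in both the shell and shell2 containers above; once allocation is enforced, only the container that requests nvidia.com/gpu should report a device:

```python
# Run in each container of the pod above to compare GPU visibility.
import torch

print("cuda available:", torch.cuda.is_available())
print("visible device count:", torch.cuda.device_count())
```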
I believe this is a result of my PR #3718, which enables correct allocation of GPUs.
This is correct; ideally AL2 images should also enable the following config/env on the container toolkit

to enable correct allocation and isolation. Someone really needs to push for this; alas, it might face some backlash because it changes behavior in a backwards-incompatible way.
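The specific toolkit settings referenced above are not quoted in this thread; as a rough sketch, dumping the container toolkit config (assuming the conventional /etc/nvidia-container-runtime/config.toml path) lets the allocation- and isolation-related keys be compared between AMIs:

```python
# Dump the NVIDIA container runtime config so its settings can be compared
# between the Bottlerocket and AL2 hosts. The path is the conventional one
# for the NVIDIA container toolkit and may differ per distro.
from pathlib import Path

cfg = Path("/etc/nvidia-container-runtime/config.toml")
print(cfg.read_text() if cfg.exists() else f"{cfg} not found")
```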
Thanks for the confirmation! I agree that allocating GPUs only when requested is better behavior, but changing AL is not easy; I hope it happens on a major AL version bump. I'm closing this, as it is not an issue on the Bottlerocket side.
FWIW @chiragjn, there are users that relied on the
Sorry, I don't have all the details, but I'd like to report that I had issues using pytorch on the Bottlerocket image for EKS.
When I switched to the AL2 GPU AMI, it worked without an issue.
AMI
In both AMIs the nvidia kernel module seems to be loaded, but with different params.
`cat /proc/driver/nvidia/version`
BOTTLEROCKET_x86_64_NVIDIA:
AL2_x86_64_GPU:
`cat /proc/driver/nvidia/params`
BOTTLEROCKET_x86_64_NVIDIA:
AL2_x86_64_GPU:
However, pytorch failed to detect the driver on Bottlerocket.
Only in BOTTLEROCKET_x86_64_NVIDIA:
`python -m torch.utils.collect_env`
BOTTLEROCKET_x86_64_NVIDIA:
AL2_x86_64_GPU:
Python packages used:
Could it be related to awslabs/amazon-eks-ami#1523?