pytorch could not detect Nvidia driver on bottlerocket #3916

Closed
chulkilee opened this issue Apr 25, 2024 · 10 comments
Labels
area/accelerated-computing Issues related to GPUs/ASICs type/bug Something isn't working

Comments

@chulkilee

chulkilee commented Apr 25, 2024

Sorry, I don't have all the details, but I'd like to report that I had issues using PyTorch on the Bottlerocket image for EKS.

When I switched to the AL2 GPU AMI, it worked without an issue.

  • EKS 1.29
  • node group with the default launch template (so the latest AMI image)
  • instance type: g4dn.xlarge
  • The EKS cluster doesn't use the NVIDIA device driver / GPU operator

AMI

  • BOTTLEROCKET_X86_64_NVIDIA: ami-0d31d8d1285f91827 - bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.19.4-4f0a078e
  • AL2_x86_64_GPU: ami-093bb52bc444e09ba - amazon-eks-gpu-node-1.29-v20240415

In both AMIs the NVIDIA kernel module seems to be loaded, but with different params.

cat /proc/driver/nvidia/version

BOTTLEROCKET_x86_64_NVIDIA:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.161.07  Sat Feb 17 22:55:48 UTC 2024
GCC version:  gcc version 11.3.0 (Buildroot 2022.11.1)

AL2_x86_64_GPU:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.161.08  Tue Mar  5 22:42:15 UTC 2024
GCC version:  gcc version 10.5.0 20230707 (Red Hat 10.5.0-1) (GCC)

cat /proc/driver/nvidia/params

BOTTLEROCKET_x86_64_NVIDIA:

ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 0
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

AL2_x86_64_GPU:

ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 0
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 0
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

However, PyTorch failed to detect the driver on Bottlerocket.

Only in BOTTLEROCKET_x86_64_NVIDIA:

python -c "import torch; torch.cuda.current_device()"

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
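Before involving PyTorch at all, a quick check from inside the container can show whether the driver's device nodes and user-space library are visible to the process. The following is a minimal sketch using only the Python standard library. It assumes that the container runtime is what exposes /dev/nvidia* and libcuda.so.1 to the container; the function name and its defaults are hypothetical, chosen for illustration.

```python
import ctypes
import glob


def probe_nvidia_visibility(dev_glob="/dev/nvidia*", lib_name="libcuda.so.1"):
    """Report which NVIDIA device nodes and driver library this process can see."""
    # Device nodes such as /dev/nvidia0 and /dev/nvidiactl, if any are present.
    device_nodes = sorted(glob.glob(dev_glob))
    try:
        # Loading the driver's user-space library fails with OSError if it is absent.
        ctypes.CDLL(lib_name)
        libcuda_loadable = True
    except OSError:
        libcuda_loadable = False
    return {"device_nodes": device_nodes, "libcuda_loadable": libcuda_loadable}


if __name__ == "__main__":
    print(probe_nvidia_visibility())
```

An empty device list together with an unloadable library would suggest that nothing was injected into the container, rather than a problem with the kernel module on the host.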

python -m torch.utils.collect_env:

BOTTLEROCKET_x86_64_NVIDIA:

PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.82-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

AL2_x86_64_GPU:

PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.213-201.855.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Used python packages

[pip3] numpy==1.26.3
[pip3] pytorch-lightning==2.1.3
[pip3] pytorch-metric-learning==2.4.1
[pip3] torch==2.0.1+cu117
[pip3] torch-audiomentations==0.11.0
[pip3] torch-pitch-shift==1.2.4
[pip3] torchaudio==2.0.2
[pip3] torchmetrics==1.3.0.post0

Could it be related to awslabs/amazon-eks-ami#1523 ?

@chulkilee chulkilee added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels Apr 25, 2024
@chulkilee
Author

If it needs to be reported to https://github.com/awslabs/amazon-eks-ami/issues then please let me know.

@yeazelm
Contributor

yeazelm commented Apr 25, 2024

Hello @chulkilee, thanks for cutting this issue! I don't believe this would be related to GSP on g4dn.xlarge instances but you could follow #3817 (comment) just to confirm that isn't the problem.

The difference in the output between Bottlerocket and Amazon Linux for the module config is:

Bottlerocket: ModifyDeviceFiles: 1
Amazon Linux: ModifyDeviceFiles: 0

Bottlerocket: EnableGpuFirmware: 18
Amazon Linux: EnableGpuFirmware: 0

EnableGpuFirmware is the GSP change and ModifyDeviceFiles will disable dynamic device file management when set to 0.
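For reference, this kind of comparison can be reproduced mechanically from the two `cat /proc/driver/nvidia/params` outputs. The sketch below is illustrative only: the helper names are hypothetical, and the sample strings are shortened to three parameters taken from the outputs posted above.

```python
def parse_params(text):
    """Parse 'Key: value' lines from /proc/driver/nvidia/params into a dict."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that do not contain a colon
            params[key.strip()] = value.strip()
    return params


def diff_params(left, right):
    """Return {key: (left_value, right_value)} for keys whose values differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


# Shortened samples; in practice, paste the full output from each node.
bottlerocket = parse_params("ModifyDeviceFiles: 1\nEnableGpuFirmware: 18\nEnableMSI: 1")
amazon_linux = parse_params("ModifyDeviceFiles: 0\nEnableGpuFirmware: 0\nEnableMSI: 1")

print(diff_params(bottlerocket, amazon_linux))
# {'EnableGpuFirmware': ('18', '0'), 'ModifyDeviceFiles': ('1', '0')}
```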

What is strange is that pytorch is reporting that CUDA is not available when it really should be since the other things you called out are there.

Can you also confirm what your podspec looks like just to make sure all the right settings are being passed from that perspective?

@yeazelm
Contributor

yeazelm commented Apr 27, 2024

Hello @chulkilee, I just tried using an image from NVIDIA to confirm that pytorch can see the devices on a g4dn.xlarge node with latest bottlerocket and I don't get the same issue:

# python -c "import torch; print(torch.cuda.get_device_name(0))"
Tesla T4

Can you confirm which base container you are using and which CUDA version is included? I'm not able to replicate with the image I got.

@bryantbiggs
Contributor

@chulkilee do your container images contain the following environment variables?

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

If not, I would suggest adding them

@chulkilee
Author

chulkilee commented May 6, 2024

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

Those were set. I'm using the nvidia/cuda:11.8.0-base-ubuntu22.04 image, but it still fails.

Update

declare -x CUDA_VERSION="11.8.0"
declare -x NVIDIA_REQUIRE_CUDA="cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=tesla,driver>=515,driver<516 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516"
declare -x NV_CUDA_COMPAT_PACKAGE="cuda-compat-11-8"
declare -x NV_CUDA_CUDART_VERSION="11.8.89-1"

Even if I unset NVIDIA_REQUIRE_CUDA, it still fails with the same error.

I also tested the same image with 1.19.4-4f0a078e and 1.19.5-64049ba8 AMI releases - both failed.

@arnaldo2792
Contributor

@chulkilee , are you requesting GPUs in your pod specs? Or, do you need to oversubscribe your GPUs and thus you use NVIDIA_VISIBLE_DEVICES=all to get access to all the GPUs in the instance from your pod?

@vigh-m vigh-m added area/accelerated-computing Issues related to GPUs/ASICs and removed status/needs-triage Pending triage or re-evaluation labels May 14, 2024
@chulkilee
Author

chulkilee commented Aug 19, 2024

I tested the g5g.xlarge instance with the BOTTLEROCKET_ARM_64_NVIDIA image using the nvcr.io/nvidia/pytorch:24.03-py3 container. I observed that CUDA is only detected when the GPU resource is explicitly specified.

According to the Kubernetes documentation, GPUs can be utilized by requesting the custom GPU resource. However, the documentation does not clarify the expected behavior when the GPU resource is not specified, even if GPUs are available on the node.

"You can consume these GPUs from your containers by requesting the custom GPU resource, the same way you request CPU or memory. However, there are some limitations in how you specify the resource requirements for custom devices."

It's important to note that this behavior differs from what is observed when using the Amazon Linux image.

pod yaml

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g5g.xlarge
  containers:
    - name: shell
      image: nvcr.io/nvidia/pytorch:24.03-py3
      command: [sleep, "3600"]
    - name: shell2
      image: nvcr.io/nvidia/pytorch:24.03-py3
      command: [sleep, "3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
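To spot which containers in a manifest would be affected by this behavior, a small check over the pod spec can list the containers that set no nvidia.com/gpu limit. This is a sketch only: the function name is hypothetical, and the pod is written as a plain Python dict mirroring the YAML above, so no YAML library or Kubernetes client is assumed.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def containers_without_gpu(pod):
    """Return the names of containers that do not set a positive GPU limit."""
    names = []
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        # Limits may be written as integers or as strings such as "1".
        if int(limits.get(GPU_RESOURCE, 0)) < 1:
            names.append(container["name"])
    return names


# The pod from this comment, expressed as a dict.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-test"},
    "spec": {
        "nodeSelector": {"node.kubernetes.io/instance-type": "g5g.xlarge"},
        "containers": [
            {
                "name": "shell",
                "image": "nvcr.io/nvidia/pytorch:24.03-py3",
                "command": ["sleep", "3600"],
            },
            {
                "name": "shell2",
                "image": "nvcr.io/nvidia/pytorch:24.03-py3",
                "command": ["sleep", "3600"],
                "resources": {"limits": {GPU_RESOURCE: 1}},
            },
        ],
    },
}

print(containers_without_gpu(pod))  # ['shell']
```

Here `shell` is the container where CUDA was not detected on Bottlerocket, and `shell2` is the one that requests the GPU explicitly.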

@chiragjn
Contributor

chiragjn commented Sep 6, 2024

"I observed that CUDA is only detected when the GPU resource is explicitly specified."

I believe this started with my PR #3718, which enables correct allocation of GPUs.
The idea is that merely setting ENV NVIDIA_VISIBLE_DEVICES all in the image should not let a container steal all GPUs on the node.

"It's important to note that this behavior differs from what is observed when using the Amazon Linux image."

This is correct. Ideally, AL2 images should also enable the following config/env on the container toolkit

ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED=false
ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS=true

to enable correct allocation and isolation. Someone really needs to push for this, although it might face some backlash because it changes behavior in a backwards-incompatible way.
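The effect of these two settings can be illustrated with a simplified model of the device-selection decision. This is a sketch of the behavior as described in this thread, not the container toolkit's actual code: the function and parameter names are hypothetical, and it assumes that GPUs requested through nvidia.com/gpu reach the container as a volume-mount device list while NVIDIA_VISIBLE_DEVICES arrives as an environment variable.

```python
def visible_devices(env_devices, mount_devices, privileged,
                    accept_envvar_when_unprivileged, accept_as_volume_mounts):
    """Simplified model: which devices a container ends up seeing."""
    # A device list passed as volume mounts takes precedence when that strategy is enabled.
    if accept_as_volume_mounts and mount_devices:
        return list(mount_devices)
    # Otherwise fall back to NVIDIA_VISIBLE_DEVICES, if it was set at all.
    if not env_devices:
        return []
    # The environment variable is honored only for privileged containers,
    # or when unprivileged use of it is explicitly accepted.
    if privileged or accept_envvar_when_unprivileged:
        return list(env_devices)
    return []


# Settings from this comment: envvar not accepted when unprivileged, volume mounts accepted.
hardened = dict(accept_envvar_when_unprivileged=False, accept_as_volume_mounts=True)
# The opposite values, modeling a setup where the image's ENV alone grants access.
permissive = dict(accept_envvar_when_unprivileged=True, accept_as_volume_mounts=False)

# Image sets NVIDIA_VISIBLE_DEVICES=all but the pod requests no GPU.
print(visible_devices(["all"], [], False, **hardened))    # []
print(visible_devices(["all"], [], False, **permissive))  # ['all']
# Pod requests one GPU; "GPU-example" is a placeholder device identifier.
print(visible_devices(["all"], ["GPU-example"], False, **hardened))  # ['GPU-example']
```

Under the hardened settings, a container that only sets the environment variable sees no GPU, which matches what was reported for the `shell` container above.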

@chulkilee
Author

Thanks for the confirmation! I agree that allocating GPUs only when requested is the better behavior, but changing AL is not easy. I hope it happens on a major AL version bump.

I'm closing this as this is not an issue on the Bottlerocket side.

@arnaldo2792
Contributor

FWIW @chiragjn, there are users that relied on the NVIDIA_VISIBLE_DEVICES=all behavior that you helped us disable (thanks again). @chulkilee , we will be exposing an API to allow changing the default configurations for the settings described above in an upcoming release (see #4182), and we are planning to add support for time slicing soon, so that a GPU can be oversubscribed but with more control on which pods get access to it.


6 participants