pytorch could not detect Nvidia driver on bottlerocket #3916
If it needs to be reported to https://github.com/awslabs/amazon-eks-ami/issues, then please let me know.
Hello @chulkilee, thanks for cutting this issue! I don't believe this would be related to GSP on g4dn.xlarge instances, but you could follow #3817 (comment) just to confirm that isn't the problem. The difference in the output between Bottlerocket and Amazon Linux for the module config is:
EnableGpuFirmware is the GSP change, and ModifyDeviceFiles will disable dynamic device file management when set to 0. What is strange is that pytorch reports that CUDA is not available when it really should be, since the other things you called out are there. Can you also confirm what your pod spec looks like, just to make sure all the right settings are being passed from that perspective?
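A minimal sketch (not from the original thread) for pulling just the two module parameters named above out of /proc/driver/nvidia/params, so the values can be compared between the two AMIs:

```python
# Print only the NVIDIA module parameters discussed above.
# Assumes the standard procfs path exposed by the NVIDIA kernel driver.
from pathlib import Path

INTERESTING = {"EnableGpuFirmware", "ModifyDeviceFiles"}

for line in Path("/proc/driver/nvidia/params").read_text().splitlines():
    name = line.split(":", 1)[0].strip()
    if name in INTERESTING:
        print(line.strip())
```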
Hello @chulkilee, I just tried using an image from NVIDIA to confirm that pytorch can see the devices on a g4dn.xlarge node with the latest Bottlerocket, and I don't get the same issue:
Can you confirm which base container you are using and which CUDA version is included? I'm not able to replicate with the image I got.
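For anyone reproducing this, a quick check to run inside the container (hedged: it assumes torch is installed in the image) that mirrors the relevant parts of what torch.utils.collect_env reports:

```python
# Quick CUDA visibility check inside the container.
import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
```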
@chulkilee, do your container images contain the following environment variables?
If not, I would suggest adding them.
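The exact variables referenced in the comment above weren't preserved in this thread; NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES are the ones the NVIDIA container toolkit commonly consumes, so a small sketch to verify them from inside the container might look like:

```python
# Print the NVIDIA-related environment variables inside the container.
# The variable names are an assumption; the ones the maintainer listed
# were not preserved in this thread.
import os

for var in ("NVIDIA_VISIBLE_DEVICES", "NVIDIA_DRIVER_CAPABILITIES"):
    print(f"{var}={os.environ.get(var, '<unset>')}")
```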
Those were set. I'm using

Update: even after I unset them, I also tested the same image with
@chulkilee, are you requesting GPUs in your pod specs? Or do you need to oversubscribe your GPUs, and thus you use
I tested the GPU resource request. According to the Kubernetes documentation, GPUs can be utilized by requesting the custom GPU resource. However, the documentation does not clarify the expected behavior when the GPU resource is not specified, even if GPUs are available on the node.
It's important to note that this behavior differs from what is observed when using the Amazon Linux image.

Pod yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g5g.xlarge
  containers:
  - name: shell
    image: nvcr.io/nvidia/pytorch:24.03-py3
    command: [sleep, "3600"]
  - name: shell2
    image: nvcr.io/nvidia/pytorch:24.03-py3
    command: [sleep, "3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
```
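As a follow-up check (not from the original thread), something like the following could be run via kubectl exec in both the shell and shell2 containers above; once allocation is enforced, only the container that requests nvidia.com/gpu should report a device:

```python
# Run in each container of the pod above to compare GPU visibility.
import torch

print("cuda available:", torch.cuda.is_available())
print("visible device count:", torch.cuda.device_count())
```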
I believe this is a result of my PR #3718, which enables correct allocation of GPUs.
This is correct; ideally AL2 images should also enable the following config/env on the container toolkit

to enable correct allocation and isolation. Someone really needs to push for this; alas, it might face some backlash because it changes behavior in a backwards-incompatible way.
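The specific toolkit settings referenced above are not quoted in this thread; as a rough sketch, dumping the container toolkit config (assuming the conventional /etc/nvidia-container-runtime/config.toml path) lets the allocation- and isolation-related keys be compared between AMIs:

```python
# Dump the NVIDIA container runtime config so its settings can be compared
# between the Bottlerocket and AL2 hosts. The path is the conventional one
# for the NVIDIA container toolkit and may differ per distro.
from pathlib import Path

cfg = Path("/etc/nvidia-container-runtime/config.toml")
print(cfg.read_text() if cfg.exists() else f"{cfg} not found")
```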
Thanks for the confirmation! I agree that allocating GPUs only when requested is better behavior, but changing AL is not easy; I hope it happens on a major AL version bump. I'm closing this, as it is not an issue on the Bottlerocket side.
FWIW @chiragjn, there are users that relied on the
Sorry, I don't have all the details, but I'd like to report that I had issues using pytorch on the Bottlerocket image for EKS.
When I switched to the AL2 GPU AMI, it worked without an issue.
AMI
In both AMIs the nvidia kernel module seems to be loaded, but with different params.
`cat /proc/driver/nvidia/version`
BOTTLEROCKET_x86_64_NVIDIA:
AL2_x86_64_GPU:
`cat /proc/driver/nvidia/params`
BOTTLEROCKET_x86_64_NVIDIA:
AL2_x86_64_GPU:
However, pytorch failed to detect the driver on Bottlerocket.
Only in BOTTLEROCKET_x86_64_NVIDIA:
`python -m torch.utils.collect_env`
BOTTLEROCKET_x86_64_NVIDIA:
AL2_x86_64_GPU:
Python packages used:
Could it be related to awslabs/amazon-eks-ami#1523?