nvidia/gpu-operator exposes all GPUs to a pod configured with securityContext.privileged=false #700

Closed
CuiDengdeng opened this issue Apr 16, 2024 · 4 comments

CuiDengdeng commented Apr 16, 2024

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04
  • Kernel Version: 5.15.0-67-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): k8s 1.28.2
  • GPU Operator Version: v22.9.0

2. Issue or feature description

Hi, I have reproduced issue #421: I deployed gpu-operator and then created a pod configured with securityContext.privileged=false. The pod is running, but why are all GPUs still exposed to it?

3. Steps to reproduce the issue

helm install nvidia/gpu-operator \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu \
  --set driver.enabled=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="volume-mounts" \
  --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set-string toolkit.env[0].value='false' \
  --set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
  --set-string toolkit.env[1].value='true' \
  --wait
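
With these values, the device plugin advertises allocated GPUs as volume mounts and the toolkit should ignore the NVIDIA_VISIBLE_DEVICES envvar of unprivileged containers. A quick sanity check is sketched below; the config path is the one referenced later in this thread, and the exact key names are an assumption based on the env var names, so adjust for your toolkit version.

# Sketch: confirm the operands are up and that the two toolkit settings
# landed in the generated config (run the grep on a GPU node).
kubectl get pods -n gpu
grep accept-nvidia-visible-devices \
  /usr/local/nvidia/toolkit/nvidia-container-runtime/config.toml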

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: docker.io/library/nginx:latest
      resources:
        limits:
          cpu: 900m
          nvidia.com/gpu: 1
        requests:
          cpu: 900m
          nvidia.com/gpu: 1

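Once the pod is Running, the sketch below checks what the container can actually see; with the settings above, an unprivileged pod requesting one GPU should list exactly one device (nvidia-smi is normally injected into GPU containers by the toolkit, so it is usually available even in a non-CUDA image such as nginx).

# Sketch: list GPUs and the NVIDIA env visible from inside the test pod.
kubectl exec cuda-vectoradd -- nvidia-smi -L
kubectl exec cuda-vectoradd -- env | grep -i nvidia
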
4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log

CuiDengdeng commented Apr 16, 2024

@hoangtnm @shivamerla @elezar Thanks!

shivamerla (Contributor) commented

@CuiDengdeng can you attach the toolkit config file /usr/local/nvidia/toolkit/nvidia-container-runtime/config.toml? Also, can you paste the output of nvidia-smi from within the test pod showing that it has access to all GPUs? Enabling debug mode in the toolkit config file and attaching the nvidia-container-runtime log would help as well.
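
For reference, a sketch of collecting the two items requested above (the config path is the one given in this comment; run the cat on the GPU node hosting the test pod). Debug logging is typically controlled by the debug keys in the [nvidia-container-cli] and [nvidia-container-runtime] sections of that file.

# Sketch: gather the requested artifacts.
cat /usr/local/nvidia/toolkit/nvidia-container-runtime/config.toml   # on the GPU node
kubectl exec cuda-vectoradd -- nvidia-smi                            # from the test pod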

CuiDengdeng (Author) commented

@shivamerla thanks, I have solved this problem, but I want to know which component initializes the NVIDIA_VISIBLE_DEVICES environment variable to all when a pod does not request a GPU.
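
For anyone hitting the same question: a quick way to check whether the variable comes from the image itself rather than from any cluster component is sketched below (the image reference is a placeholder, and this assumes Docker is available wherever you run it).

# Sketch: print the default environment baked into an image and look for
# NVIDIA_VISIBLE_DEVICES. <image> is a placeholder for the image under test.
docker image inspect --format '{{json .Config.Env}}' <image>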

cdesiniotis (Contributor) commented

@CuiDengdeng The NVIDIA_VISIBLE_DEVICES environment variable is set to all in the official CUDA images (nvcr.io/nvidia/cuda), so if your container image builds off a CUDA image, this envvar will be set.
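
For reference, a minimal sketch of overriding that image default in a pod spec (the pod and container names are hypothetical; "void" is one of the values the NVIDIA runtime treats as "behave like a plain runc container", i.e. expose no GPUs):

apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-pod            # hypothetical example
spec:
  restartPolicy: OnFailure
  containers:
    - name: app
      image: docker.io/library/nginx:latest
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "void"       # override any NVIDIA_VISIBLE_DEVICES=all baked into the image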

I am closing this issue since you have indicated your problem has been solved.
