nvidia/gpu-operator exposes all GPUs to a pod configured with securityContext.privileged=false #700

Closed
CuiDengdeng opened this issue Apr 16, 2024 · 4 comments

CuiDengdeng commented Apr 16, 2024

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04
  • Kernel Version: 5.15.0-67-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): k8s 1.28.2
  • GPU Operator Version: v22.9.0

2. Issue or feature description

Hi, I have reproduced issue #421: I deployed gpu-operator and then created a pod configured with securityContext.privileged=false. The pod is running, but why are all GPUs still exposed to it?

3. Steps to reproduce the issue

helm install nvidia/gpu-operator \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu \
  --set driver.enabled=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="volume-mounts" \
  --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set-string toolkit.env[0].value='false' \
  --set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
  --set-string toolkit.env[1].value='true' \
  --wait
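
With these values, the device plugin advertises allocated GPUs as volume mounts and the toolkit should ignore the NVIDIA_VISIBLE_DEVICES envvar of unprivileged containers. A quick sanity check is sketched below; the config path is the one referenced later in this thread, and the exact key names are an assumption based on the env var names, so adjust for your toolkit version.

# Sketch: confirm the operands are up and that the two toolkit settings
# landed in the generated config (run the grep on a GPU node).
kubectl get pods -n gpu
grep accept-nvidia-visible-devices \
  /usr/local/nvidia/toolkit/nvidia-container-runtime/config.toml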

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: docker.io/library/nginx:latest
      resources:
        limits:
          cpu: 900m
          nvidia.com/gpu: 1
        requests:
          cpu: 900m
          nvidia.com/gpu: 1

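Once the pod is Running, the sketch below checks what the container can actually see; with the settings above, an unprivileged pod requesting one GPU should list exactly one device (nvidia-smi is normally injected into GPU containers by the toolkit, so it is usually available even in a non-CUDA image such as nginx).

# Sketch: list GPUs and the NVIDIA env visible from inside the test pod.
kubectl exec cuda-vectoradd -- nvidia-smi -L
kubectl exec cuda-vectoradd -- env | grep -i nvidia
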
4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log

CuiDengdeng commented Apr 16, 2024

@hoangtnm @shivamerla @elezar Thanks!

shivamerla (Contributor) commented

@CuiDengdeng can you attach the toolkit config file /usr/local/nvidia/toolkit/nvidia-container-runtime/config.toml? Also, can you paste the output of nvidia-smi from within the test pod showing that it has access to all GPUs? Enabling debug mode in the toolkit config file and attaching the nvidia-container-runtime log would help as well.
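
For reference, a sketch of collecting the two items requested above (the config path is the one given in this comment; run the cat on the GPU node hosting the test pod). Debug logging is typically controlled by the debug keys in the [nvidia-container-cli] and [nvidia-container-runtime] sections of that file.

# Sketch: gather the requested artifacts.
cat /usr/local/nvidia/toolkit/nvidia-container-runtime/config.toml   # on the GPU node
kubectl exec cuda-vectoradd -- nvidia-smi                            # from the test pod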

CuiDengdeng (Author) commented

@shivamerla thanks, I have solved this problem, but I want to know which component initializes the NVIDIA_VISIBLE_DEVICES environment variable to all when a pod does not request a GPU.
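
For anyone hitting the same question: a quick way to check whether the variable comes from the image itself rather than from any cluster component is sketched below (the image reference is a placeholder, and this assumes Docker is available wherever you run it).

# Sketch: print the default environment baked into an image and look for
# NVIDIA_VISIBLE_DEVICES. <image> is a placeholder for the image under test.
docker image inspect --format '{{json .Config.Env}}' <image>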

cdesiniotis (Contributor) commented

@CuiDengdeng The NVIDIA_VISIBLE_DEVICES environment variable is set to all in the official CUDA images (nvcr.io/nvidia/cuda), so if your container image builds off a CUDA image, this envvar will be set.
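
For reference, a minimal sketch of overriding that image default in a pod spec (the pod and container names are hypothetical; "void" is one of the values the NVIDIA runtime treats as "behave like a plain runc container", i.e. expose no GPUs):

apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-pod            # hypothetical example
spec:
  restartPolicy: OnFailure
  containers:
    - name: app
      image: docker.io/library/nginx:latest
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "void"       # override any NVIDIA_VISIBLE_DEVICES=all baked into the image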

I am closing this issue since you have indicated your problem has been solved.
