shiftfs error on /usr/bin/nvidia-persistenced with NVIDIA GPU Operator #111

Open
spikecurtis opened this issue Nov 26, 2024 · 5 comments
Labels: bug (Something isn't working), must-do

Comments

@spikecurtis

{"output":"Failed to run envbox: start container: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: failed to request shiftfs marking to sysbox-mgr: failed to invoke ReqShiftfsMark via grpc: rpc error: code = Unknown desc = failed to mark shiftfs on /usr/bin/nvidia-persistenced at /var/lib/sysbox/shiftfs/5b2aad2e-4651-49bd-aed9-40665640bc00: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type","time":"2024-11-25T16:47:05.945875989Z","type":"error"}      

The customer reporting this is using the "gpu-operator" chart, installed via Helm from https://helm.ngc.nvidia.com/nvidia, at version v24.6.1.

They are using ghcr.io/coder/envbox:latest as of last week.

@spikecurtis spikecurtis added the bug Something isn't working label Nov 26, 2024
@bpmct
Member

bpmct commented Dec 12, 2024

@deansheather @johnstcn do either of you have bandwidth to reproduce this and see if we can fix it this sprint? We're pulling it into this sprint because there is a customer request.

@bpmct bpmct added the must-do label Dec 12, 2024
@johnstcn
Member

johnstcn commented Dec 13, 2024

@bpmct I'll take a look at this, will need to gather some more information first to reproduce.

@johnstcn johnstcn self-assigned this Dec 13, 2024
@johnstcn
Member

johnstcn commented Dec 13, 2024

Took a stab at this on GKE (kernel 5.15.0-1067, Ubuntu 22.04.5, NVIDIA Driver version 550.90.12, operator version v24.6.1, idmapped mounts disabled), but couldn't reproduce this exact issue with the same version of the operator.

I did find some other issues we'll want to address, but can't be sure they're directly related to the original issue:

  1. We appear to ignore libraries that don't end exactly with .so:
lrwxrwxrwx 1 root root       26 Dec 13 21:51 libEGL_nvidia.so.0 -> libEGL_nvidia.so.550.90.12
-rw-r--r-- 1 root root  1341568 Aug 29 05:30 libEGL_nvidia.so.550.90.12

would end up becoming this:

-rw-r--r-- 1 nobody nogroup  1341568 Aug 29 05:30 libEGL_nvidia.so.550.90.12

So in the above example you'd be missing the libEGL_nvidia.so.0 symlink, which is the name applications may look for.

  2. When mounting in the external symlinks, we assume that the destination prefix in the inner container is /usr/lib, but I'm seeing the operator add the libs under /usr/lib64:
/dev/sda1 on /usr/lib64/libEGL_nvidia.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libGLESv1_CM_nvidia.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libGLESv2_nvidia.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libGLX_nvidia.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libcuda.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libcudadebugger.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-allocator.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-cfg.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-egl-gbm.so.1.1.1 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-eglcore.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-glcore.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-glsi.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-glvkspirv.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-gpucomp.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-ml.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-ngx.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-nvvm.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-opencl.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-pkcs11-openssl3.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-pkcs11.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-ptxjitcompiler.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-rtcore.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvidia-tls.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/libnvoptix.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)
/dev/sda1 on /usr/lib64/nvidia/xorg/libglxserver_nvidia.so.550.90.12 type ext4 (ro,nosuid,nodev,relatime,discard,errors=remount-ro)

I figure we'd want to use the same dest prefix as flags.usrLibDir.

@johnstcn
Member

Got some more details, will attempt a bare metal repro.

@johnstcn
Member

After some clarification, no need to attempt bare metal repro. We'll still want to address the above issues found on GKE, either through documentation or code.

3 participants