Nesting Containers creates issues with CUDA/NVIDIA toolkit #1205

Closed
briedel opened this issue Mar 14, 2023 · 1 comment · Fixed by #1220

briedel commented Mar 14, 2023

Version of Apptainer

What version of Apptainer (or Singularity) are you using? Run apptainer version:

Apptainer> apptainer version
1.1.6-1.el8

Expected behavior

When running nested containers, I expect to be able to access the GPUs in the nested container, i.e. GPUs are passed through from the host to Container 1 and from there to Container 2.

Actual behavior

Running nvidia-smi in Container 2 produces the following error:

Apptainer> nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

This is caused by differences in the contents of /.singularity.d/libs between Container 1 and Container 2:

Container 1:

Apptainer> ls -l /.singularity.d/libs/
total 505260
-rwxr-xr-x 1 nobody nobody     85288 Nov 15  2020 libEGL.so.1
-rwxr-xr-x 1 nobody nobody   1329168 Jul 20  2022 libEGL_nvidia.so.0
-rwxr-xr-x 1 nobody nobody    559520 Nov 15  2020 libGL.so.1
-rwxr-xr-x 1 nobody nobody     43216 Nov 15  2020 libGLESv1_CM.so.1
-rwxr-xr-x 1 nobody nobody     67880 Jul 20  2022 libGLESv1_CM_nvidia.so.1
-rwxr-xr-x 1 nobody nobody     72720 Nov 15  2020 libGLESv2.so.2
-rwxr-xr-x 1 nobody nobody    117032 Jul 20  2022 libGLESv2_nvidia.so.2
-rwxr-xr-x 1 nobody nobody    141824 Nov 15  2020 libGLX.so.0
-rwxr-xr-x 1 nobody nobody   1289552 Jul 20  2022 libGLX_nvidia.so.0
-rwxr-xr-x 1 nobody nobody    769056 Nov 15  2020 libGLdispatch.so.0
-rwxr-xr-x 1 nobody nobody     30856 Jun  8  2022 libOpenCL.so.1
-rwxr-xr-x 1 nobody nobody    178608 Nov 15  2020 libOpenGL.so.0
-rwxr-xr-x 1 nobody nobody  20988000 Jul 20  2022 libcuda.so
-rwxr-xr-x 1 nobody nobody  20988000 Jul 20  2022 libcuda.so.1
-rwxr-xr-x 1 nobody nobody   5253928 Jul 20  2022 libnvcuvid.so
-rwxr-xr-x 1 nobody nobody   5253928 Jul 20  2022 libnvcuvid.so.1
-rwxr-xr-x 1 nobody nobody    246128 Jul 20  2022 libnvidia-cfg.so
-rwxr-xr-x 1 nobody nobody    246128 Jul 20  2022 libnvidia-cfg.so.1
-rwxr-xr-x 1 nobody nobody  56227320 Jul 20  2022 libnvidia-compiler.so.515.65.01
-rwxr-xr-x 1 nobody nobody     50304 Oct 27  2020 libnvidia-egl-wayland.so.1
-rwxr-xr-x 1 nobody nobody  33855376 Jul 20  2022 libnvidia-eglcore.so.515.65.01
-rwxr-xr-x 1 nobody nobody    116768 Jul 20  2022 libnvidia-encode.so
-rwxr-xr-x 1 nobody nobody    116768 Jul 20  2022 libnvidia-encode.so.1
-rwxr-xr-x 1 nobody nobody    133816 Jul 20  2022 libnvidia-fbc.so
-rwxr-xr-x 1 nobody nobody    133816 Jul 20  2022 libnvidia-fbc.so.1
-rwxr-xr-x 1 nobody nobody  36767184 Jul 20  2022 libnvidia-glcore.so.515.65.01
-rwxr-xr-x 1 nobody nobody    643944 Jul 20  2022 libnvidia-glsi.so.515.65.01
-rwxr-xr-x 1 nobody nobody  16704184 Jul 20  2022 libnvidia-glvkspirv.so.515.65.01
-rwxr-xr-x 1 nobody nobody   1658976 Jul 21  2022 libnvidia-gtk3.so.515.65.01
-rwxr-xr-x 1 nobody nobody   1683960 Jul 20  2022 libnvidia-ml.so
-rwxr-xr-x 1 nobody nobody   1683960 Jul 20  2022 libnvidia-ml.so.1
-rwxr-xr-x 1 nobody nobody  16310984 Jul 20  2022 libnvidia-opencl.so.1
-rwxr-xr-x 1 nobody nobody     47088 Jul 20  2022 libnvidia-opticalflow.so.1
-rwxr-xr-x 1 nobody nobody  11488760 Jul 20  2022 libnvidia-ptxjitcompiler.so
-rwxr-xr-x 1 nobody nobody  11488760 Jul 20  2022 libnvidia-ptxjitcompiler.so.1
-rwxr-xr-x 1 nobody nobody  81405472 Jul 20  2022 libnvidia-rtcore.so.515.65.01
-rwxr-xr-x 1 nobody nobody     18456 Jul 20  2022 libnvidia-tls.so.515.65.01
-rwxr-xr-x 1 nobody nobody 189105240 Jul 20  2022 libnvoptix.so.1


Container 2:

Apptainer> ls -l /.singularity.d/libs/
total 466268
-rwxr-xr-x 1 nobody nobody     85288 Nov 15  2020 libEGL.so.1
-rwxr-xr-x 1 nobody nobody   1329168 Jul 20  2022 libEGL_nvidia.so.0
-rwxr-xr-x 1 nobody nobody    559520 Nov 15  2020 libGL.so.1
-rwxr-xr-x 1 nobody nobody     43216 Nov 15  2020 libGLESv1_CM.so.1
-rwxr-xr-x 1 nobody nobody     67880 Jul 20  2022 libGLESv1_CM_nvidia.so.1
-rwxr-xr-x 1 nobody nobody     72720 Nov 15  2020 libGLESv2.so.2
-rwxr-xr-x 1 nobody nobody    117032 Jul 20  2022 libGLESv2_nvidia.so.2
-rwxr-xr-x 1 nobody nobody    141824 Nov 15  2020 libGLX.so.0
-rwxr-xr-x 1 nobody nobody   1289552 Jul 20  2022 libGLX_nvidia.so.0
-rwxr-xr-x 1 nobody nobody    769056 Nov 15  2020 libGLdispatch.so.0
-rwxr-xr-x 1 nobody nobody     30856 Jun  8  2022 libOpenCL.so.1
-rwxr-xr-x 1 nobody nobody    178608 Nov 15  2020 libOpenGL.so.0
-rwxr-xr-x 1 nobody nobody  20988000 Jul 20  2022 libcuda.so
-rwxr-xr-x 1 nobody nobody   5253928 Jul 20  2022 libnvcuvid.so
-rwxr-xr-x 1 nobody nobody    246128 Jul 20  2022 libnvidia-cfg.so
-rwxr-xr-x 1 nobody nobody  56227320 Jul 20  2022 libnvidia-compiler.so.515.65.01
-rwxr-xr-x 1 nobody nobody     50304 Oct 27  2020 libnvidia-egl-wayland.so.1
-rwxr-xr-x 1 nobody nobody  33855376 Jul 20  2022 libnvidia-eglcore.so.515.65.01
-rwxr-xr-x 1 nobody nobody    116768 Jul 20  2022 libnvidia-encode.so
-rwxr-xr-x 1 nobody nobody    133816 Jul 20  2022 libnvidia-fbc.so
-rwxr-xr-x 1 nobody nobody  36767184 Jul 20  2022 libnvidia-glcore.so.515.65.01
-rwxr-xr-x 1 nobody nobody    643944 Jul 20  2022 libnvidia-glsi.so.515.65.01
-rwxr-xr-x 1 nobody nobody  16704184 Jul 20  2022 libnvidia-glvkspirv.so.515.65.01
-rwxr-xr-x 1 nobody nobody   1658976 Jul 21  2022 libnvidia-gtk3.so.515.65.01
-rwxr-xr-x 1 nobody nobody   1683960 Jul 20  2022 libnvidia-ml.so
-rwxr-xr-x 1 nobody nobody  16310984 Jul 20  2022 libnvidia-opencl.so.1
-rwxr-xr-x 1 nobody nobody     47088 Jul 20  2022 libnvidia-opticalflow.so.1
-rwxr-xr-x 1 nobody nobody  11488760 Jul 20  2022 libnvidia-ptxjitcompiler.so
-rwxr-xr-x 1 nobody nobody  81405472 Jul 20  2022 libnvidia-rtcore.so.515.65.01
-rwxr-xr-x 1 nobody nobody     18456 Jul 20  2022 libnvidia-tls.so.515.65.01
-rwxr-xr-x 1 nobody nobody 189105240 Jul 20  2022 libnvoptix.so.1
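
The difference between the two listings can be computed rather than read off by eye. The following is a minimal sketch, not part of the original report: the helper names library_names and missing_in_nested are hypothetical, and the sample input is a shortened excerpt of the listings above. Applied to the full listings, the same comparison yields seven files that are present in Container 1 but absent from Container 2: libcuda.so.1, libnvcuvid.so.1, libnvidia-cfg.so.1, libnvidia-encode.so.1, libnvidia-fbc.so.1, libnvidia-ml.so.1 and libnvidia-ptxjitcompiler.so.1.

```python
def library_names(ls_output):
    """Extract file names from `ls -l` output, skipping the `total` line."""
    names = set()
    for line in ls_output.splitlines():
        fields = line.split()
        # A regular-file entry has 9 fields here; the file name is the last one.
        if len(fields) >= 9 and fields[0].startswith("-"):
            names.add(fields[-1])
    return names


def missing_in_nested(outer_ls, nested_ls):
    """Return names listed in the outer container but not in the nested one."""
    return sorted(library_names(outer_ls) - library_names(nested_ls))


# Shortened excerpts of the two listings from this report.
outer = """\
total 505260
-rwxr-xr-x 1 nobody nobody  20988000 Jul 20  2022 libcuda.so
-rwxr-xr-x 1 nobody nobody  20988000 Jul 20  2022 libcuda.so.1
-rwxr-xr-x 1 nobody nobody   1683960 Jul 20  2022 libnvidia-ml.so
-rwxr-xr-x 1 nobody nobody   1683960 Jul 20  2022 libnvidia-ml.so.1
"""
nested = """\
total 466268
-rwxr-xr-x 1 nobody nobody  20988000 Jul 20  2022 libcuda.so
-rwxr-xr-x 1 nobody nobody   1683960 Jul 20  2022 libnvidia-ml.so
"""

print(missing_in_nested(outer, nested))
# prints ['libcuda.so.1', 'libnvidia-ml.so.1']
```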

Steps to reproduce this behavior

[riedel1@gpub019 ~]$ apptainer run --contain -B /tmp/glidein:/pilot  --bind /dev/fuse --nv --bind /etc/OpenCL/vendors /tmp/glidein/osgvo-pilot.sif /bin/bash
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (751) bind mounts
INFO:    underlay of /etc/OpenCL/vendors required more than 50 (153) bind mounts
No CVMFS repos requested, skipping cvmfsexec.
Apptainer> nvidia-smi
Tue Mar 14 14:11:35 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:46:00.0 Off |                    0 |
|  0%   27C    P8    30W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Apptainer> grep
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.
Apptainer> env | grep "APPTAINER"
APPTAINER_COMMAND=run
APPTAINER_CONTAINER=/tmp/glidein/osgvo-pilot.sif
APPTAINER_BIND=/pilot,/dev/fuse,/etc/OpenCL/vendors
APPTAINER_NAME=osgvo-pilot.sif
APPTAINER_APPNAME=
APPTAINER_ENVIRONMENT=/.singularity.d/env/91-environment.sh
Apptainer> apptainer run --contain --nv /pilot/osgvo-pilot.sif /bin/bash
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (751) bind mounts
INFO:    underlay of /etc/OpenCL/vendors required more than 50 (153) bind mounts
No CVMFS repos requested, skipping cvmfsexec.
Apptainer> nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
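
One way to check which library names the dynamic loader can resolve inside a container is to try opening them directly. This is a minimal diagnostic sketch, not taken from the report: can_load is a hypothetical helper, and it assumes Python is available inside the container. It tests both the unversioned and the versioned name because, as the listings above show, only the versioned files go missing in Container 2.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can find and open library_name."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


for name in ("libnvidia-ml.so", "libnvidia-ml.so.1"):
    print(name, "loads" if can_load(name) else "does not load")
```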

I worked around this by bind-mounting the missing library, libnvidia-ml.so.1, into Container 2:

Apptainer> apptainer run --contain --nv -B /.singularity.d/libs/libnvidia-ml.so.1 /pilot/osgvo-pilot.sif /bin/bash
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (751) bind mounts
INFO:    underlay of /etc/OpenCL/vendors required more than 50 (153) bind mounts
No CVMFS repos requested, skipping cvmfsexec.
Apptainer> nvidia-smi
Tue Mar 14 14:22:25 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:46:00.0 Off |                    0 |
|  0%   28C    P8    30W / 300W |      2MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
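
The nvidia-smi output above still reports "CUDA Version: N/A", which may indicate that other versioned libraries, such as libcuda.so.1, are also missing in Container 2. The workaround could be extended to every affected file. The sketch below is an assumption-laden illustration, not an Apptainer feature: versioned_siblings and nested_command are hypothetical helpers, and the selection rule (bind libX.so.N only when libX.so sits next to it) is a heuristic derived from the listings in this report, where exactly those files are the ones that disappear. It uses only the flags that already appear in the commands above.

```python
from pathlib import Path


def versioned_siblings(libs_dir):
    """Return paths of libX.so.N files that have a libX.so next to them."""
    names = {p.name for p in Path(libs_dir).iterdir()}
    result = []
    for name in sorted(names):
        base, sep, version = name.partition(".so.")
        # Keep single-number suffixes (e.g. ".so.1") with an unversioned sibling.
        if sep and version.isdigit() and base + ".so" in names:
            result.append(str(Path(libs_dir) / name))
    return result


def nested_command(image, libs_dir="/.singularity.d/libs"):
    """Build the nested `apptainer run` command with one -B per library."""
    command = ["apptainer", "run", "--contain", "--nv"]
    for path in versioned_siblings(libs_dir):
        command += ["-B", path]
    return command + [image, "/bin/bash"]


# Only prints a command when run inside a container that has the directory.
libs = "/.singularity.d/libs"
if Path(libs).is_dir():
    print(" ".join(nested_command("/pilot/osgvo-pilot.sif", libs)))
```

The function only builds the command line; whether binding every such file is safe on a given system has not been verified here.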

How did you install Apptainer

The first Apptainer comes from CVMFS (oasis.opensciencegrid.org/apptainer/mis/apptainer/bin); the second comes from the RPM.


DrDaveD commented Mar 14, 2023

I believe the problem is that the --nv option loses the symlink connection between .so files and their corresponding underlying versioned library, because it does a bind mount for the .so. I think that if it instead created a symlink for .so files in /.singularity.d/libs, it would solve this problem.
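
The idea of keeping a symlink instead of a second bind mount can be illustrated outside of Apptainer. The following sketch only simulates the proposed layout in a temporary directory; it is not Apptainer's implementation, and link_unversioned is a hypothetical helper. It shows that a relative symlink in the libs directory keeps the unversioned name tied to the versioned file, so anything that later resolves the .so name can still discover the .so.1 file behind it.

```python
import os
import tempfile
from pathlib import Path


def link_unversioned(libs_dir, versioned_name):
    """Create libX.so in libs_dir as a relative symlink to libX.so.N."""
    base = versioned_name.split(".so.")[0] + ".so"
    link = Path(libs_dir) / base
    # The target is relative, so it resolves inside libs_dir wherever
    # that directory ends up being mounted.
    link.symlink_to(versioned_name)
    return link


with tempfile.TemporaryDirectory() as libs_dir:
    # Stand-in for the real driver library.
    (Path(libs_dir) / "libnvidia-ml.so.1").write_bytes(b"stub")
    link = link_unversioned(libs_dir, "libnvidia-ml.so.1")
    print(link.name, "->", os.readlink(link))
    # prints libnvidia-ml.so -> libnvidia-ml.so.1
```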
