GPU Support not working #28

Closed
alexcpn opened this issue May 17, 2021 · 6 comments

alexcpn commented May 17, 2021

The NVIDIA container runtime is installed:

alexpunnen@pop-os:~$ sudo apt install nvidia-container-runtime
Reading package lists... Done
Building dependency tree       
Reading state information... Done
nvidia-container-runtime is already the newest version (3.4.0-1pop1~1601325114~20.04~2880fc6).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.

Running tensorman with --gpu gives this error:

alexpunnen@pop-os:~$ tensorman run --gpu python -- ./script.py
"docker" "run" "-u" "1000:1000" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/home/alexpunnen:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu" "python" "./script.py"
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

The NVIDIA driver is installed:

alexpunnen@pop-os:~$ nvidia-smi
Mon May 17 19:46:54 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce MX250       Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   55C    P0    N/A /  N/A |    260MiB /  2002MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       900      G   /usr/lib/xorg/Xorg                 45MiB |
|    0   N/A  N/A     14737      G   /usr/lib/xorg/Xorg                141MiB |
|    0   N/A  N/A     14867      G   /usr/bin/gnome-shell               24MiB |
|    0   N/A  N/A     18812      G   ...AAAAAAAAA= --shared-files       40MiB |
+-----------------------------------------------------------------------------+

Should we install nvidia-docker2? I was not able to install it:

alexpunnen@pop-os:~$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
>   sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Unsupported distribution!
# Check https://nvidia.github.io/nvidia-docker
alexpunnen@pop-os:~$ sudo apt-get install nvidia-docker2
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package nvidia-docker2

alexpunnen@pop-os:~$ cat /etc/os-release
NAME="Pop!_OS"
VERSION="20.04 LTS"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 20.04 LTS"
VERSION_ID="20.04"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
LOGO=distributor-logo-pop-os
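The "Unsupported distribution!" reply above is what NVIDIA's repository returns for a distribution string it does not know, and Pop!_OS reports ID=pop. A minimal sketch of deriving the string from os-release and substituting the Ubuntu name follows. The function name nvidia_distribution is hypothetical, and the pop-to-ubuntu mapping is an assumption based on ID_LIKE="ubuntu debian" above, not something NVIDIA documents for Pop!_OS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the distribution string for NVIDIA's repo URL.
# Takes the path to an os-release file (defaults to /etc/os-release).
nvidia_distribution() {
    local os_release="${1:-/etc/os-release}"
    (
        # Source in a subshell so ID and VERSION_ID do not leak into the caller.
        . "$os_release"
        # Assumption: Pop!_OS can use the matching Ubuntu repository.
        if [ "$ID" = "pop" ]; then
            ID=ubuntu
        fi
        echo "${ID}${VERSION_ID}"
    )
}
```

Usage would be distribution=$(nvidia_distribution) before the curl command above. The repository list then downloads, but the packages in it can still conflict with the Pop!_OS build of nvidia-container-runtime.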

alexcpn commented May 17, 2021

I set distribution=ubuntu20.04 and tried the install again, but it failed on an unmet dependency:

distribution=ubuntu20.04
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
>   sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get install nvidia-docker2
... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-docker2 : Depends: nvidia-container-runtime (>= 3.5.0) but 3.4.0-1pop1~1601325114~20.04~2880fc6 is to be installed
E: Unable to correct problems, you have held broken packages.

This looks like NVIDIA/nvidia-docker#1388 (comment). I followed the workaround given there (NVIDIA/nvidia-docker#1388 (comment)) and was able to install nvidia-docker2.
CUDA in Docker now works with --runtime=nvidia, but tensorman still gives the same error:

sudo docker run --rm --runtime=nvidia -ti nvidia/cuda:11.3.0-base-ubuntu20.04
Unable to find image 'nvidia/cuda:11.3.0-base-ubuntu20.04' locally
11.3.0-base-ubuntu20.04: Pulling from nvidia/cuda
a70d879fa598: Pull complete 
c4394a92d1f8: Pull complete 
10e6159c56c0: Pull complete 
f1ff119ac131: Pull complete 
3e2dbc551fee: Pull complete 
4f57fe919a49: Pull complete 
216bbbf373ef: Pull complete 
Digest: sha256:7939995fc912a21e62be16c866b62e14d383ef16ed288f1d17268ba0b7226574
Status: Downloaded newer image for nvidia/cuda:11.3.0-base-ubuntu20.04
root@159db2b73ded:/# nvidia-smi
Mon May 17 17:07:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce MX250       Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   54C    P0    N/A /  N/A |    292MiB /  2002MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
alexpunnen@pop-os:~$ sudo tensorman run --gpu bash
"docker" "run" "-u" "0:0" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/home/alexpunnen:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu" "bash"
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
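The two transcripts above use different Docker code paths: the working one passes --runtime=nvidia, while tensorman passes --gpus=all. A small sketch for comparing both against the same image might look like the following. The function name probe_gpu_paths and the DOCKER override variable are hypothetical, added only so the check can be dry-run with a stub.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try both GPU code paths seen in this thread and report each.
# DOCKER can be overridden (for example with a stub) for dry runs.
probe_gpu_paths() {
    local docker="${DOCKER:-docker}"
    local image="${1:-nvidia/cuda:11.3.0-base-ubuntu20.04}"
    local flag
    for flag in --runtime=nvidia --gpus=all; do
        # Run nvidia-smi in a throwaway container and keep only the exit status.
        if "$docker" run --rm "$flag" "$image" nvidia-smi >/dev/null 2>&1; then
            echo "$flag: ok"
        else
            echo "$flag: failed"
        fi
    done
}
```

If only --gpus=all fails, the problem is on the Docker side rather than in tensorman, since tensorman just prints and runs the docker command shown above.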

Aarivvk commented Jan 2, 2022

This issue is keeping me from upgrading to 21.10.
Kindly fix the issue or give us a workaround until then.

Aarivvk commented Jan 3, 2022

NVIDIA/nvidia-docker#1447 (comment)
It looks like this could solve the issue.
I'll try it and let you know.

Found your comment there as well 👍 @alexcpn
NVIDIA/nvidia-docker#1447 (comment)

Aarivvk commented Jan 3, 2022

The solution below works for me.
sudo gedit /etc/default/grub
Append systemd.unified_cgroup_hierarchy=0 to the end of GRUB_CMDLINE_LINUX_DEFAULT, like this:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"
Save the file, then update GRUB:
sudo update-grub
Finally, reboot, and Docker should work with GPUs.

NOTE1: This change to cgroups should be handled with care. Apps that depend on cgroups may stop working. More info here: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/ch01
NOTE2: There are other ways to fix this. The links above describe fixes that change the Docker config and share a few devices with the container. The one here is just the simple fix.
NOTE3: I personally would prefer the approach in NOTE2.
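To confirm after the reboot that the option took effect, a quick check along these lines may help. This is a sketch, not part of the fix: the function names cgroup_version and has_cgroup_override are hypothetical. It relies on the cgroup mount reporting the filesystem type cgroup2fs when the unified (v2) hierarchy is active.

```shell
#!/usr/bin/env bash
# Print "v2" when the given cgroup mount is the unified hierarchy, else "v1".
cgroup_version() {
    local fstype
    fstype=$(stat -fc %T "${1:-/sys/fs/cgroup}")
    if [ "$fstype" = "cgroup2fs" ]; then
        echo v2
    else
        echo v1
    fi
}

# Succeed when the kernel command line (file in $1) contains the override.
has_cgroup_override() {
    grep -qwF 'systemd.unified_cgroup_hierarchy=0' "${1:-/proc/cmdline}"
}
```

With the workaround applied, cgroup_version should print v1 and has_cgroup_override should succeed when called with no arguments.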

mmstick commented Jan 3, 2022

I'm working on it

mmstick commented Jan 5, 2022

Fixed, but it will still require systemd.unified_cgroup_hierarchy=0 to be added as a kernel option. On EFI systems, run sudo kernelstub -a systemd.unified_cgroup_hierarchy=0. This is needed at least until NVIDIA releases v1.8.0 of their container runtime tools.

Updates will be available on Impish soon, with a new nvidia-docker2 package that replaces nvidia-container-runtime.
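For anyone unsure which of the two methods in this thread applies, a rough sketch that picks between them is below. It assumes the presence of /sys/firmware/efi means the system booted via EFI and therefore uses kernelstub, and that anything else uses GRUB. The function name suggest_cgroup_cmd is hypothetical, and it only prints the suggestion rather than changing anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how to add the kernel option on this machine.
# The EFI directory can be passed in as $1 for testing.
suggest_cgroup_cmd() {
    local efi_dir="${1:-/sys/firmware/efi}"
    local opt='systemd.unified_cgroup_hierarchy=0'
    if [ -d "$efi_dir" ]; then
        # EFI boot: kernelstub manages the kernel options.
        echo "sudo kernelstub -a $opt"
    else
        # Legacy boot: the option goes into the GRUB defaults file.
        echo "add $opt to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run: sudo update-grub"
    fi
}
```

Either way, a reboot is needed before the option takes effect.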

mmstick closed this as completed Jan 5, 2022