GPU Support not working #28

alexcpn · 2021-05-17T14:17:53Z

nvidia container runtime is installed

alexpunnen@pop-os:~$ sudo apt install nvidia-container-runtime
Reading package lists... Done
Building dependency tree       
Reading state information... Done
nvidia-container-runtime is already the newest version (3.4.0-1pop1~1601325114~20.04~2880fc6).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.

Error

alexpunnen@pop-os:~$ tensorman run --gpu python -- ./script.py
"docker" "run" "-u" "1000:1000" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/home/alexpunnen:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu" "python" "./script.py"
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Driver is installed

alexpunnen@pop-os:~$ nvidia-smi
Mon May 17 19:46:54 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce MX250       Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   55C    P0    N/A /  N/A |    260MiB /  2002MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       900      G   /usr/lib/xorg/Xorg                 45MiB |
|    0   N/A  N/A     14737      G   /usr/lib/xorg/Xorg                141MiB |
|    0   N/A  N/A     14867      G   /usr/bin/gnome-shell               24MiB |
|    0   N/A  N/A     18812      G   ...AAAAAAAAA= --shared-files       40MiB |
+-----------------------------------------------------------------------------+

Should we install nvidia-docker2? I was not able to install it:

alexpunnen@pop-os:~$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
>   sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Unsupported distribution!
# Check https://nvidia.github.io/nvidia-docker

alexpunnen@pop-os:~$ sudo apt-get install nvidia-docker2
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package nvidia-docker2

alexpunnen@pop-os:~$ cat /etc/os-release
NAME="Pop!_OS"
VERSION="20.04 LTS"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 20.04 LTS"
VERSION_ID="20.04"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
LOGO=distributor-logo-pop-os


alexcpn · 2021-05-17T16:27:25Z

I set distribution=ubuntu20.04 and tried to install again, but it failed on an unmet dependency:

distribution=ubuntu20.04
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
>   sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get install nvidia-docker2
... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-docker2 : Depends: nvidia-container-runtime (>= 3.5.0) but 3.4.0-1pop1~1601325114~20.04~2880fc6 is to be installed
E: Unable to correct problems, you have held broken packages.
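
The distribution value set by hand above could also be derived from /etc/os-release. This is a minimal sketch, not something from NVIDIA or tensorman: nvidia_distribution is a hypothetical helper, and mapping ID=pop to the matching ubuntu release is an assumption based on the ID_LIKE and VERSION_ID fields shown in the os-release output earlier. If $distribution is built from ID and VERSION_ID alone, Pop!_OS would yield pop20.04, which may be what triggered the "Unsupported distribution!" reply.

```shell
#!/bin/sh
# Hypothetical helper: print a repository "distribution" string for the
# given os-release file (defaults to /etc/os-release).
# Assumption: Pop!_OS (ID=pop) can use the Ubuntu repo with the same VERSION_ID.
nvidia_distribution() {
    (
        # Source the file in a subshell so its variables do not leak out.
        . "${1:-/etc/os-release}"
        case "$ID" in
            pop) echo "ubuntu${VERSION_ID}" ;;
            *)   echo "${ID}${VERSION_ID}" ;;
        esac
    )
}

# Print the value for this machine, if the file is readable.
if [ -r /etc/os-release ]; then
    nvidia_distribution
fi
```

The result can then be assigned with distribution=$(nvidia_distribution) before running the curl command.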

This looks like NVIDIA/nvidia-docker#1388 (comment). I followed the workaround given in NVIDIA/nvidia-docker#1388 (comment) and was able to install nvidia-docker2, and at least CUDA in Docker is now working. But tensorman still gives the same error:

sudo docker run --rm --runtime=nvidia -ti nvidia/cuda:11.3.0-base-ubuntu20.04
Unable to find image 'nvidia/cuda:11.3.0-base-ubuntu20.04' locally
11.3.0-base-ubuntu20.04: Pulling from nvidia/cuda
a70d879fa598: Pull complete 
c4394a92d1f8: Pull complete 
10e6159c56c0: Pull complete 
f1ff119ac131: Pull complete 
3e2dbc551fee: Pull complete 
4f57fe919a49: Pull complete 
216bbbf373ef: Pull complete 
Digest: sha256:7939995fc912a21e62be16c866b62e14d383ef16ed288f1d17268ba0b7226574
Status: Downloaded newer image for nvidia/cuda:11.3.0-base-ubuntu20.04
root@159db2b73ded:/# nvidia-smi
Mon May 17 17:07:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce MX250       Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   54C    P0    N/A /  N/A |    292MiB /  2002MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

alexpunnen@pop-os:~$ sudo tensorman run --gpu bash
"docker" "run" "-u" "0:0" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/home/alexpunnen:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu" "bash"
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Aarivvk · 2022-01-02T11:45:09Z

This issue is keeping me from upgrading to 21.10.
Please fix it, or give us a workaround until then.

Aarivvk · 2022-01-03T10:28:54Z

NVIDIA/nvidia-docker#1447 (comment)
It looks like this could solve the issue.
I'll try it and let you know.

Found your comment there as well 👍 @alexcpn
NVIDIA/nvidia-docker#1447 (comment)

Aarivvk · 2022-01-03T12:05:37Z

The solution below works for me.
Open the GRUB defaults file:
sudo gedit /etc/default/grub
Append systemd.unified_cgroup_hierarchy=0 to the end of GRUB_CMDLINE_LINUX_DEFAULT, for example:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"
Save the file and update GRUB:
sudo update-grub
Finally, reboot. Docker should then work with GPUs.

NOTE1: This change to cgroups should be handled with care. Apps that depend on cgroups may stop working. More info here: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/ch01
NOTE2: There are other ways to fix this. The links above describe changing the Docker config and sharing a few devices with the container, which also solves the issue. The GRUB change is simply the easier fix.
NOTE3: I personally would prefer the approach in NOTE2.
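
The manual GRUB edit above can also be scripted. The following is a minimal sketch rather than a tested installer: add_grub_param is a hypothetical helper name, and it assumes GRUB_CMDLINE_LINUX_DEFAULT sits on a single line with a double-quoted value, as in the example above. It only reads stdin and writes stdout, so nothing on the system changes until the result has been reviewed and copied into place.

```shell
#!/bin/sh
# Hypothetical helper: read a GRUB defaults file on stdin and write it to
# stdout with the given kernel parameter appended to
# GRUB_CMDLINE_LINUX_DEFAULT, unless that line already contains it.
add_grub_param() {
    param="$1"
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            GRUB_CMDLINE_LINUX_DEFAULT=*"$param"*)
                # Already present: leave the line unchanged.
                printf '%s\n' "$line" ;;
            GRUB_CMDLINE_LINUX_DEFAULT=\"*\")
                # Drop the closing quote, append the parameter, re-close.
                printf '%s %s"\n' "${line%\"}" "$param" ;;
            *)
                printf '%s\n' "$line" ;;
        esac
    done
}

# Example (review the diff before installing the new file):
#   add_grub_param systemd.unified_cgroup_hierarchy=0 < /etc/default/grub > /tmp/grub.new
#   diff /etc/default/grub /tmp/grub.new
#   sudo cp /tmp/grub.new /etc/default/grub && sudo update-grub
```

Because the helper skips lines that already contain the parameter, running it twice leaves the file unchanged.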

mmstick · 2022-01-03T19:44:51Z

I'm working on it

mmstick · 2022-01-05T18:09:42Z

Fixed, but it will still require systemd.unified_cgroup_hierarchy=0 to be added as a kernel option, at least until NVIDIA releases v1.8.0 of their container runtime tools. On EFI systems, run sudo kernelstub -a systemd.unified_cgroup_hierarchy=0.

Updates will be available on Impish soon, with a new nvidia-docker2 package that replaces nvidia-container-runtime.
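
After rebooting, it may help to confirm that the kernel option above actually took effect. This is a minimal sketch under a few assumptions: it relies on GNU stat, which reports cgroup2fs for /sys/fs/cgroup on the unified (v2) hierarchy and tmpfs on the legacy or hybrid (v1) layout, and the helper names cgroup_mode and has_kernel_option are hypothetical.

```shell
#!/bin/sh
# Classify the filesystem type reported for /sys/fs/cgroup.
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# Hypothetical helper: succeed if the kernel command line in $1
# contains the exact option in $2 as a whole word.
has_kernel_option() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report which cgroup hierarchy is mounted.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || true)
echo "cgroup hierarchy: $(cgroup_mode "$fstype")"

# Report whether the running kernel was booted with the option.
if [ -r /proc/cmdline ] && has_kernel_option "$(cat /proc/cmdline)" systemd.unified_cgroup_hierarchy=0; then
    echo "kernel option is set"
else
    echo "kernel option is not set"
fi
```

With the workaround applied, the first line should report v1 and the second should report that the option is set.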

mmstick closed this as completed Jan 5, 2022
