Configuring docker in dagobah #116

zxcwangcui · 2022-04-06T02:15:10Z

My experiments require a Docker environment, so Docker needs to be configured on dagobah.

gqqnbig · 2022-04-08T15:37:49Z

no objection

gqqnbig · 2022-04-10T06:47:00Z

I have enabled rootless Docker on the system side.

You now need to enable rootless Docker for your own account. Follow the instructions at https://docs.docker.com/engine/security/rootless/

gqqnbig · 2022-04-10T07:25:37Z

export XDG_RUNTIME_DIR=/home/wangcui/.docker/run
export PATH=/usr/bin:$PATH
export DOCKER_HOST=unix:///home/wangcui/.docker/run/docker.sock
rm -rf $XDG_RUNTIME_DIR
mkdir -p $XDG_RUNTIME_DIR

PATH=/usr/bin:/sbin:/usr/sbin:$PATH dockerd-rootless.sh >/dev/null 2>&1 &
docker run hello-world
docker run -it ubuntu bash
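
One caveat with the commands above: dockerd-rootless.sh is started in the background, so the first docker run may race the daemon start-up. A minimal sketch of a wait loop, assuming the daemon creates its socket at the path used in DOCKER_HOST. wait_for_path is a hypothetical helper, not part of Docker.

```shell
# Wait until a path exists, polling once per second.
# $1 = path to wait for, $2 = maximum number of attempts (default 30).
# Returns 0 as soon as the path exists, 1 if it never appears.
wait_for_path() {
    path="$1"
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        [ -e "$path" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Usage after starting the daemon in the background (not run here):
#   wait_for_path "$XDG_RUNTIME_DIR/docker.sock" 30 && docker run hello-world
```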

gqqnbig · 2022-04-10T07:49:18Z

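Please check whether there are any problems, and reply to me if there are none. We need to make the installation steps permanent, so that we don't end up like NVIDIA/nvidia-docker#22, where it was installed but was gone again a few months later.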

zxcwangcui · 2022-04-10T11:42:09Z

rm -rf $XDG_RUNTIME_DIR
mkdir -p $XDG_RUNTIME_DIR

gqqnbig · 2022-04-11T02:45:57Z

First run

loginctl enable-linger $USER
dockerd-rootless-setuptool.sh install

If dockerd-rootless-setuptool.sh install succeeds, it prints two lines of the form export PATH=??? and export DOCKER_HOST=???. Note these two lines down; you will need them later.

Every run

loginctl enable-linger $USER
export PATH=???
export DOCKER_HOST=???
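
To avoid copying the two lines by hand, they could be pulled out of the captured installer output. This is only a sketch: it assumes the lines start with the word export, and extract_docker_exports and the file names in the usage comment are hypothetical.

```shell
# Print only the `export PATH=...` and `export DOCKER_HOST=...` lines
# from text on stdin (assumed: captured output of the setup tool).
# Leading whitespace is stripped so the result can be sourced directly.
extract_docker_exports() {
    grep -E '^[[:space:]]*export (PATH|DOCKER_HOST)=' | sed -E 's/^[[:space:]]+//'
}

# Hypothetical usage (file names are placeholders, not run here):
#   dockerd-rootless-setuptool.sh install 2>&1 | tee "$HOME/rootless-install.log"
#   extract_docker_exports < "$HOME/rootless-install.log" > "$HOME/.docker-rootless-env"
#   . "$HOME/.docker-rootless-env"    # in every new shell or job script
```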

zxcwangcui · 2022-04-13T04:30:42Z

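The GPU is not available in Docker.

Using the GPU inside a container fails. To reproduce: first log in to the dagobah server, then check GPU usage; one GPU is currently available. The problem is that docker run in a new container cannot use the GPU.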

sit -g
nvidia-smi
docker run --gpus all  hello-world
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled

gqqnbig · 2022-04-13T08:55:19Z

Admins

Pay attention to this passage in https://docs.docker.com/engine/reference/commandline/run/#access-an-nvidia-gpu:

"First you need to install nvidia-container-runtime. Visit Specify a container's resources for more information."

It looks like we have to follow https://github.com/NVIDIA/nvidia-container-runtime#ubuntu-distributions

sudo apt-get install nvidia-container-runtime

Set cgroup=false as described in moby/moby#38729 (comment),

then restart the (personal) Docker daemon.

Users

Please try

docker run -it --rm --gpus "\"device=$CUDA_VISIBLE_DEVICES\"" ubuntu nvidia-smi
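
The nested quoting in that command is easy to get wrong, so here is a small sketch that builds the same value. gpus_arg is a hypothetical helper; it assumes Slurm has set CUDA_VISIBLE_DEVICES and refuses to produce a value when the variable is empty.

```shell
# Build the value for docker's --gpus flag from CUDA_VISIBLE_DEVICES.
# Prints nothing and returns 1 when the variable is unset or empty, so a
# caller does not request GPUs it was never allocated.
gpus_arg() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "CUDA_VISIBLE_DEVICES is empty; no GPU was allocated" >&2
        return 1
    fi
    # The inner double quotes are part of the value: docker parses --gpus
    # as comma-separated fields, so a list such as 0,1 must stay quoted.
    printf '"device=%s"\n' "$CUDA_VISIBLE_DEVICES"
}

# Usage inside a Slurm job (not run here):
#   arg="$(gpus_arg)" && docker run -it --rm --gpus "$arg" ubuntu nvidia-smi
```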

zxcwangcui · 2022-04-14T08:35:12Z

Using the Docker environment from an sbatch file

#!/bin/bash
#---------------------------------------
# Set the basic node information
#---------------------------------------
#SBATCH --job-name=trackformer
#SBATCH --nodes=1
#SBATCH --output=trackformer.out
#SBATCH --gres=gpu:1
#SBATCH --nice
#SBATCH --partition=normal
#SBATCH --nodelist=dagobah
#SBATCH --time=3-23:59:59

#---------------------------------------
# First launch on the server: fill in your own path information
#---------------------------------------
loginctl enable-linger $USER
export PATH=/usr/bin:$PATH
export DOCKER_HOST=unix:///run/user/5004/docker.sock

#---------------------------------------
# Start the container first, then run the program.
#---------------------------------------
docker start trackformer_container
docker container exec -w /shared_disk/trackformer/ trackformer_container /root/miniconda3/envs/trackformer/bin/python /shared_disk/trackformer/src/track.py with reid obj_detect_checkpoint_file=models/mots20_train_masks/checkpoint.pth
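
The script hard-codes user id 5004 in DOCKER_HOST, so other users cannot reuse it unchanged. A minimal sketch that derives the value instead, assuming the rootless daemon socket always lives at /run/user/<uid>/docker.sock as it does in the script. rootless_docker_host is a hypothetical helper name.

```shell
# Print the rootless Docker socket URL for a numeric user id.
# $1 = user id (defaults to the id of the calling user).
# Assumes the socket path pattern /run/user/<uid>/docker.sock.
rootless_docker_host() {
    target_uid="${1:-$(id -u)}"
    printf 'unix:///run/user/%s/docker.sock\n' "$target_uid"
}

# Usage in a job script (not run here):
#   export DOCKER_HOST="$(rootless_docker_host)"
```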

gqqnbig · 2022-04-16T13:43:47Z

change cgroup to v2 https://docs.docker.com/config/containers/runmetrics/#changing-cgroup-version

gqqnbig · 2022-05-09T14:43:33Z

Slurm does not support cgroup v2.

$ sudo slurmd -D
slurmd: Message aggregation disabled
slurmd: error: Unable to resolve "aloha": Host name lookup failure
slurmd: error: unable to mount cpuset cgroup namespace: Device or resource busy
slurmd: error: task/cgroup: unable to create cpuset namespace
slurmd: error: Couldn't load specified plugin name for task/cgroup: Plugin init() callback failed
slurmd: error: cannot create task context for task/cgroup
slurmd: error: slurmd initialization failed
$ grep cgroup /proc/filesystems
nodev   cgroup
nodev   cgroup2
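
Note that /proc/filesystems only lists what the kernel supports, not which hierarchy is actually mounted. A common check is the filesystem type of /sys/fs/cgroup; the sketch below maps that type to a label. cgroup_version_from_fstype is a hypothetical helper, and "v1" here also covers hybrid setups.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup version label.
# cgroup2fs means the unified (v2) hierarchy is mounted there;
# tmpfs means v1 (or a hybrid layout) is in use.
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# Usage on a Linux host (not run here):
#   cgroup_version_from_fstype "$(stat -fc %T /sys/fs/cgroup)"
```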

gqqnbig · 2022-05-09T17:06:22Z

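Rootless Docker itself can be installed. But to add GPU support, we must either bypass cgroup v1 or use cgroup v2.

Slurm must use cgroup v1.

So far, even leaving Slurm aside, I have not found a way to make cgroup v2 + rootless Docker + GPU work. Others have not managed it either: NVIDIA/nvidia-container-toolkit#189

If we keep cgroup v1 and use Docker inside a Slurm job, Docker bypasses cgroup, i.e. Docker can break out of Slurm's limits and use any GPU. This causes an obvious problem: a user requests 1 GPU but sees 10 GPUs inside Docker, and does not know which one to use. I don't know whether Docker also breaks through the CPU limit.

One approach that relies on users' self-discipline is to use the environment variable CUDA_VISIBLE_DEVICES, which Slurm sets.

With docker run -it --rm --gpus "\"device=$CUDA_VISIBLE_DEVICES\"" ubuntu nvidia-smi, Docker sees only the GPUs it is supposed to see.

The weaknesses of this approach are also obvious. One of them: what if the container is not started with docker run?

Options

I'm fine with rootless docker, but the path to gpu support is clouded.

1. Do not support Docker.
2. Support rootless Docker only, with no GPU access.
3. Rootless Docker + bypass cgroup. Rely on users' self-discipline not to occupy extra GPUs. I think this will increase support requests, especially when user 1 occupies user 2's GPU, user 2's program fails with a CUDA out-of-memory error, and user 2 comes asking why their program died.
4. Rootless Docker + bypass cgroup + wrapper commands. Provide some wrapper commands, e.g. run sit-docker on aha to drop straight into a CLI inside Docker, or run sbatch-docker on aha to run a Docker image directly. <-- Nice, but who has the time?
5. Rootful Docker. Rootful Docker is traditional Docker, which itself has root privileges: being root inside Docker is equivalent to being root outside Docker. Using the GPU with rootful Docker should be no problem.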

Lu-233 · 2022-05-10T01:39:44Z

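Option 1 is unacceptable to some users.

Option 3 looks feasible, but users of Docker must be thoroughly informed; I think similar situations have already happened.

In both of my jobs on dagobah, $CUDA_VISIBLE_DEVICES is 0.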

gqqnbig · 2022-05-10T03:17:27Z

> In both of my jobs on dagobah, $CUDA_VISIBLE_DEVICES is 0.

I used srun --pty -w dagobah --gres=gpu:2 $SHELL -i. On closer inspection, though, CUDA_VISIBLE_DEVICES does not work either, because CUDA_VISIBLE_DEVICES starts from 0 in every job.
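
One untested idea for this: pass GPU UUIDs instead of indices, since UUIDs do not restart from 0 per job. The sketch below assumes Slurm constrains devices so that nvidia-smi inside the job lists only the allocated GPUs; nobody in this thread has verified that on dagobah. uuids_to_gpus_arg is a hypothetical helper.

```shell
# Join GPU UUIDs (one per line on stdin) into a value for docker's
# --gpus flag. Lines that do not start with "GPU-" are ignored.
# Returns 1 and prints nothing when no UUID was found.
uuids_to_gpus_arg() {
    ids="$(grep -E '^GPU-' | paste -s -d, -)"
    [ -n "$ids" ] || return 1
    # Inner double quotes keep the comma-separated list as one field.
    printf '"device=%s"\n' "$ids"
}

# Usage inside a Slurm job (not run here):
#   arg="$(nvidia-smi --query-gpu=uuid --format=csv,noheader | uuids_to_gpus_arg)" \
#       && docker run -it --rm --gpus "$arg" ubuntu nvidia-smi
```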

gqqnbig · 2022-05-10T03:18:25Z

If we decide not to support Docker, then given that NVIDIA/nvidia-docker#22 has already failed once, I suggest we do not touch Docker for the next 3 years.

wulamao · 2022-05-10T03:49:13Z


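Based on how current users work, we tentatively summarize their purpose in using Docker as follows:
To safely install different versions of build environments with su (root) privileges, covering compilation needs including C++ builds and CUDA development.
Considering the actual maintenance workload, option 2 is probably the preferred choice among options 1-5 above for now, i.e. using Docker only for compilation and not for training models.
In my limited personal experience, option 2 is workable in most cases. This conclusion is for reference only; please share your own comments.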

gqqnbig · 2022-05-10T04:52:28Z

Could you test whether rootless Docker without GPU support helps with your work?

gqqnbig · 2022-08-25T08:47:33Z

Not directly related, but rootful Docker can crash the whole system because of bugs in whatever runs inside the container. Recovering from such a crash requires physical access to the server room.

gqqnbig · 2022-08-26T06:56:30Z

Going to close this on 9/26.

gqqnbig self-assigned this Apr 10, 2022

gqqnbig assigned zxcwangcui and unassigned gqqnbig Apr 10, 2022

gqqnbig assigned gqqnbig and unassigned zxcwangcui Apr 11, 2022

gqqnbig assigned zxcwangcui and unassigned gqqnbig Apr 11, 2022

zxcwangcui removed their assignment Apr 13, 2022

gqqnbig self-assigned this Apr 13, 2022

gqqnbig assigned wulamao and unassigned gqqnbig May 9, 2022

wulamao assigned gqqnbig and unassigned wulamao May 10, 2022

gqqnbig assigned zxcwangcui and unassigned gqqnbig May 10, 2022

gqqnbig mentioned this issue Sep 3, 2022

Try to implement self-hosted runners #86

Closed

gqqnbig closed this as not planned Sep 26, 2022
