Configuring docker in dagobah #116
Comments
no objection |
Rootless docker is now enabled on the system side. You still need to enable it on your personal side; follow the instructions at https://docs.docker.com/engine/security/rootless/ |
export XDG_RUNTIME_DIR=/home/wangcui/.docker/run
export PATH=/usr/bin:$PATH
export DOCKER_HOST=unix:///home/wangcui/.docker/run/docker.sock
rm -rf $XDG_RUNTIME_DIR
mkdir -p $XDG_RUNTIME_DIR
PATH=/usr/bin:/sbin:/usr/sbin:$PATH dockerd-rootless.sh >/dev/null 2>&1 &
docker run hello-world
docker run -it ubuntu bash |
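The commands above hard-code one user's home directory. A minimal sketch of the same steps for any user follows, assuming the `$HOME/.docker/run` layout from that comment applies to everyone. The function names `rootless_docker_env` and `start_rootless_docker` are hypothetical, not part of docker.

```shell
#!/bin/sh
# Sketch only: generalizes the per-user commands in this thread.
# Function names are hypothetical helpers, not docker commands.

# Export the variables the docker client and dockerd-rootless.sh need.
# $1: home directory (defaults to $HOME).
rootless_docker_env() {
    rd_home="${1:-$HOME}"
    export XDG_RUNTIME_DIR="$rd_home/.docker/run"
    export DOCKER_HOST="unix://$XDG_RUNTIME_DIR/docker.sock"
    export PATH="/usr/bin:/sbin:/usr/sbin:$PATH"
}

# Recreate the runtime directory and start the daemon in the background.
start_rootless_docker() {
    rootless_docker_env "$@"
    # ${VAR:?} aborts if the variable is empty, so rm -rf never runs on "".
    rm -rf -- "${XDG_RUNTIME_DIR:?}"
    mkdir -p -- "$XDG_RUNTIME_DIR"
    dockerd-rootless.sh >/dev/null 2>&1 &
}
```

Calling `start_rootless_docker` once per login and then `docker run hello-world` would mirror the original sequence. Putting only `rootless_docker_env` in a shell startup file keeps the client pointed at the right socket without restarting the daemon.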
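Please check whether there are any problems, and reply to me if there are none. We need to make the installation steps permanent, so that this does not end up like NVIDIA/nvidia-docker#22, where it was installed and then gone again a few months later. |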
rm -rf $XDG_RUNTIME_DIR |
Run once, the first time:
loginctl enable-linger $USER
dockerd-rootless-setuptool.sh install
Run every time:
loginctl enable-linger $USER
export PATH=???
export DOCKER_HOST=??? |
Using the GPU inside a container fails; the steps to reproduce are below. First log in to the dagobah server, then check GPU usage: one GPU is currently available. The problem: docker run in a new container cannot use the GPU.
sit -g
nvidia-smi
docker run --gpus all hello-world
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled |
Admins: pay attention to the GPU section in https://docs.docker.com/engine/reference/commandline/run/#access-an-nvidia-gpu. It looks like we have to follow https://github.com/NVIDIA/nvidia-container-runtime#ubuntu-distributions:
sudo apt-get install nvidia-container-runtime
set cgroup=false, see moby/moby#38729 (comment)
then restart the (personal) docker.
Users: Please try
|
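One thing worth checking before and after installing the runtime: to my understanding, docker only offers the `gpu` capability when the daemon can find the NVIDIA hook binary on its own PATH, and the "could not select device driver" error is what appears when it cannot. The sketch below tests a given PATH for that binary. The helper name `gpu_hook_on_path` is hypothetical, and the default binary name `nvidia-container-runtime-hook` is an assumption that may differ between package versions.

```shell
#!/bin/sh
# Sketch only: check whether the NVIDIA hook is visible on a given PATH.
# gpu_hook_on_path is a hypothetical helper, not part of docker.
# $1: hook binary name (assumed default: nvidia-container-runtime-hook)
# $2: PATH to search (default: the current PATH)
gpu_hook_on_path() {
    hook="${1:-nvidia-container-runtime-hook}"
    search_path="${2:-$PATH}"
    # Run the lookup in a subshell so the caller's PATH is left untouched.
    ( PATH="$search_path"; command -v "$hook" >/dev/null 2>&1 )
}
```

For rootless docker the relevant PATH is the one `dockerd-rootless.sh` was started with, so `gpu_hook_on_path "" "/usr/bin:/sbin:/usr/sbin"` is closer to what the daemon sees than a check against an interactive shell's PATH.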
Using the docker environment from an sbatch file:
# Set the basic node information
# When starting the server for the first time, enter your own file path information
# Start the server first, then run the program |
change cgroup to v2 https://docs.docker.com/config/containers/runmetrics/#changing-cgroup-version |
Slurm does not support cgroup v2.
$ sudo slurmd -D
slurmd: Message aggregation disabled
slurmd: error: Unable to resolve "aloha": Host name lookup failure
slurmd: error: unable to mount cpuset cgroup namespace: Device or resource busy
slurmd: error: task/cgroup: unable to create cpuset namespace
slurmd: error: Couldn't load specified plugin name for task/cgroup: Plugin init() callback failed
slurmd: error: cannot create task context for task/cgroup
slurmd: error: slurmd initialization failed
$ grep cgroup /proc/filesystems
nodev cgroup
nodev cgroup2 |
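Note that `/proc/filesystems` listing both `cgroup` and `cgroup2` only shows what the kernel supports, not which hierarchy is mounted. The Docker documentation linked above suggests checking for `/sys/fs/cgroup/cgroup.controllers`; a small sketch of that check follows. The helper name `cgroup_version` is hypothetical.

```shell
#!/bin/sh
# Sketch only: report which cgroup hierarchy is mounted.
# cgroup_version is a hypothetical helper.
# $1: cgroup mount point (default: /sys/fs/cgroup)
cgroup_version() {
    cg_root="${1:-/sys/fs/cgroup}"
    # cgroup.controllers exists at the root only on the unified (v2) hierarchy.
    if [ -f "$cg_root/cgroup.controllers" ]; then
        echo v2
    else
        echo v1
    fi
}
```

Running `cgroup_version` on the node before and after switching the kernel parameter shows whether the change took effect.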
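Rootless docker itself can be installed. But adding GPU support requires either bypassing cgroup v1 or using cgroup v2. Slurm must use cgroup v1.
So far, even setting Slurm aside, I have not found a way to make cgroup v2 + rootless docker + GPU work. Neither have others: NVIDIA/nvidia-container-toolkit#189
If we keep cgroup v1 and run docker inside a Slurm job, docker bypasses cgroup, i.e. docker can break out of Slurm's limits and use any GPU. This causes an obvious problem: a user requests 1 GPU, but sees 10 GPUs inside docker and does not know which one to use. I do not know whether docker breaks out of the CPU limit as well.
A method that relies on self-discipline is to use the environment variable CUDA_VISIBLE_DEVICES, which Slurm sets. If using [...] The weaknesses of this method are also obvious. One of them is that, unless [...] is used [...]
Options [...]
I'm fine with rootless docker, but the path to GPU support is unclear.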
|
Option 3 looks feasible, but users of docker must be fully informed; I think a similar situation has already happened. In my two jobs on dagobah, $CUDA_VISIBLE_DEVICES is 0 in both. |
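If two jobs both see `$CUDA_VISIBLE_DEVICES` as 0, one possible reason is that the index is relative to each job's own device view, in which case passing it to a docker that bypasses cgroup would point both jobs at the same physical GPU. A hedged alternative is to pass GPU UUIDs instead of indices. The sketch below assumes that `nvidia-smi` inside the Slurm job lists only the allocated GPUs, which has not been verified on dagobah; `gpus_arg_from_uuids` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: build a value for `docker run --gpus` from GPU UUIDs.
# gpus_arg_from_uuids is a hypothetical helper; it reads one UUID per line
# on stdin and prints "device=UUID1,UUID2" including the double quotes,
# which docker needs when several devices are listed.
gpus_arg_from_uuids() {
    uuids=$(paste -sd, -)
    # Fail if no UUID was given, so the caller does not start a container
    # with an empty device list.
    [ -n "$uuids" ] || return 1
    printf '"device=%s"\n' "$uuids"
}

# Intended use inside a Slurm job (not executed here):
#   arg=$(nvidia-smi --query-gpu=uuid --format=csv,noheader | gpus_arg_from_uuids) &&
#   docker run --rm --gpus "$arg" ubuntu nvidia-smi
```

This still depends on users passing the argument voluntarily; it does not stop a container from requesting `--gpus all`.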
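If we decide not to support docker, then given that NVIDIA/nvidia-docker#22 has already failed once, I suggest we do not touch docker for the next 3 years. |
Based on how existing users work, we tentatively summarize the purposes for which users need docker as follows: |
Could you test whether rootless docker without GPU support can help your work? |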
Not directly related, but rootful docker can crash the whole system due to bugs in whatever the container runs. Recovering from such a crash requires server room access. |
going to close on 9/26 |
The experimental environment is required and needs to be configured on dagobah