
Issue mounting memory cgroup in Docker #28

Closed
boucher opened this issue Sep 10, 2015 · 23 comments
@boucher

boucher commented Sep 10, 2015

I'm passing along a bug reported to me via email. Let me know if you'd prefer sending this type of thing to the mailing list in the future.


My configuration:
Linux ismael 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3
(2015-08-04) x86_64 GNU/Linux

I am using the latest compiled release of Docker Experimental
v1.9.0 with the compiled version of CRIU v1.6 and libprotobuf:
https://github.com/boucher/docker/releases

I start the daemon:
sudo ./docker-1.9.0-dev daemon

then a container:
docker run -d --publish-service test busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'

the checkpoint goes well:
sudo ./docker-1.9.0-dev checkpoint --image-dir=/home/ismael/images --work-dir=/home/ismael/logs <container id>

but the restore part fails:
sudo ./docker-1.9.0-dev restore --image-dir=/home/ismael/images --work-dir=/home/ismael/logs <container id>

It outputs:
Error response from daemon: Cannot restore container 7cbe526dbef8d71c98ea7b365ebc1a12325ddc300216a6697e37831a8f56af44: criu failed: type RESTORE errno 0

dump log: https://gist.github.com/boucher/d6803d75a14606bd95fb
restore log: https://gist.github.com/boucher/ab66916846d27c73d361

@avagin
Member

avagin commented Sep 10, 2015

I remember that the memory controller is disabled by default in Debian

@avagin
Member

avagin commented Sep 10, 2015

This issue is reproduced if a kernel is booted with cgroup_disable=memory

@avagin
Member

avagin commented Sep 10, 2015

CRIU parses /proc/cgroups to collect controllers
[root@fc22-vm criu]# cat /proc/self/cgroup | grep memory
[root@fc22-vm criu]# cat /proc/cgroups | grep memor
memory 0 1 0

Disabled controllers still show up in /proc/cgroups (with enabled = 0), but not in /proc/self/cgroup.
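The distinction can be sketched like this (the /proc/cgroups sample below is hypothetical; the 4th column is the controller's "enabled" flag, which is 0 when a controller is disabled, e.g. via cgroup_disable=memory on the kernel command line):

```shell
# Hypothetical /proc/cgroups content from a machine booted with
# cgroup_disable=memory: the memory controller is listed but disabled.
sample='#subsys_name hierarchy num_cgroups enabled
cpuset 5 63 1
memory 0 1 0'

# Keep only controllers whose "enabled" column is 1 -- roughly what CRIU
# gets for free by reading /proc/self/cgroup instead of /proc/cgroups.
enabled=$(printf '%s\n' "$sample" | awk '!/^#/ && $4 == 1 { print $1 }')
echo "$enabled"
```

A controller filtered this way can actually be mounted; one taken straight from /proc/cgroups may not be, which is exactly the failure mode above.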

@eliams

eliams commented Sep 11, 2015

@boucher Thanks for the forward.
@avagin Thanks for the reply.
Before reading it, I switched to a 4.0.0 kernel and it worked. I checked, and the cgroups are indeed enabled.
ismael@debian:~$ cat /proc/cgroups | grep memor
memory 5 63 1

However, I have run into other issues; I'll explain them here, but tell me if I should open a new issue.
When I checkpoint a container, I can kill the daemon, restart it, and restore the container.
However, it only works the first time I kill the daemon; on the second try the restore fails.

I'm working on: Linux debian 4.0.0+ #2 SMP Fri Sep 11 14:27:32 CEST 2015 x86_64 GNU/Linux
criu version: 1.7 build from sources
I use aufs as storage driver for the daemon
I start the container with:
./docker run -d busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'

First daemon logs until I kill it: http://pastebin.com/3zneGWDp
Second daemon logs; the container status is checkpointed. I restored the container, then checkpointed it again: http://pastebin.com/hRqMJ6SU
The last daemon logs; this time it decided to kill the container.
I don't know why, and the container status is exited, which leads the restore to fail: http://pastebin.com/gfrutcrV
Output of the restore command:
Error response from daemon: Container a740fc4a4d9f is not checkpointed
Error: failed to restore one or more containers

I don't have any restore logs since the daemon killed the container when it was restarted for the second time and I can't figure out why (I can do multiple C/R with the same daemon session).

The second issue is quite similar: it happens when I checkpoint the container, then kill the daemon and reboot my computer.
I can't restore the container (even the first time).
When I reboot and restart the daemon, the container is detected as checkpointed and the daemon doesn't kill it, but the restore part fails:
Error response from daemon: Cannot restore container b5fb9575e133: [2] Container does not exist: open /var/run/docker/execdriver/native/b5fb9575e133740dca720556cef55142e57bebae743fda5ae26883a3015c0c3d/state.json: no such file or directory
Error: failed to restore one or more containers
logs of the daemon before the reboot http://pastebin.com/30SxLi6N
and after http://pastebin.com/RF560X94.

The folder named after the container-id in /var/run/docker/execdriver/native is missing.
I tried to save it before the reboot and copy it back after starting the daemon, but it wouldn't work.
The folder contains an empty file named checkpoint and a state.json whose content is http://pastebin.com/qapNJhwg
The folder named after the container-id in /var/lib/docker/containers, however, is still there.

Sorry for my English, it's not my first language.

@boucher
Author

boucher commented Sep 11, 2015

Are you still using the precompiled version of Docker posted to my releases page? Are you able to try building the latest version of my branch? I believe one of the issues fixed by @huikang is related to a bug with storing the container state after checkpointing.


@huikang
Contributor

huikang commented Sep 11, 2015

@boucher @eliams I think you're talking about this PR
boucher/docker#14
The latest cr-combined branch from @boucher has already merged it.

@eliams

eliams commented Sep 14, 2015

Sorry for the long response time, I didn't find time over the weekend to test.
I tried to build the cr-combined and test-huikang-fix branches with:
make build && make binary
but the resulting binaries do not have the checkpoint and restore commands.
What did I miss?

@klesgidisold

You have to build with the docker experimental flag:

make DOCKER_EXPERIMENTAL=1 build && make DOCKER_EXPERIMENTAL=1 binary

@eliams

eliams commented Sep 14, 2015

@klesgidis Thanks!

So I tried both and ended up with this:

The test-huikang-fix branch:

The restore part fails if I restart the daemon before restoring.

daemon logs before restart: http://pastebin.com/nuDwmFKw
daemon logs after restart: http://pastebin.com/JTb3Ev7v
restore command output:
Error response from daemon: Cannot restore container f728e5363b70: [2] Container does not exist: open /var/run/docker/execdriver/native/f728e5363b7072114ff3eeea1976fa46190c43c73a39c95fe34cf303fa296a2a/state.json: no such file or directory
Error: failed to restore one or more containers

The cr-combined branch:

C/R is working well, and I was able to restart the daemon before each restore without trouble.
However, the second issue is still here: I can't reboot my computer before restoring.

daemon logs before reboot: http://pastebin.com/gPt3kpPk
daemon logs after reboot: http://pastebin.com/Gbnj1uit
restore command output:
Error response from daemon: Cannot restore container 59673a43af6f: [2] Container does not exist: open /var/run/docker/execdriver/native/59673a43af6f4416d30c172ca0fd7916dda97b557cc93d7489f1d318175e2819/state.json: no such file or directory
Error: failed to restore one or more containers

It looks like it's still the /var/run/docker folder that is deleted when rebooting.
I tried to save and restore it myself before restarting the daemon, but it didn't work.

daemon logs: http://pastebin.com/LvAgh0kL
restore logs: http://pastebin.com/RLDiBAQB
restore command output:
Error response from daemon: Cannot restore container 59673a43af6f: criu failed: type RESTORE errno 0
Error: failed to restore one or more containers

@huikang
Contributor

huikang commented Sep 14, 2015

@eliams Did you reboot the machine or just the docker daemon before restoring the checkpointed container?

Which CRIU version did you use?

@eliams

eliams commented Sep 15, 2015

@huikang When I only stop and restart the daemon it works; it's when I also reboot the machine that I'm unable to restart the container.

EDIT: criu 1.7 built from sources

EDIT-bis:
I noticed that each time I reboot my machine with a checkpointed container, in addition to not being able to restore that container, I can't start new ones:
$ docker run -d busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 2; done'
Error response from daemon: Cannot start container d0cb12c269345b6cd26a2deae6597042e2af4ada227cd87b904fa6cb94769b19: [8] System error: write /sys/fs/cgroup/cpuset/system.slice/docker-d0cb12c269345b6cd26a2deae6597042e2af4ada227cd87b904fa6cb94769b19.scope/cgroup.procs: no space left on device

daemon logs: http://pastebin.com/mAu1eexQ
To solve this, I just need to do:
echo 0 > /sys/fs/cgroup/cpuset/system.slice/cpuset.mems
I figured this may be helpful information, since it is linked to the fact that restoring won't work if I reboot my machine.
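The workaround above can be sketched with a temp directory standing in for the real cgroupfs (assumption: the real paths are under /sys/fs/cgroup/cpuset, and adding a task to a cpuset whose cpuset.mems is empty is what produces the ENOSPC "no space left on device" error):

```shell
# Simulated cgroupfs layout; in reality these files live under
# /sys/fs/cgroup/cpuset and writing them requires root.
root=$(mktemp -d)
mkdir -p "$root/system.slice"
echo 0 > "$root/cpuset.mems"           # parent cpuset owns memory node 0
: > "$root/system.slice/cpuset.mems"   # child cpuset starts out empty

# The fix: seed the empty child cpuset from the parent, as in the echo above.
cat "$root/cpuset.mems" > "$root/system.slice/cpuset.mems"
cat "$root/system.slice/cpuset.mems"
rm -r "$root"
```

On a real system the same one-liner against /sys/fs/cgroup/cpuset/system.slice/cpuset.mems is what makes `docker run` succeed again.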

@huikang
Contributor

huikang commented Sep 15, 2015

@eliams Thanks for the detailed information. I will look at it.

@huikang
Contributor

huikang commented Sep 15, 2015

@eliams The reason that all the images are lost after rebooting the machine is that all directories under /var/run/ are re-created each time you reboot the machine. Also make sure you do not have another default docker daemon running, which may also create this directory.

So to avoid this, when you start the docker-checkpoint-v1.9 daemon, use --exec-root=/root/ to point it to another place.

@boucher since we need to keep the checkpointed images persistent, the default exec-root should not be the ephemeral /var/run. Thoughts?
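For reference, a sketch of the suggested invocation; the binary name matches the one used earlier in this thread, and the directory is illustrative (any persistent path outside /var/run works):

```shell
sudo ./docker-1.9.0-dev daemon --exec-root=/var/lib/docker-exec
```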

@boucher
Author

boucher commented Sep 15, 2015

Yes, we should move it to /var/lib/docker


@huikang
Contributor

huikang commented Sep 15, 2015

@boucher cool. If you want, I can send you a PR based on your cr-combined branch later.

@eliams

eliams commented Sep 16, 2015

I tried the --exec-root option; I end up with the same error, and I still needed the previous echo command to be able to run containers again.

Restore command output:
Error response from daemon: Cannot restore container 7695d91e663c: criu failed: type RESTORE errno 0
Error: failed to restore one or more containers

Daemon logs:
ERRO[0032] Error restoring container: criu failed: type RESTORE errno 0, exitCode={-1 %!d(bool=false)}
ERRO[0033] Handler for POST /containers/{name:.*}/restore returned error: Cannot restore container 7695d91e663c: criu failed: type RESTORE errno 0
ERRO[0033] HTTP Error err=Cannot restore container 7695d91e663c: criu failed: type RESTORE errno 0 statusCode=500

Restore logs:
(00.041444) cg: Preparing cgroups yard (cgroups restore mode 0x4)
(00.041486) cg: Opening .criu.cgyard.9VdaGJ as cg yard
(00.041493) cg: Making controller dir .criu.cgyard.9VdaGJ/cpuset (cpuset)
(00.041609) cg: Created cgroup dir cpuset/system.slice/docker-7695d91e663ca626934f3a50fdaf057470bd92212488804b809cac1c1d73387c.scope
(00.041634) Error (cgroup.c:978): cg: Failed closing cpuset/system.slice/docker-7695d91e663ca626934f3a50fdaf057470bd92212488804b809cac1c1d73387c.scope/cpuset.cpus: Permission denied
(00.041637) Error (cgroup.c:1086): cg: Restoring special cpuset props failed!

Full logs are in my previous message.
I tried to restore the container into a new one with --force=true and ended up in the same situation (restoring into another container works fine if I don't reboot the computer).

I tried to do the echo command before restoring the container and I got:
echo 0 > /sys/fs/cgroup/cpuset/system.slice/cpuset.mems
bash: /sys/fs/cgroup/cpuset/system.slice/cpuset.mems: No such file or directory

I also tried to redo the restore command after the first one failed and after doing the echo 0. The second one didn't work either, but I ended up with a different restore log: http://pastebin.com/j3t1N875

Finally, I found by accident a way to make the restore work: I just need to run a new container before trying to restore the checkpointed container.
I can't figure out what that changes, but I'd be interested to know the answer.

Here is the restore log: http://pastebin.com/Yt1DJq5s
And the daemon logs: http://pastebin.com/gZwWRTRD

xemul referenced this issue Sep 16, 2015
Some controllers can be disabled in kernel options. In this case they
are shown in /proc/cgroups, but they could not be mounted.

All enabled controllers can be collected from /proc/self/cgroup.

https://github.com/xemul/criu/issues/28

v2: ',' is used to separate controllers

Cc: Tycho Andersen <tycho.andersen@canonical.com>
Reported-by: Ross Boucher <boucher@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Tycho Andersen <tycho.andersen@canonical.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
@xemul
Member

xemul commented Sep 16, 2015

Patch d3be641 regarding this issue is in master
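The approach of that patch (collect the enabled controllers from /proc/self/cgroup, where the controller field may hold several names joined by ',', per the "v2" note in the commit message) can be sketched as follows; the sample file content is hypothetical:

```shell
# Hypothetical /proc/self/cgroup content; the format is
# hierarchy-ID:controller-list:cgroup-path, and co-mounted
# controllers appear comma-separated, e.g. "cpu,cpuacct".
sample='5:cpuset:/
3:cpu,cpuacct:/
1:memory:/'

# Split the second field on ',' to enumerate individual controllers.
printf '%s\n' "$sample" |
    awk -F: '{ n = split($2, c, ","); for (i = 1; i <= n; i++) print c[i] }'
```

Only controllers that actually appear here are mountable, which is why this source is preferable to /proc/cgroups for CRIU's purposes.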

@eliams

eliams commented Sep 16, 2015

I tried with the patch; I still have the same issue.

@huikang
Contributor

huikang commented Sep 16, 2015

@eliams Did using --exec-root solve the problem of missing criu images?

@xemul Will this patch in runc (opencontainers/runc#184) help to solve the cgroup problem?

@xemul
Member

xemul commented Sep 16, 2015

@huikang it looks like we have several issues discussed here :) The first one was that disabled cgroups weren't handled by CRIU; that is now fixed by @avagin. The next issue was from @eliams, regarding the killed and restarted Docker daemon. This particular one (AFAIU) is fixed in @boucher's tree, and the pull request you mention is the fix.

@huikang
Contributor

huikang commented Sep 17, 2015

@eliams I use --exec-root to store checkpointed container images in a place other than /var/run/docker. Then I restart my machine and start the docker daemon. The checkpointed container can be restored without any fault.

My host OS is ubuntu 14.04
What is your host OS? Thanks.

(00.152500) Running network-unlock scripts
(00.152542)     RPC
(00.155729) Restore finished successfully. Resuming tasks.
(00.156029) 2613 was trapped
(00.156066) 2613 is going to execute the syscall ffffffffffffffff
(00.156146) 2599 was trapped
(00.156169) 2599 is going to execute the syscall ffffffffffffffff
(00.156245) 2613 was trapped
(00.156267) 2613 is going to execute the syscall f
(00.156548) 2613 was stopped
(00.156583) 2599 was trapped
(00.156596) 2599 is going to execute the syscall f
(00.157069) 2599 was stopped
(00.157509) 2599 was trapped
(00.157545) 2599 is going to execute the syscall b
(00.157965) 2599 was stopped
(00.158413) 2613 was trapped
(00.158442) 2613 is going to execute the syscall b

@eliams

eliams commented Sep 17, 2015

@xemul Sorry, I did not realize which issue the patch was for. It indeed fixes the disabled cgroups issue. Thanks!

The second issue (restarting the docker daemon before restoring worked only the first time; the second time, the container was killed by the daemon at restart) was fixed in @boucher's tree.

@huikang Using --exec-root solved the problem of the missing criu images.

The last issue now is that when I reboot my machine I can't restore a container.
restore logs: http://pastebin.com/U6WYvRw1
After this error I can't run new containers. To be able to do it again I have to execute this command:
echo 0 > /sys/fs/cgroup/cpuset/system.slice/cpuset.mems
I found that if I run a new container before trying to restore the container checkpointed before the reboot, the restore doesn't fail.

I'm on debian with a custom kernel:
Linux debian 4.0.0+ #2 SMP Fri Sep 11 14:27:32 CEST 2015 x86_64 GNU/Linux
here is the config file used for compiling the kernel http://pastebin.com/b92cWgyG

criu:
Version: 1.7
GitID: v1.7-22-gd3be641

Docker:
Client:
Version: 1.9.0-dev
API version: 1.21
Go version: go1.4.2
Git commit: 2919249
Built: Mon Sep 14 09:41:18 UTC 2015
OS/Arch: linux/amd64
Experimental: true

Server:
Version: 1.9.0-dev
API version: 1.21
Go version: go1.4.2
Git commit: 2919249
Built: Mon Sep 14 09:41:18 UTC 2015
OS/Arch: linux/amd64
Experimental: true

@xemul
Member

xemul commented Sep 17, 2015

@eliams cool :) The issue with uninitialized cpuset.mems is #16. And this one is about the disabled memcg-s, so to keep things clear let me close it.

@xemul xemul closed this as completed Sep 17, 2015
avagin referenced this issue in avagin/criu Sep 17, 2015