Skip to content
This repository has been archived by the owner on Dec 13, 2018. It is now read-only.

cr: use the RPC protocol for communication with criu #486

Merged
merged 7 commits into from
Apr 9, 2015

Conversation

avagin
Copy link
Contributor

@avagin avagin commented Mar 27, 2015

criu swrk will be supported starting with CRIU 1.6 (1 June 2015), so this patch is only for review by now.

criu swrk is a special mode when we don't want to execute a service, but want to use RPC.

Here is more details:
http://lists.openvz.org/pipermail/criu/2015-March/019400.html

criu swrk allows to remove the hack with saving StdFds. And in future, we will be able to handle notifications from CRIU without additional scripts.

Jenkins will fail because goprotobuf are not added in dependencies.

Cc: @SaiedKazemi

@LK4D4
Copy link
Contributor

LK4D4 commented Mar 27, 2015

That's really awesome.

@avagin avagin changed the title [RFC] cr: use the RPC protocol for communication with criu [DRAFT] cr: use the RPC protocol for communication with criu Mar 27, 2015
@SaiedKazemi
Copy link
Contributor

@avagin @mrunalp
Yes, I agree that SWRK mode would be the way to go. The only issue is that currently we don't require a minimum version of CRIU because all command line options have been there for a long time (this is something that @mrunalp asked). In case someone is running a very old version of CRIU, it will print an error message and exit. Using CRIU in SWRK mode will require a version check of at least v1.6.

@xemul
Copy link

xemul commented Mar 27, 2015

@SaiedKazemi , yes but swrk mode itself exists in older versions too (it appeared in 1.2). AFAIU the only reason for 1.6 is that inherit-fd option will appear. Is absence of inherit-fd a blocker? I mean, will it be impossible to C/R anything without it?

@SaiedKazemi
Copy link
Contributor

@xemul, without inherit-fd support docker logs will not work. I assume most use cases would want this functionality. But I am not the best person to answer the question whether it's a blocker or not.

@xemul
Copy link

xemul commented Mar 27, 2015

@SaiedKazemi , I see. So without swrk's inherit_fd C/R in libcontainer will have to work with CRIU's CLI to be usable.

I've looked at the patch to CRIU @avagin has sent. I think we can make CRIU 1.5.1 with this patch to make swrk mode usable for libcontainer. All the more so we will have an issue with newer kernel soon that was fixed by Oleg Nesterov from RedHat.

@mrunalp
Copy link
Contributor

mrunalp commented Mar 27, 2015

👍 I just noticed that Fedora 21 has criu 1.3.1 Does it mean that the libcontainer criu integration won't work on it? I also see that @avagin is maintaining the criu packages on Fedora. Would it be possible to support newer versions of criu on Fedora 21 (it has kernel 3.19 now)?

@SaiedKazemi
Copy link
Contributor

That's a bummer. For restore to be fully functional, we needed two options that were added to CRIU v1.4. The --inherit-fd option (0412152fc5) and extending the already existing --veth-pair option (296129295a). This means that the CRIU command line options in the new libcontainer will not work with 1.3.1.

@avagin
Copy link
Contributor Author

avagin commented Mar 27, 2015

@mrunalp @SaiedKazemi @xemul I am going to update CRIU in FC21.

Ubuntu has CRIU 1.3.1 too. We need to update CRIU there too. So I think we can realease criu 1.5.1 and use criu swrk in libcontainer.

root@ubuntu:/home/avagin# criu -V
Version: 1.3.1
root@ubuntu:/home/avagin# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.10
DISTRIB_CODENAME=utopic
DISTRIB_DESCRIPTION="Ubuntu 14.10"

@xemul
Copy link

xemul commented Mar 27, 2015

OK, so the next steps would be 1.5.1 and Fedora's package update :)

@mrunalp
Copy link
Contributor

mrunalp commented Mar 27, 2015

@avagin Thanks! :)
@xemul That sounds like a plan.

@avagin avagin force-pushed the swrk branch 2 times, most recently from 73c55ec to c58d357 Compare March 30, 2015 12:24
@xemul
Copy link

xemul commented Mar 31, 2015

So, the criu-1.5.1 is out :) Feel free to pick one up and use!

@avagin avagin changed the title [DRAFT] cr: use the RPC protocol for communication with criu cr: use the RPC protocol for communication with criu Apr 1, 2015
@avagin
Copy link
Contributor Author

avagin commented Apr 1, 2015

I think this request is ready to be merged. @SaiedKazemi @crosbymichael @LK4D4 @mrunalp could you review this?

@SaiedKazemi
Copy link
Contributor

@avagin @crosbymichael @LK4D4 @mrunalp @xemul
Yes, I think it's ready. There is, however, an issue with storing CRIU images directory inside the container's root. As you saw in my comment, the caller has to pass in the directory as an argument (preferably outside the container's root) to Checkpoint() and Restore().

In the background, I am working with Pavel and Tycho Andersen of Canonical to prevent CRIU's temporary cgyard mount points leaking into the container. But this should not be a show stopper for merging this PR.

@mrunalp
Copy link
Contributor

mrunalp commented Apr 1, 2015

@avagin @SaiedKazemi Thanks! Yes, I will try this out and add review comments if any (hopefully by tomorrow).

@avagin avagin force-pushed the swrk branch 2 times, most recently from a062638 to 0797e7b Compare April 2, 2015 13:20
@avagin
Copy link
Contributor Author

avagin commented Apr 2, 2015

@SaiedKazemi This request is only about criu swrk. I agree that a user should be able to specify a checkpoint directory. As for cgyard I think it's a good to split workdir and imagedir. In this case the problem with cgyard will disapear, if I don't mess something. I have added a commit which sets a separate directory as workdir for criu.

@SaiedKazemi
Copy link
Contributor

@avagin Great. With your commit to pass an image directory outside container's root, we won't need to change CRIU.

@crosbymichael
Copy link
Contributor

The image directory should never be inside the container's root. That is just an implementation detail of running nsinit from the current dir. you can change it via nsinit --root. It will always and should always be in another place

@SaiedKazemi
Copy link
Contributor

@crosbymichael
Yes, that's how it should be done. But the current code in Checkpoint() is hard coded to create it as:
dir := filepath.Join(c.root, "checkpoint")
which ends up in the container's root -- that's why I marked it with XXX.

@avagin said that he already has a commit that passes the image directory path to both Checkpoint() and Restore().

@avagin
Copy link
Contributor Author

avagin commented Apr 2, 2015

@SaiedKazemi you misunderstood me. I split imagedir and workdir. workdir contains temporary files. imagedir is required for restoring a container. Pls, take a look at avagin@0797e7b

@mrunalp
Copy link
Contributor

mrunalp commented Apr 2, 2015

I built criu 1.5.1 on F21 and I see a failure when I try to checkpoint a container

[root@localhost redis]# nsinit exec -t --net host /usr/local/bin/redis-server

# Another terminal
[root@localhost redis]# nsinit --criu /usr/bin/criu checkpoint                                                       
>>> criu [dump -v4 --images-dir /redis/nsinit/checkpoint --work-dir /redis/nsinit/criu.work -o dump.log --root /redis --manage-cgroups --evasive-devices -t 15126]
exit status 1
[root@localhost redis]# tail nsinit/checkpoint/dump.log 
(00.038113)   <--
(00.038115)   [./dev/mqueue](143->140)
(00.038116)   <--
(00.038117)  <--
(00.038119) <--
(00.038122) Error (mount.c:624): 145:./dev/console doesn't have a proper root mount
(00.038124) Unlock network
(00.038126) Unfreezing tasks into 1
(00.038127)     Unseizing 14482 into 1
(00.038150) Error (cr-dump.c:1979): Dumping FAILED.

@mrunalp
Copy link
Contributor

mrunalp commented Apr 2, 2015

I am also seeing the same failure on the criu branch (without these changes).

@mrunalp
Copy link
Contributor

mrunalp commented Apr 2, 2015

@avagin Could we clean up the commit history a bit? It is difficult to load the entire PR in a web page so going commit by commit to review. However, I see that some code was removed in later commits.

@mrunalp
Copy link
Contributor

mrunalp commented Apr 2, 2015

@avagin Up to you though, I can manage either ways :)

@crosbymichael
Copy link
Contributor

@SaiedKazemi container.root is not Container.Rootfs They are two different directories. The rootfs is the filesystem for the container. The container.root is where runtime state like configs are saved and can be used for criu image also.

We dont' have to worry about the TTY usecase because it's not common and we can keep this morning and get it merged soon. TTY is a nice to have, not a requirement by a long shot.

@SaiedKazemi
Copy link
Contributor

@crosbymichael What I see is:
Rootfs: /home/saied/work/nsinit/busybox
root: /home/saied/work/nsinit/busybox/nsinit
Therefore, the "checkpoint" directory that Checkpoint() creates is inside the container's root filesystem. As a result, if you run mount after restore, you will see a bunch of cgyard mount points in the container.
[Terminal A]
$ sudo nsinit exec -- sh -i
[Terminal B]
$ sudo nsinit -criu /usr/local/bin/criu checkpoint
[Termina A]
$ sudo nsinit -criu /usr/local/bin/criu restore
/ # mount
...
none on /nsinit/checkpoint/.criu.cgyard.mc2dLq type tmpfs (rw,relatime)
...
If you try this, you will see the same thing. Have I missed something?

@avagin
Copy link
Contributor Author

avagin commented Apr 3, 2015

@SaiedKazemi I would recomend you to use the --root option. Default values cannot be suitable for all cases.
nsinit --root /var/run/nsinit exec ...

@crosbymichael
Copy link
Contributor

Lets change the default so it's less confusing when testing

@crosbymichael
Copy link
Contributor

I opened #507 to change it so it's less confusing for people.

@crosbymichael
Copy link
Contributor

Ok, it's merged and I also rebased the criu branch. Sorry @avagin

In this case CRIU will exit after restoring processes. Here is
no reason to wait the init process.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
criu swrk is a special mode when we don't want to execute a service,
but want to use RPC.

Here is more details:
http://lists.openvz.org/pipermail/criu/2015-March/019400.html

Another good feature of this mode is that we don't need to create
action scripts and we will be able to remove the hack with saving StdFds.

criu swrk is supported starting with CRIU 1.5.1.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
CRIU version must be 1.5.1 or higher

Signed-off-by: Andrey Vagin <avagin@openvz.org>
This directory can be removed when criu completes.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
@avagin
Copy link
Contributor Author

avagin commented Apr 3, 2015

@crosbymichael I've rebased this branch too.

Everything must be in vendor/

Signed-off-by: Andrey Vagin <avagin@openvz.org>
@SaiedKazemi
Copy link
Contributor

@crosbymichael @avagin Anything to do for this PR to merge? Also, what needs to be done for the criu branch to merge into master?

@avagin
Copy link
Contributor Author

avagin commented Apr 7, 2015

@crosbymichael @SaiedKazemi I think we can merge it into the criu branch.

@vmarmol
Copy link
Contributor

vmarmol commented Apr 9, 2015

Spoke offline with @SaiedKazemi and I'll go ahead and merge this into the CRIU branch. We've been more open with allowing some development like this in branches before we merge into master (same as we did for the API branch). Let me know if you completely disagree @crosbymichael, but I think this is Kosher from what we've done before :)

vmarmol added a commit that referenced this pull request Apr 9, 2015
cr: use the RPC protocol for communication with criu
@vmarmol vmarmol merged commit 7b3359e into docker-archive:criu Apr 9, 2015
@crosbymichael
Copy link
Contributor

@vmarmol your a maintainer just like me, no worries ;)

@SaiedKazemi
Copy link
Contributor

@vmarmol @crosbymichael
Great. What's the plan to merge into master?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants