Implement container restore functionality #2335

Open
YJDoc2 opened this issue Sep 1, 2023 · 12 comments

@YJDoc2
Collaborator

YJDoc2 commented Sep 1, 2023

ref : continues from #142
Currently youki supports checkpointing (under the command name checkpointt), but the restore part has not been implemented yet. We should do that.

Note: Before anyone starts on this, make sure the following commands work as expected on your system:

# run a loop which keeps printing numbers, with runc runtime ; in background
sudo podman run --runtime runc -dt fedora bash -c "v=0;while true;do sleep 1; echo \"\$v\"; let \"v++\";done;"
# get the name/id of the launched container
sudo podman ps

# this attaches the current console to the container. DO NOT press ctrl+c to exit; use the `a` key instead
# keep it running for some time and let the number increase
sudo podman attach <container-id/name> --detach-keys=a

# press a to detach again

# checkpoint and shut-down container
sudo podman container checkpoint <container-id/name>

# wait for a bit

# restore the container
sudo podman container restore <container-id/name>

# attach again immediately
sudo podman attach <container-id/name> --detach-keys=a

# the output should continue from numbers considerably greater than those seen in the previous attach

After that, implement restore in youki, rename checkpointt to checkpoint, and make the above work with youki instead of runc.

Another note: the criu library is quite particular about which kernel versions it supports. If criu fails with a segfault, check previous criu issues to see whether you need to upgrade or downgrade the library version for your kernel.

I ran the above on an Ubuntu-based system, kernel version 6.4.6, criu v3.17.1 (3.16 does not work).

@YJDoc2
Collaborator Author

YJDoc2 commented Sep 1, 2023

@anti-entropy123
Contributor

anti-entropy123 commented Sep 23, 2023

Hey, I'm trying to research checkpoint and restore. However, I've noticed that there seem to be some problems with the current checkpoint implementation.

> sudo podman run --runtime ~/rust_project/youki/youki -dt fedora bash -c "v=0;while true;do sleep 1; echo \"\$v\"; let \"v++\";done;"

> sudo podman container --runtime ~/rust_project/youki/youki ps                 
CONTAINER ID  IMAGE                                     COMMAND               CREATED       STATUS           PORTS       NAMES
eb0b484cafdc  registry.fedoraproject.org/fedora:latest  bash -c v=0;while...  17 minutes ago  Up 17 minutes ago              musing_cohen

> sudo podman container --runtime ~/rust_project/youki/youki checkpoint eb0b484cafdc21a4d9
DEBUG youki: started by user 0 with ArgsOs { inner: ["/home/yjn/rust_project/youki/youki", "checkpoint", "--image-path", "/var/lib/containers/storage/overlay-containers/eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280/userdata/checkpoint", "--work-path", "/var/lib/containers/storage/overlay-containers/eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280/userdata", "eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280"] }
DEBUG youki::commands::checkpoint: start checkpointing container eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280
ERROR libcontainer::container::container_checkpoint: failed to open criu image directory path="/var/lib/containers/storage/overlay-containers/eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280/userdata/checkpoint" err=Os { code: 2, kind: NotFound, message: "No such file or directory" }
ERROR youki: error in executing command: failed to checkpoint container eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280

Caused by:
    0: io error
    1: No such file or directory (os error 2)
Error: failed to checkpoint container eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280

Caused by:
    0: io error
    1: No such file or directory (os error 2)
Error: `/home/yjn/rust_project/youki/youki checkpoint --image-path /var/lib/containers/storage/overlay-containers/eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280/userdata eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280` failed: exit status 1

Of course, I've made the necessary changes to rename the checkpointt subcommand to checkpoint.

> git diff | cat
diff --git a/crates/liboci-cli/src/lib.rs b/crates/liboci-cli/src/lib.rs
index 89c48a6d..03a5ae2e 100644
--- a/crates/liboci-cli/src/lib.rs
+++ b/crates/liboci-cli/src/lib.rs
@@ -50,7 +50,7 @@ pub enum StandardCmd {
 // and other runtimes.
 #[derive(Parser, Debug)]
 pub enum CommonCmd {
-    Checkpointt(Checkpoint),
+    Checkpoint(Checkpoint),
     Events(Events),
     Exec(Exec),
     Features(Features),
diff --git a/crates/youki/src/main.rs b/crates/youki/src/main.rs
index 6a92be8d..7f0e23c7 100644
--- a/crates/youki/src/main.rs
+++ b/crates/youki/src/main.rs
@@ -116,7 +116,7 @@ fn main() -> Result<()> {
             StandardCmd::State(state) => commands::state::state(state, root_path),
         },
         SubCommand::Common(cmd) => match *cmd {
-            CommonCmd::Checkpointt(checkpoint) => {
+            CommonCmd::Checkpoint(checkpoint) => {
                 commands::checkpoint::checkpoint(checkpoint, root_path)
             }
             CommonCmd::Events(events) => commands::events::events(events, root_path),

I believe the cause of the error is likely not related to my system environment (e.g., CRIU) because I can perform checkpoint and restore using podman + runc.

@anti-entropy123
Contributor

After resolving the mentioned No such file or directory error, there are still some CRIU-related errors:

> sudo podman container checkpoint fb8bc5974
DEBUG youki: started by user 0 with ArgsOs { inner: ["/home/yjn/rust_project/youki/youki", "checkpoint", "--image-path", "/var/lib/containers/storage/overlay-containers/fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea/userdata/checkpoint", "--work-path", "/var/lib/containers/storage/overlay-containers/fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea/userdata", "fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea"] }
DEBUG youki::commands::checkpoint: start checkpointing container fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea
ERROR libcontainer::container::container_checkpoint: checkpointing container failed err="CRIU RPC request failed with message:Error (criu/files-reg.c:1815): Can't lookup mount=26 for fd=0 path=/dev/pts/5\n error:0" id="fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea" logfile="/var/lib/containers/storage/overlay-containers/fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea/userdata/checkpoint/dump.log"
ERROR youki: error in executing command: failed to checkpoint container fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea

Caused by:
    CRIU RPC request failed with message:Error (criu/files-reg.c:1815): Can't lookup mount=26 for fd=0 path=/dev/pts/5
     error:0
Error: failed to checkpoint container fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea

Caused by:
    CRIU RPC request failed with message:Error (criu/files-reg.c:1815): Can't lookup mount=26 for fd=0 path=/dev/pts/5
     error:0
Error: `/home/yjn/rust_project/youki/youki checkpoint --image-path /var/lib/containers/storage/overlay-containers/fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea/userdata fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea` failed: exit status 1

I'd like to know if these errors can be reproduced by others? Is it necessary to open a separate issue to address the potential problems with checkpoint?

@YJDoc2
Collaborator Author

YJDoc2 commented Sep 25, 2023

After resolving the mentioned No such file or directory error,

Hey, can you mention what the issue behind this error was, and how you resolved it?

I'd like to know if these errors can be reproduced by others? Is it necessary to open a separate issue to address the potential problems with checkpoint?

I haven't tried running checkpoint before; I had assumed it was working. I will try running it and check for the errors you encountered. If this is a bug and not just a setup issue, then we can either fix it along with the restore implementation or open a separate issue.

@anti-entropy123
Contributor

can you mention what the issue behind this error was, and how you resolved it?

It's quite simple, just create the missing directories directly. (It seems to be the case in runc as well.)

> git diff | cat -p -P
diff --git a/crates/libcontainer/src/container/container_checkpoint.rs b/crates/libcontainer/src/container/container_checkpoint.rs
index a6054734..25a08ba6 100644
--- a/crates/libcontainer/src/container/container_checkpoint.rs
+++ b/crates/libcontainer/src/container/container_checkpoint.rs
@@ -15,6 +15,10 @@ const DESCRIPTORS_JSON: &str = "descriptors.json";
 
 impl Container {
     pub fn checkpoint(&mut self, opts: &CheckpointOptions) -> Result<(), LibcontainerError> {
+        if !opts.image_path.is_dir() {
+            fs::create_dir_all(&opts.image_path).expect("failed.")
+        };
+        
         self.refresh_status()?;
 
         // can_pause() checks if the container is running. That also works for
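
As a rough sketch of the same idea without the panic from `expect`, the directory creation could be pulled into a small helper that returns the I/O error to the caller. The helper name `ensure_image_dir` is hypothetical and not part of youki; how the `io::Error` would be mapped into `LibcontainerError` is left open here, since that depends on the error variants libcontainer actually defines.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper (not part of youki): make sure the CRIU image
/// directory exists before a checkpoint is attempted.
fn ensure_image_dir(image_path: &Path) -> io::Result<()> {
    // create_dir_all creates any missing parent directories and returns
    // Ok(()) when the directory already exists, so no is_dir() pre-check
    // is needed. It returns an error if the path exists but is not a
    // directory, or if creation fails (for example, due to permissions).
    fs::create_dir_all(image_path)
}
```

Inside `checkpoint`, this would be called as something like `ensure_image_dir(&opts.image_path)` with the error propagated via `?`, assuming a suitable conversion from `io::Error` exists for the function's error type.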

@YJDoc2
Collaborator Author

YJDoc2 commented Sep 26, 2023

There is indeed an issue in the checkpoint implementation, as the same error occurs on my system as well. There is an open issue on criu with a similar error, checkpoint-restore/criu#1785, but it needs more investigation into why it is happening. Thanks for checking and reporting. The initial issue of image_path not existing also needs to be checked; I am verifying how runc handles this...

@anti-entropy123
Contributor

Thanks for sharing too~

verifying on how runc handles this

I noticed that runc also handles image_path by creating it directly: https://github.com/opencontainers/runc/blob/a32ad76da330c20c27b79ccbd20ff58629fc4b7d/libcontainer/criu_linux.go#L303C15-L303C15

@YJDoc2
Collaborator Author

YJDoc2 commented Sep 26, 2023

CRIU RPC request failed with message:Error (criu/files-reg.c:1815): Can't lookup mount=26 for fd=0 path=/dev/pts/5

@adrianreber can you help us with some suggestions regarding this checkpointing error? I saw the issues checkpoint-restore/criu#860 and checkpoint-restore/criu#1785, but the kernel issue in the first one does not seem applicable. As mentioned by @anti-entropy123, checkpointing works with runc, so what could be a potential cause of this particular error, or what would be a good way to start debugging it?

@adrianreber
Contributor

Not sure what the question is, but I can only recommend not using Ubuntu for CRIU. There are non-upstream kernel patches which break CRIU all the time. Sorry.

@YJDoc2
Collaborator Author

YJDoc2 commented Sep 26, 2023

Hey, sorry if I wasn't clear:

While trying out the current implementation of youki's checkpoint, both @anti-entropy123 and I are getting this error from criu: CRIU RPC request failed with message:Error (criu/files-reg.c:1815): Can't lookup mount=26 for fd=0 path=/dev/pts/5. Running checkpoint with runc works fine and gives no error, so I didn't think the kernel would be the issue, as runc successfully uses criu to checkpoint.
The issues I linked in the previous comment were about the same error message, but the first one concerns kernel problems (on which even runc was failing, hence I don't think it is applicable here), and the second one is still open. I wanted to ask if you have any idea why this error might crop up, or a good place to start debugging why it is thrown? Thanks :)

@adrianreber
Contributor

Ah, I see. The current implementation only works without a connected terminal. To handle the terminal correctly, additional steps are necessary. During restore in particular, a callback is necessary to tell youki the correct tty FD.

You should look at crun, as the criu Rust bindings are architecturally closer to the C bindings.
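
Until terminal handling is implemented, one option would be to reject the request up front with a readable message instead of surfacing the CRIU lookup error. The following is only a sketch of that idea: `PrecheckError` and `precheck_checkpoint` are hypothetical names, not youki's API, and the `terminal` argument is assumed to come from the optional `process.terminal` field of the container's OCI `config.json` (which defaults to false when absent).

```rust
use std::fmt;

/// Hypothetical error type for this sketch (not part of youki).
#[derive(Debug, PartialEq, Eq)]
enum PrecheckError {
    /// The container was created with a terminal attached.
    TerminalAttached,
}

impl fmt::Display for PrecheckError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PrecheckError::TerminalAttached => write!(
                f,
                "checkpointing a container with a connected terminal is not supported yet"
            ),
        }
    }
}

/// Hypothetical pre-check: `terminal` mirrors the optional
/// `process.terminal` boolean from the OCI runtime config.
fn precheck_checkpoint(terminal: Option<bool>) -> Result<(), PrecheckError> {
    // A missing value is treated as false, matching the OCI default.
    if terminal.unwrap_or(false) {
        Err(PrecheckError::TerminalAttached)
    } else {
        Ok(())
    }
}
```

With a check like this, a container started with `-t` would fail early with the message above, while one started without `-t` would proceed to the CRIU dump as before.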

@anti-entropy123
Contributor

Thank you for your help. I removed the -t flag, and now it's working fine. @adrianreber
