Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

criu-ns: Add tests for criu-ns script #2114

Merged
merged 5 commits into from
May 28, 2023

Conversation

warusadura
Copy link
Member

These changes add test implementations for criu-ns script.

Fixes: #1909

@mihalicyn mihalicyn requested a review from Snorch March 8, 2023 13:10
@Snorch
Copy link
Member

Snorch commented Mar 10, 2023

You can enable your test to run in CI like this:

diff --git a/scripts/ci/run-ci-tests.sh b/scripts/ci/run-ci-tests.sh
index 5b9f6d929..dc82432ef 100755
--- a/scripts/ci/run-ci-tests.sh
+++ b/scripts/ci/run-ci-tests.sh
@@ -259,6 +259,7 @@ if [ -n "$TRAVIS" ] || [ -n "$CIRCLECI" ]; then
        # GitHub Actions (and Cirrus CI) does not provide a real TTY and CRIU will fail with:
        # Error (criu/tty.c:1014): tty: Don't have tty to inherit session from, aborting
        make -C test/others/shell-job/ run
+       make -C test/others/criu-ns/ run
 fi
 make -C test/others/skip-file-rwx-check/ run
 make -C test/others/rpc/ run

@warusadura
Copy link
Member Author

Understood, thanks @Snorch

@mihalicyn
Copy link
Member

@warusadura do you have any specific questions about pty?

@warusadura
Copy link
Member Author

warusadura commented Mar 13, 2023

@warusadura do you have any specific questions about pty?

@mihalicyn thanks for asking. Yes,
the most troublesome one is, I can't wrap my head around the concepts of master and slave: pty.openpty(). And the Python documentation is bit cryptic and doesn't provide any insight on them :)

@codecov-commenter
Copy link

codecov-commenter commented Mar 13, 2023

Codecov Report

Patch coverage has no change and project coverage change: +0.05 🎉

Comparison is base (fbe2692) 70.38% compared to head (eb41d8e) 70.43%.

❗ Current head eb41d8e differs from pull request most recent head 4898e1f. Consider uploading reports for the commit 4898e1f to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##           criu-dev    #2114      +/-   ##
============================================
+ Coverage     70.38%   70.43%   +0.05%     
============================================
  Files           133      133              
  Lines         34008    34008              
============================================
+ Hits          23936    23955      +19     
+ Misses        10072    10053      -19     

see 4 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@warusadura warusadura marked this pull request as draft March 15, 2023 01:53
@warusadura warusadura force-pushed the issue#1909 branch 3 times, most recently from 8f45a00 to 77a33f2 Compare March 16, 2023 11:05
@warusadura
Copy link
Member Author

@warusadura do you have any specific questions about pty?

@mihalicyn thanks for asking. Yes, the most troublesome one is, I can't wrap my head around the concepts of master and slave: pty.openpty(). And the Python documentation is bit cryptic and doesn't provide any insight on them :)

@mihalicyn apologies if this question was too broad to answer. Most of the doubts I had about pty was cleared reading this article I found. But, still I couldn't find anything clear related to Python3 :)

I just pushed some new changes. The test_dump_and_restore_with_shell_job() works as expected. But, run.py hangs the execution right after the completion of criu-ns restore. This is expected. But, how do I avoid this (kill the restored process so that remaining tests can run)?

Also, the test_dump_and_restore_without_shell_job() fails the criu-ns dump with this dump.log. I have cleared all the terminal dependencies (fd redirection, new session). So, I'm not sure what I'm doing wrong here.

@warusadura
Copy link
Member Author

Also, the CI is complaining a FileNotFoundError. Is the missing file criu binary executable?

@warusadura warusadura marked this pull request as ready for review March 16, 2023 14:25
test/others/criu-ns/run.py Fixed Show fixed Hide fixed
test/others/criu-ns/run.py Fixed Show fixed Hide fixed
test/others/criu-ns/run.py Fixed Show fixed Hide fixed
test/others/criu-ns/run.py Fixed Show fixed Hide fixed
test/others/criu-ns/run.py Fixed Show fixed Hide fixed
test/others/criu-ns/run.py Fixed Show fixed Hide fixed
@mihalicyn
Copy link
Member

@warusadura do you have any specific questions about pty?

@mihalicyn thanks for asking. Yes, the most troublesome one is, I can't wrap my head around the concepts of master and slave: pty.openpty(). And the Python documentation is bit cryptic and doesn't provide any insight on them :)

@mihalicyn apologies if this question was too broad to answer. Most of the doubts I had about pty was cleared reading this article I found. But, still I couldn't find anything clear related to Python3 :)

I just pushed some new changes. The test_dump_and_restore_with_shell_job() works as expected. But, run.py hangs the execution right after the completion of criu-ns restore. This is expected. But, how do I avoid this (kill the restored process so that remaining tests can run)?

I think you can do it the same way as Andrei did https://github.com/checkpoint-restore/criu/blame/criu-dev/test/others/shell-job/run.py#L24

As you can see that test spawns a new process which sleeps forever until file with the name running will not be removed.
It's removed just before criu restore ... at https://github.com/checkpoint-restore/criu/blame/criu-dev/test/others/shell-job/run.py#L50

Also, the test_dump_and_restore_without_shell_job() fails the criu-ns dump with this dump.log. I have cleared all the terminal dependencies (fd redirection, new session). So, I'm not sure what I'm doing wrong here.

As I can see from the log you still have external pts:

(00.016032) 
(00.016035) Dumping opened files (pid: 129976)
(00.016038) ----------------------------------------
(00.016045) Sent msg to daemon 71 0 0
pie: 129976: __fetched msg: 71 0 0
pie: 129976: __sent ack msg: 71 71 0
pie: 129976: Daemon waits for command
(00.016076) Wait for ack 71 on daemon socket
(00.016082) Fetched ack: 71 71 0
(00.016111) 129976 fdinfo 0: pos:                0 flags:                2/0
(00.016128) tty: Dumping tty 12 with id 0xe
(00.016134) Dumping path for 0 fd via self 12 [/dev/pts/67]

Self-check question:

    fd_m, fd_s = create_pty()
    pid = os.fork()
    if pid == 0:
        os.setsid()
        cmd = ["sleep", "inf"]
        print("Run: %s" % " ".join(cmd))
        subprocess.Popen(cmd, stdin=fd_s, stdout=fd_s,
                               stderr=fd_s).wait()

How many processes you have after this piece of code got executed (suppose that before you had only calling process, i.e. 1)?
Answer is 3.

Why? You have: calling process, then forked process, then child of forked process created with subprocess.Popen.

Now compare this with Andrei's code:

pid = os.fork()
if pid == 0:
    m.close()
    os.setsid()
    os.dup2(s.fileno(), 0) # <== this closes 0 opened previously and replace it with a new fd
    os.dup2(s.fileno(), 1)
    os.dup2(s.fileno(), 2)
    fcntl.ioctl(s.fileno(), termios.TIOCSCTTY, 1)
    pr.close()
    pw.close()
    while True:
        if not os.access("running", os.F_OK):
            sys.exit(0)
        time.sleep(1)
    sys.exit(1)

As you can see we have a few calls to os.dup2(...). This syscall silently closes existing FD if it was opened and replaces it with a new one. At this point we lose connection with an external pts. In your code I can't anything similar to that.

@warusadura
Copy link
Member Author

@mihalicyn Thanks for the detailed response. Understood!

Progress update:
with --shell-job dump & restore - done
without --shell-job dump - done

during the without --shell-job restore process I'm getting this restore.log. Can you please help me to make sense of it. This is the dump.log

Also, the CI is complaining a FileNotFoundError. Is the missing file, criu binary executable? How do I avoid this?

test/others/criu-ns/run.py Fixed Show fixed Hide fixed
test/others/criu-ns/run.py Fixed Show fixed Hide fixed
test/others/criu-ns/run.py Fixed Show fixed Hide fixed
test/others/criu-ns/run.py Fixed Show fixed Hide fixed
@mihalicyn
Copy link
Member

@mihalicyn Thanks for the detailed response. Understood!

Progress update: with --shell-job dump & restore - done without --shell-job dump - done

during the without --shell-job restore process I'm getting this restore.log. Can you please help me to make sense of it. This is the dump.log

Theoretically, we shouldn't face such issue, because if criu dump is successful, then we guarantee that criu restore should be successful too (except a few things, like filesystem state was changed, or we have no specific kernel features on the destination node). But in your case it's just a local dump/restore on the same system.

Let's check what's happening on restore:

(00.002342) 625409: tty: Link PTYs (0xe)
(00.002345) 625409: tty: head driver pts id 0xe index 72 (master 0 sid 0 pgrp 0 inherit 0)
(00.002353) 625409: tty:     `- sibling driver ptmx id 0x10 index 72 (master 1 sid 0 pgrp 0 inherit 0)
...
(00.002855) 625409: Opening fdinfo-s
(00.002858) 625409: tty: open driver pts id 0xe index 72 (master 0 sid 0 pgrp 0 inherit 0)
(00.002868) 625409: 	Receive fd for 1
(00.002872) 625409: 	Receive fd for 2
(00.002876) 625409: tty: open driver ptmx id 0x10 index 72 (master 1 sid 0 pgrp 0 inherit 0)
(00.002954) 625409: tty: 		ptmx opened with index 73
(00.002960) 625409: Error (criu/tty.c:624): tty: Unable to open dev/ptmx with specified index 72
(00.002997) 625409: Error (criu/tty.c:1073): tty: Can't open master pty 0x10 (index 72)
(00.003001) 625409: Error (criu/files.c:1213): Unable to open fd=3 id=0x10
(00.003041) Error (criu/cr-restore.c:2543): Restoring FAILED.

This happens because you haven't closed dev/ptmx, as it done here (https://github.com/checkpoint-restore/criu/blame/criu-dev/test/others/shell-job/run.py#L26). So, at the dump stage criu saved this FD in the images, then on restore CRIU tries to restore this ptmx device with the same index = 72, but this index is busy (because parent process = your script process is still holding this ptmx).

Also, the CI is complaining a FileNotFoundError. Is the missing file, criu binary executable? How do I avoid this?

You can reproduce the issue locally using make -C test/others/criu-ns/ run. Problem is here:

def run_criu(args):
    """
    Spawn CRIU binary
    """
    print(sys.argv)
    os.execlp('criu', *['criu'] + args) # <==== searches criu binary in the current directory + $PATH environment
    raise OSError(errno.ENOENT, "No such command")

You can adjust $PATH appropriately before calling criu-ns.

@Snorch
Copy link
Member

Snorch commented Mar 19, 2023

I have some ideas about the "not found" error in criu-ns, we can add to criu-ns script:

  • environment variable CRIU_BINARY to override default criu path (easier)
  • --criu-binary option to override default criu path (harder, as you would need to somehow separate this option from criu options which is probably not obvious)

and override criu binary to self-compiled version in test.

@warusadura warusadura force-pushed the issue#1909 branch 2 times, most recently from 0753706 to a62bac7 Compare March 20, 2023 11:42
@warusadura warusadura changed the title criu-ns: Add tests for criu-ns script (WIP) criu-ns: Add tests for criu-ns script Mar 20, 2023
@warusadura
Copy link
Member Author

The CI is working \o/

@warusadura warusadura removed the request for review from Snorch March 20, 2023 12:15
test/others/criu-ns/run.py Outdated Show resolved Hide resolved
test/others/criu-ns/run.py Outdated Show resolved Hide resolved
test/others/criu-ns/run.py Outdated Show resolved Hide resolved
test/others/criu-ns/run.py Outdated Show resolved Hide resolved
scripts/ci/run-ci-tests.sh Outdated Show resolved Hide resolved
@Snorch
Copy link
Member

Snorch commented May 18, 2023

You should always fix/explain ci fails, now you have 6:

Fedora ASAN Test - likely unrelared

2023-05-06T17:58:25.5231729Z ############### Test zdtm/static/cr_veth02 FAIL at result check ################
2023-05-06T17:58:25.5232029Z Test output: ================================
2023-05-06T17:58:25.5232367Z 3: zdtmvthc0@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
2023-05-06T17:58:25.5232819Z     link/ether c2:4d:19:cb:e9:8e brd ff:ff:ff:ff:ff:ff link-netnsid 0
2023-05-06T17:58:25.5233068Z 3,4d2
2023-05-06T17:58:25.5233294Z <     inet 192.168.124.124/32 scope global zdtmvthc0
2023-05-06T17:58:25.5233575Z <        valid_lft forever preferred_lft forever
2023-05-06T17:58:25.5233937Z 17:58:25.106:     4: FAIL: cr_veth02.c:49: Net config differs after restore (errno = 11 (Resource temporarily unavailable))
2023-05-06T17:58:25.5234178Z
2023-05-06T17:58:25.5234266Z  <<< ================================

Fedora Rawhide Test - likely unrelated

2023-05-06T18:15:12.0262107Z Wait for zdtm/static/uffd-events(92) to die for 12.800000
2023-05-06T18:15:12.0262725Z Wait for zdtm/static/uffd-events(92) to die for 25.600000
2023-05-06T18:15:12.0263244Z ####### Test zdtm/static/uffd-events FAIL at zdtm/static/uffd-events die #######

CentOS 7 based test - obviousely yours to fix

+ make -C test/others/criu-ns/ run
make[1]: Entering directory `/tmp/criu/test/others/criu-ns'
make[2]: Entering directory `/tmp/criu/test'
make[2]: `zdtm_ct' is up to date.
make[2]: Leaving directory `/tmp/criu/test'
../../zdtm_ct run.py
timens isn't supported on 3.10.0-1160.88.1.el7.x86_64
Traceback (most recent call last):
  File "run.py", line 5, in <module>
    import pathlib
ImportError: No module named pathlib
make[1]: *** [run] Error 1
make[1]: Leaving directory `/tmp/criu/test/others/criu-ns'

CentOS Stream 8 based test/CentOS Stream 9 based test - obviousely yours to fix (there should be no external or detached mounts in your test)

+ make -C test/others/criu-ns/ run
make[1]: Entering directory '/tmp/criu/test/others/criu-ns'
make[2]: Entering directory '/tmp/criu/test'
make[2]: 'zdtm_ct' is up to date.
make[2]: Leaving directory '/tmp/criu/test'
../../zdtm_ct run.py
...
(00.002259) ========================================
(00.002265) Dumping task (pid: 3 comm: python)
(00.002266) ========================================
...
(00.005103) Dumping opened files (pid: 3)
(00.005104) ----------------------------------------
(00.005114) Sent msg to daemon 71 0 0
pie: 3: __fetched msg: 71 0 0
pie: 3: __sent ack msg: 71 71 0
pie: 3: Daemon waits for command
(00.005144) Wait for ack 71 on daemon socket
(00.005147) Fetched ack: 71 71 0
(00.005231) 3 fdinfo 0: pos:                0 flags:           100000/0
(00.005255) Error (criu/files-reg.c:1815): Can't lookup mount=24 for fd=0 path=/dev/null
(00.005259) ----------------------------------------
(00.005264) Error (criu/cr-dump.c:1669): Dump files (pid: 3) failed with -1
(00.005274) Waiting for 3 to trap
(00.005294) Daemon 3 exited trapping
(00.005299) Sent msg to daemon 3 0 0
pie: 3: __fetched msg: 3 0 0
pie: 3: 3: new_sp=0x7f17b506d048 ip 0x7f17b70b2639
(00.014481) 3 was trapped
(00.014547) 3 was trapped
(00.014553) 3 (native) is going to execute the syscall 15, required is 15
(00.014616) 3 was stopped
(00.014827) net: Unlock network
(00.014835) Unfreezing tasks into 1
(00.014837)     Unseizing 3 into 1
(00.014851) Error (criu/cr-dump.c:2093): Dumping FAILED.

Vagrant Fedora based test (no VDSO) - likely unrelated

================= Run zdtm/static/socket-tcp-nfconntrack in h ==================
Start test
Test is SUID
./socket-tcp-nfconntrack --pidfile=socket-tcp-nfconntrack.pid --outfile=socket-tcp-nfconntrack.out
Run criu dump
Run criu restore
=[log]=> dump/zdtm/static/socket-tcp-nfconntrack/59/1/restore.log
------------------------ grep Error ------------------------
b'(00.012506)     59: net: \tRunning ip rule delete table local'
b'(00.014389)     59: net: \tRunning ip rule restore'
b'(00.020732)     59: net: \tRunning iptables-restore -w for iptables-restore -w'
b'(00.025267)     59: net: \tRunning ip6tables-restore -w for ip6tables-restore -w'
b'(00.030073)     59: Error (criu/libnetlink.c:54): -16 reported by netlink: Device or resource busy'
b"(00.030199)     59: Error (criu/util.c:1413): Can't wait or bad status: errno=0, status=65280"
b'(00.030233) Error (criu/cr-restore.c:2547): Restoring FAILED.'
------------------------ ERROR OVER ------------------------
######### Test zdtm/static/socket-tcp-nfconntrack FAIL at CRIU restore #########
Test output: ================================

@Snorch
Copy link
Member

Snorch commented May 22, 2023

  1. Please try to apply those fixes to your patches https://github.com/Snorch/criu/commits/warasadura-criu-ns-test-v7-fixes Hope it will work.

  2. Please fix lineter errors in the above.

  3. Please revert pathlib change as there is no pathlib on centos7 (or fix in any other reasonable way)

@warusadura
Copy link
Member Author

  1. Please try to apply those fixes to your patches https://github.com/Snorch/criu/commits/warasadura-criu-ns-test-v7-fixes Hope it will work.

    1. Please fix lineter errors in the above.

    2. Please revert pathlib change as there is no pathlib on centos7 (or fix in any other reasonable way)

Will do, thanks @Snorch

--criu-binary argument provides a way to supply the CRIU binary
location to run_criu().

Related to: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
These changes remove and update the changes introduced in
7177938 in favor of the
Python version in CI.

os.waitstatus_to_exitcode() function appeared in Python 3.9

Related to: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
@warusadura warusadura force-pushed the issue#1909 branch 2 times, most recently from 4545696 to d673805 Compare May 24, 2023 00:41
test/others/criu-ns/run.py Fixed Show fixed Hide fixed
@Snorch
Copy link
Member

Snorch commented May 26, 2023

CentOS 7 based test

Test criu-ns dump and restore in namespaces
['../../../scripts/criu-ns', 'dump', '-D', 'dumpdir', '-v4', '-t', '49', '--log-file', 'dump.log', '--criu-binary', '../../../criu/criu']
['../../../scripts/criu-ns', 'restore', '-D', 'dumpdir', '-v4', '--log-file', 'restore.log', '--criu-binary', '../../../criu/criu', '--restore-detached', '--pidfile', 'pidfile']
Traceback (most recent call last):
  File "../../../scripts/criu-ns", line 335, in <module>
    res = wrap_dump()
  File "../../../scripts/criu-ns", line 302, in wrap_dump
    set_pidns(pid, pid_idx)
  File "../../../scripts/criu-ns", line 263, in set_pidns
    raise OSError(errno.ENOENT, 'Cannot find NSpid field in proc')

We get this error on second dump where criu-ns script tries to get right pid of task in pidns. This functionality can't work without NSpid: feature, so we need to skip this testcase on Centos 7.

With something like:

diff --git a/test/others/criu-ns/run.py b/test/others/criu-ns/run.py
index faf4d441d..6967b46b2 100755
--- a/test/others/criu-ns/run.py
+++ b/test/others/criu-ns/run.py
@@ -193,6 +193,9 @@ PIDFILE = "pidfile"
 
 
 def test_dump_and_restore_in_pidns():
+    if os.system("grep NSpid /proc/self/status"):
+        return
+
     print("Test criu-ns dump and restore in namespaces")
 
     def _dump():

Except that it looks ready to merge for me.

These changes add test implementations for criu-ns script.

Fixes: checkpoint-restore#1909

Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
These changes fix the `ImportError: No module named pathlib`
error when executing criu-ns tests located at criu/test/others/criu-ns

Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
CentOS 7 CI environment uses Python 2. To execute criu-ns
script in CentOS 7 changing the current shebang line to
python is required.

This reverse the changes made in a15a63f

Signed-off-by: Dhanuka Warusadura <csx@tuta.io>
@warusadura
Copy link
Member Author

CentOS 7 based test

Test criu-ns dump and restore in namespaces
['../../../scripts/criu-ns', 'dump', '-D', 'dumpdir', '-v4', '-t', '49', '--log-file', 'dump.log', '--criu-binary', '../../../criu/criu']
['../../../scripts/criu-ns', 'restore', '-D', 'dumpdir', '-v4', '--log-file', 'restore.log', '--criu-binary', '../../../criu/criu', '--restore-detached', '--pidfile', 'pidfile']
Traceback (most recent call last):
  File "../../../scripts/criu-ns", line 335, in <module>
    res = wrap_dump()
  File "../../../scripts/criu-ns", line 302, in wrap_dump
    set_pidns(pid, pid_idx)
  File "../../../scripts/criu-ns", line 263, in set_pidns
    raise OSError(errno.ENOENT, 'Cannot find NSpid field in proc')

We get this error on second dump where criu-ns script tries to get right pid of task in pidns. This functionality can't work without NSpid: feature, so we need to skip this testcase on Centos 7.

With something like:

diff --git a/test/others/criu-ns/run.py b/test/others/criu-ns/run.py
index faf4d441d..6967b46b2 100755
--- a/test/others/criu-ns/run.py
+++ b/test/others/criu-ns/run.py
@@ -193,6 +193,9 @@ PIDFILE = "pidfile"
 
 
 def test_dump_and_restore_in_pidns():
+    if os.system("grep NSpid /proc/self/status"):
+        return
+
     print("Test criu-ns dump and restore in namespaces")
 
     def _dump():

Except that it looks ready to merge for me.

understood, thanks @Snorch

@rst0git rst0git merged commit ae9402d into checkpoint-restore:criu-dev May 28, 2023
@warusadura warusadura deleted the issue#1909 branch June 26, 2023 07:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

We need test(s) for criu-ns script
6 participants