
travis/openj9: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero. #887

Closed
avagin opened this issue Dec 19, 2019 · 8 comments


@avagin
Member

avagin commented Dec 19, 2019

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running TestSuite
Tests are to be run as a privileged user having capabilities mentioned in ReadMe
Logs directory does not exist, creating it
Output directory does not exist, creating it
======= Testing FileRead ========
Creating the test folder
Starting the java Test
Checkpointing process 94@4e60b79e891b
======= Testing ReadWrite ========
Creating the test folder
Starting the java Test
Checkpointing process 129@4e60b79e891b
======= Testing MemoryMappings ========
Creating the test folder
Starting the java Test
Checkpointing process 161@4e60b79e891b
======= Testing MultipleFileRead ========
Creating the test folder
Starting the java Test
Checkpointing process 193@4e60b79e891b
======= Testing MultipleFileWrite ========
Creating the test folder
Starting the java Test
Checkpointing process 225@4e60b79e891b
Tests run: 5, Failures: 5, Errors: 0, Skipped: 0, Time elapsed: 3.697 sec <<< FAILURE!
runtest(org.criu.java.tests.CheckpointRestore)  Time elapsed: 0.627 sec  <<< FAILURE!
java.lang.AssertionError: FileRead: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/FileRead191218093501
	at org.testng.Assert.fail(Assert.java:89)
	at org.criu.java.tests.CheckpointRestore.checkpoint(CheckpointRestore.java:392)
	at org.criu.java.tests.CheckpointRestore.runtest(CheckpointRestore.java:206)

runtest(org.criu.java.tests.CheckpointRestore)  Time elapsed: 0.593 sec  <<< FAILURE!
java.lang.AssertionError: ReadWrite: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/ReadWrite191218093501
	at org.testng.Assert.fail(Assert.java:89)
	at org.criu.java.tests.CheckpointRestore.checkpoint(CheckpointRestore.java:392)
	at org.criu.java.tests.CheckpointRestore.runtest(CheckpointRestore.java:206)

runtest(org.criu.java.tests.CheckpointRestore)  Time elapsed: 0.615 sec  <<< FAILURE!
java.lang.AssertionError: MemoryMappings: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/MemoryMappings191218093502
	at org.testng.Assert.fail(Assert.java:89)
	at org.criu.java.tests.CheckpointRestore.checkpoint(CheckpointRestore.java:392)
	at org.criu.java.tests.CheckpointRestore.runtest(CheckpointRestore.java:206)

runtest(org.criu.java.tests.CheckpointRestore)  Time elapsed: 0.689 sec  <<< FAILURE!
java.lang.AssertionError: MultipleFileRead: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/MultipleFileRead191218093503
	at org.testng.Assert.fail(Assert.java:89)
	at org.criu.java.tests.CheckpointRestore.checkpoint(CheckpointRestore.java:392)
	at org.criu.java.tests.CheckpointRestore.runtest(CheckpointRestore.java:206)

runtest(org.criu.java.tests.CheckpointRestore)  Time elapsed: 0.511 sec  <<< FAILURE!
java.lang.AssertionError: MultipleFileWrite: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/MultipleFileWrite191218093503
	at org.testng.Assert.fail(Assert.java:89)
	at org.criu.java.tests.CheckpointRestore.checkpoint(CheckpointRestore.java:392)
	at org.criu.java.tests.CheckpointRestore.runtest(CheckpointRestore.java:206)


Results :

Failed tests: 
  CheckpointRestore.runtest:206->checkpoint:392 FileRead: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/FileRead191218093501
  CheckpointRestore.runtest:206->checkpoint:392 ReadWrite: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/ReadWrite191218093501
  CheckpointRestore.runtest:206->checkpoint:392 MemoryMappings: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/MemoryMappings191218093502
  CheckpointRestore.runtest:206->checkpoint:392 MultipleFileRead: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/MultipleFileRead191218093503
  CheckpointRestore.runtest:206->checkpoint:392 MultipleFileWrite: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/MultipleFileWrite191218093503

Tests run: 5, Failures: 5, Errors: 0, Skipped: 0

https://travis-ci.org/checkpoint-restore/criu/jobs/626600452?utm_medium=notification&utm_source=github_status

@avagin
Member Author

avagin commented Dec 19, 2019

@nidhi1605 could you look at this?

avagin changed the title from "travis/openjdk: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero." to "travis/openj9: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero." on Dec 19, 2019

@nidhi1605
Contributor

I ran the Java functional tests in Docker (kernel version: 4.15.0-72) and they passed.
They also pass when I run them directly (not inside a Docker container) on another system (kernel version: 5.0.0-37-generic). However, when I run them inside a Docker container, the checkpoint fails with this error:

(00.095439) Collecting mappings (pid: 124)
(00.095444) ----------------------------------------
(00.095598) Error (criu/files-reg.c:1335): Can't lookup mount=705 for fd=-3 path=/opt/java/openjdk/bin/java
(00.095611) Error (criu/cr-dump.c:1244): Collect mappings (pid: 124) failed with -1
(00.095671) Unlock network
(00.095678) Unfreezing tasks into 1
(00.095682)     Unseizing 124 into 1
(00.098833) Error (criu/cr-dump.c:1753): Dumping FAILED.

The tests in the Travis jobs also started failing after the runtime kernel version changed from 4.15.0-1040-gcp to 5.0.0-1026-gcp.
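
The "Can't lookup mount=705" line suggests that the mount ID reported for the mapped file was not found in the mount table CRIU had collected. A minimal sketch for checking that by hand, assuming the ID is taken from the `mnt_id:` line of `/proc/<pid>/fdinfo/<fd>` and compared with the first field of each line in `/proc/<pid>/mountinfo`. The helper names and the sample data are hypothetical and not part of CRIU:

```python
def mount_ids(mountinfo_text):
    """Return the set of mount IDs (first field) found in mountinfo text."""
    ids = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            ids.add(int(fields[0]))
    return ids


def fdinfo_mnt_id(fdinfo_text):
    """Return the value of the 'mnt_id:' line from fdinfo text, or None."""
    for line in fdinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "mnt_id":
            return int(value.strip())
    return None


def is_known_mount(fdinfo_text, mountinfo_text):
    """True if the fd's mount ID appears in the given mount table."""
    mnt_id = fdinfo_mnt_id(fdinfo_text)
    return mnt_id is not None and mnt_id in mount_ids(mountinfo_text)


# Hypothetical sample data, shaped like the two /proc files.
SAMPLE_MOUNTINFO = (
    "22 28 0:21 / /sys rw,nosuid shared:7 - sysfs sysfs rw\n"
    "28 0 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw\n"
)
SAMPLE_FDINFO = "pos:\t0\nflags:\t0100000\nmnt_id:\t705\n"

if __name__ == "__main__":
    print(is_known_mount(SAMPLE_FDINFO, SAMPLE_MOUNTINFO))
```

On a live system the two strings would be read from the `/proc` files of the process being dumped. A `False` result would be consistent with the lookup failure in the log, though it does not by itself show where the mismatch comes from.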

@adrianreber
Member

Who is responsible for the kernel 5.0.0-1026-gcp? Is this an Ubuntu kernel or is it from Google? We have seen other errors with Ubuntu kernels that seem to be related to some overlay patches.

@nidhi1605 Do you have the possibility to test this?

@rst0git
Member

rst0git commented Dec 19, 2019

@nidhi1605 When running the tests on Fedora 31 (with sssd.service running), I get the following error:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running TestSuite
Tests are to be run as a privileged user having capabilities mentioned in ReadMe
======= Testing FileRead ========
Starting the java Test
Checkpointing process 655794@rst
======= Testing ReadWrite ========
Starting the java Test
Checkpointing process 655826@rst
======= Testing MemoryMappings ========
Starting the java Test
Checkpointing process 655843@rst
======= Testing MultipleFileRead ========
Starting the java Test
Checkpointing process 655860@rst
======= Testing MultipleFileWrite ========
Starting the java Test
Checkpointing process 655877@rst
Tests run: 5, Failures: 5, Errors: 0, Skipped: 0, Time elapsed: 2.181 sec <<< FAILURE!
runtest(org.criu.java.tests.CheckpointRestore)  Time elapsed: 0.479 sec  <<< FAILURE!
java.lang.AssertionError: FileRead: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/FileRead191219113817
	at org.testng.Assert.fail(Assert.java:89)
	at org.criu.java.tests.CheckpointRestore.checkpoint(CheckpointRestore.java:392)
	at org.criu.java.tests.CheckpointRestore.runtest(CheckpointRestore.java:206)

runtest(org.criu.java.tests.CheckpointRestore)  Time elapsed: 0.303 sec  <<< FAILURE!
java.lang.AssertionError: ReadWrite: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/ReadWrite191219113817
	at org.testng.Assert.fail(Assert.java:89)
	at org.criu.java.tests.CheckpointRestore.checkpoint(CheckpointRestore.java:392)
	at org.criu.java.tests.CheckpointRestore.runtest(CheckpointRestore.java:206)

runtest(org.criu.java.tests.CheckpointRestore)  Time elapsed: 0.341 sec  <<< FAILURE!
java.lang.AssertionError: MemoryMappings: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/MemoryMappings191219113818
	at org.testng.Assert.fail(Assert.java:89)
	at org.criu.java.tests.CheckpointRestore.checkpoint(CheckpointRestore.java:392)
	at org.criu.java.tests.CheckpointRestore.runtest(CheckpointRestore.java:206)

runtest(org.criu.java.tests.CheckpointRestore)  Time elapsed: 0.319 sec  <<< FAILURE!
java.lang.AssertionError: MultipleFileRead: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/MultipleFileRead191219113818
	at org.testng.Assert.fail(Assert.java:89)
	at org.criu.java.tests.CheckpointRestore.checkpoint(CheckpointRestore.java:392)
	at org.criu.java.tests.CheckpointRestore.runtest(CheckpointRestore.java:206)

runtest(org.criu.java.tests.CheckpointRestore)  Time elapsed: 0.309 sec  <<< FAILURE!
java.lang.AssertionError: MultipleFileWrite: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/MultipleFileWrite191219113818
	at org.testng.Assert.fail(Assert.java:89)
	at org.criu.java.tests.CheckpointRestore.checkpoint(CheckpointRestore.java:392)
	at org.criu.java.tests.CheckpointRestore.runtest(CheckpointRestore.java:206)


Results :

Failed tests: 
  CheckpointRestore.runtest:206->checkpoint:392 FileRead: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/FileRead191219113817
  CheckpointRestore.runtest:206->checkpoint:392 ReadWrite: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/ReadWrite191219113817
  CheckpointRestore.runtest:206->checkpoint:392 MemoryMappings: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/MemoryMappings191219113818
  CheckpointRestore.runtest:206->checkpoint:392 MultipleFileRead: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/MultipleFileRead191219113818
  CheckpointRestore.runtest:206->checkpoint:392 MultipleFileWrite: ERROR: Error during checkpoint: exitCode of checkpoint process was not zero.
For more details check dump.log in output/MultipleFileWrite191219113818

Tests run: 5, Failures: 5, Errors: 0, Skipped: 0
$ sudo tail output/FileRead191219113817/dump.log 
(00.074040) cg: All tasks in criu's cgroups. Nothing to dump.
(00.074043) unix: Dumping external sockets
(00.074069) unix: 	Dumping extern: ino 2924159 peer_ino 2924158 family    1 type    1 state  1 name /var/lib/sss/pipes/nss
(00.074076) unix: 	Dumped extern: id 0x2f ino 2924159 peer 0 type 2 state 10 name 23 bytes
(00.074079) unix: 	Runaway socket: ino 2924159 peer_ino 2924158 family    1 type    1 state  1 name /var/lib/sss/pipes/nss
(00.074082) Error (criu/sk-unix.c:808): unix: External socket is used. Consider using --ext-unix-sk option.
(00.074116) Unlock network
(00.074121) Unfreezing tasks into 1
(00.074123) 	Unseizing 655794 into 1
(00.074203) Error (criu/cr-dump.c:1771): Dumping FAILED.

However, when I stop sssd (with systemctl stop sssd) all tests pass.
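
Since every failing test points at its own dump.log, it can help to pull just the error lines out of each one. A minimal sketch, assuming the `(timestamp) Error (file:line): message` layout seen in the excerpts in this thread. The function name and the regular expression are illustrative and not part of CRIU or the test suite:

```python
import re

# Matches lines such as:
# (00.074082) Error (criu/sk-unix.c:808): unix: External socket is used. ...
ERROR_RE = re.compile(
    r"^\((?P<time>\d+\.\d+)\)\s+Error "
    r"\((?P<file>[^:()]+):(?P<line>\d+)\):\s*(?P<msg>.*)$"
)


def extract_errors(log_text):
    """Return (source_file, source_line, message) for each Error line."""
    errors = []
    for raw in log_text.splitlines():
        match = ERROR_RE.match(raw.strip())
        if match:
            errors.append(
                (match.group("file"), int(match.group("line")), match.group("msg"))
            )
    return errors


# Lines copied from the dump.log excerpt above.
SAMPLE_LOG = """\
(00.074079) unix: Runaway socket: ino 2924159 peer_ino 2924158 family    1 type    1 state  1 name /var/lib/sss/pipes/nss
(00.074082) Error (criu/sk-unix.c:808): unix: External socket is used. Consider using --ext-unix-sk option.
(00.074116) Unlock network
(00.074203) Error (criu/cr-dump.c:1771): Dumping FAILED.
"""

if __name__ == "__main__":
    for source_file, source_line, message in extract_errors(SAMPLE_LOG):
        print(f"{source_file}:{source_line}: {message}")
```

The first error reported is usually the one to look at, since the final "Dumping FAILED." line only records that the dump was aborted.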

@nidhi1605
Contributor

nidhi1605 commented Dec 19, 2019

When the tests run on runtime kernel version 5.0.0-1026-gcp, dump.log shows:

(00.170772) netlink: Collect netlink sock 0xcac1
(00.170774) netlink: Collect netlink sock 0xcac2
(00.170840) ========================================
(00.170844) Dumping task (pid: 101)
(00.170846) ========================================
(00.170848) Obtaining task stat ... 
(00.170881) 
(00.170884) Collecting mappings (pid: 101)
(00.170886) ----------------------------------------
(00.171020) Error (criu/files-reg.c:1338): Can't lookup mount=206 for fd=-3 path=/opt/java/openjdk/bin/java
(00.171029) Error (criu/cr-dump.c:1243): Collect mappings (pid: 101) failed with -1
(00.171071) Unlock network
(00.171075) Unfreezing tasks into 1
(00.171078) 	Unseizing 101 into 1
(00.171165) Error (criu/cr-dump.c:1771): Dumping FAILED

https://travis-ci.org/nidhi1605/criu/jobs/627344240#L2644

@adrianreber
Member

This seems to be related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1857257

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@adrianreber
Member

Fixed (better: worked around) with commit 79559be

4 participants