InstantOn: Unable to interrupt task: 205 (Operation not permitted) #352

Closed
bmarwell opened this issue Oct 27, 2022 · 10 comments

@bmarwell

bmarwell commented Oct 27, 2022

Hey everyone!

First of all thanks for InstantOn. I think it is a GREAT technology!

I prepared my application to use CRIU: https://github.com/bmarwell/openliberty-content-negotiation-example/tree/54faec642e27d0f12c976f12f68ccc292b0632cd
I used this guide: https://openliberty.io/blog/2022/09/29/instant-on-beta.html#app-image

mvn package -Ddockerize

Image building works just fine. The original image is being built by the k8s-maven-plugin, so you only need to execute a few commands afterwards, mainly:

podman run --name olcr-checkpoint-container --privileged --env WLP_CHECKPOINT=applications openlibertycontentrenegotiation/olcr-app-ol-docker:latest | grep -E '^{' | jq -s

Because the server logs JSON, I piped the output through jq (omit it if you prefer).
There are a few errors:

Performing checkpoint --at=applications

WARNING: Unknown module: jdk.management.agent specified to --add-exports
WARNING: Unknown module: jdk.attach specified to --add-exports
[
  {
    "type": "liberty_message",
    "host": "1b4e422ab50d",
    "ibm_userDir": "/opt/ol/wlp/usr/",
    "ibm_serverName": "defaultServer",
    "message": "CWWKC0451I: A server checkpoint was requested. When the checkpoint completes, the server stops.",
    "ibm_threadId": "00000031",
    "ibm_datetime": "2022-10-27T12:24:26.687+0000",
    "ibm_messageId": "CWWKC0451I",
    "module": "io.openliberty.checkpoint.internal.CheckpointImpl",
    "loglevel": "AUDIT",
    "ibm_sequence": "1666873466687_0000000000013",
    "ext_thread": "Default Executor-thread-1"
  },
  {
    "type": "liberty_ffdc",
    "host": "1b4e422ab50d",
    "ibm_userDir": "/opt/ol/wlp/usr/",
    "ibm_serverName": "defaultServer",
    "ibm_datetime": "2022-10-27T12:24:27.808+0000",
    "message": "Could not dump the JVM processs, err=-52",
    "ibm_className": "io.openliberty.checkpoint.internal.CheckpointImpl",
    "ibm_exceptionName": "io.openliberty.checkpoint.internal.criu.CheckpointFailedException",
    "ibm_probeID": "341",
    "ibm_threadId": "00000031",
    "ibm_stackTrace": "io.openliberty.checkpoint.internal.criu.CheckpointFailedException: Could not dump the JVM processs, err=-52\n\tat io.openliberty.checkpoint.internal.openj9.ExecuteCRIU_OpenJ9.dump(ExecuteCRIU_OpenJ9.java:61)\n\tat io.openliberty.checkpoint.internal.CheckpointImpl.checkpoint(CheckpointImpl.java:401)\n\tat io.openliberty.checkpoint.internal.CheckpointImpl.checkpointOrExitOnFailure(CheckpointImpl.java:340)\n\tat io.openliberty.checkpoint.internal.CheckpointImpl.check(CheckpointImpl.java:335)\n\tat java.base/java.util.ArrayList.forEach(Unknown Source)\n\tat com.ibm.ws.kernel.feature.internal.FeatureManager.checkServerReady(FeatureManager.java:823)\n\tat com.ibm.ws.kernel.feature.internal.FeatureManager.update(FeatureManager.java:786)\n\tat com.ibm.ws.kernel.feature.internal.FeatureManager.processFeatureChanges(FeatureManager.java:886)\n\tat com.ibm.ws.kernel.feature.internal.FeatureManager$1.run(FeatureManager.java:672)\n\tat com.ibm.ws.threading.internal.ExecutorServiceImpl$RunnableWrapper.run(ExecutorServiceImpl.java:245)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: org.eclipse.openj9.criu.SystemCheckpointException: Could not dump the JVM processs, err=-52\n\tat openj9.criu/org.eclipse.openj9.criu.CRIUSupport.checkpointJVMImpl(Native Method)\n\tat openj9.criu/org.eclipse.openj9.criu.CRIUSupport.checkpointJVM(Unknown Source)\n\tat io.openliberty.checkpoint.internal.openj9.ExecuteCRIU_OpenJ9.dump(ExecuteCRIU_OpenJ9.java:57)\n\t... 12 more\n",
    "ibm_objectDetails": "Object type = io.openliberty.checkpoint.internal.CheckpointImpl\n  SUPPORTED_FEATURES = class java.util.Collections$UnmodifiableSet@5c6a054e\n    serialVersionUID = \"/* Could not access serialVersionUID */\"\n    serialVersionUID = \"/* Could not access serialVersionUID */\"\n    c = \"/* Could not access c */\"\n  CHECKPOINT_STUB_CRIU = \"io.openliberty.checkpoint.stub.criu\"\n  CHECKPOINT_CRIU_UNPRIVILEGED = \"io.openliberty.checkpoint.criu.unprivileged\"\n  CHECKPOINT_ALLOWED_FEATURES = \"io.openliberty.checkpoint.allowed.features\"\n  CHECKPOINT_ALLOWED_FEATURES_ALL = \"ALL_FEATURES\"\n  CHECKPOINT_PAUSE_RESTORE = \"io.openliberty.checkpoint.pause.restore\"\n  HOOKS_REF_NAME_SINGLE_THREAD = \"hooksSingleThread\"\n  HOOKS_REF_NAME_MULTI_THREAD = \"hooksMultiThread\"\n  DIR_CHECKPOINT = \"checkpoint/\"\n  FILE_RESTORE_MARKER = \"checkpoint/.restoreMarker\"\n  FILE_RESTORE_FAILED_MARKER = \"checkpoint/.restoreFailedMarker\"\n  FILE_ENV_PROPERTIES = \"checkpoint/.env.properties\"\n  DIR_CHECKPOINT_IMAGE = \"checkpoint/image/\"\n  CHECKPOINT_LOG_FILE = \"checkpoint.log\"\n  tc = class com.ibm.websphere.ras.TraceComponent@3ff9b463\n    strings[0] = \"TraceComponent[io.openliberty.checkpoint.internal.CheckpointImpl,class io.openliberty.checkpoint.internal.CheckpointImpl,[checkpoint],io.openliberty.checkpoint.resources.CheckpointMessages,null]\"\n  allowedFeatures = class java.util.Collections$EmptySet@59cfdf86\n    serialVersionUID = \"/* Could not access serialVersionUID */\"\n  cc = class org.apache.felix.scr.impl.manager.ComponentContextImpl@3907de62\n    m_componentManager = class org.apache.felix.scr.impl.manager.SingleComponentManager@8649fcc6\n      m_useCount = class java.util.concurrent.atomic.AtomicInteger@14aec2dd\n      m_componentContext = class org.apache.felix.scr.impl.manager.ComponentContextImpl@3907de62\n      m_configurationProperties = class java.util.HashMap@6646af3e\n      m_factoryProperties = null\n      m_properties 
= class java.util.HashMap@507dac4c\n      m_serviceProperties = null\n      REASONS = class java.lang.String[7]\n      m_container = class ...",
    "ibm_sequence": "1666873467808_0000000000001"
  },
  {
    "type": "liberty_message",
    "host": "1b4e422ab50d",
    "ibm_userDir": "/opt/ol/wlp/usr/",
    "ibm_serverName": "defaultServer",
    "message": "CWWKC0453E: The server checkpoint request failed with the following message: Could not dump the JVM processs, err=-52",
    "ibm_threadId": "00000031",
    "ibm_datetime": "2022-10-27T12:24:27.854+0000",
    "ibm_messageId": "CWWKC0453E",
    "module": "io.openliberty.checkpoint.internal.CheckpointImpl",
    "loglevel": "ERROR",
    "ibm_sequence": "1666873467854_0000000000015",
    "ext_thread": "Default Executor-thread-1"
  },
  {
    "type": "liberty_message",
    "host": "1b4e422ab50d",
    "ibm_userDir": "/opt/ol/wlp/usr/",
    "ibm_serverName": "defaultServer",
    "message": "CWWKE0084I: The server defaultServer is stopping because thread Checkpoint failed, exiting... (0000004b) called the method java.lang.System.exit: \n\tat java.base/java.lang.System.exit(Unknown Source)\n\tat io.openliberty.checkpoint.internal.CheckpointImpl.lambda$checkpointOrExitOnFailure$1(CheckpointImpl.java:355)\n\tat java.base/java.lang.Thread.run(Unknown Source)\n",
    "ibm_threadId": "00000020",
    "ibm_datetime": "2022-10-27T12:24:27.857+0000",
    "ibm_messageId": "CWWKE0084I",
    "module": "com.ibm.ws.kernel.launch.internal.FrameworkManager",
    "loglevel": "AUDIT",
    "ibm_sequence": "1666873467857_0000000000017",
    "ext_thread": "WS-ShutdownHook"
  }
]

To see the contents of the checkpoint log, I committed the container and ran cat /logs/checkpoint/checkpoint.log, which yields:

Warn  (compel/src/lib/infect.c:126): Unable to interrupt task: 205 (Operation not permitted)
Error (criu/proc_parse.c:350): Failed to resolve mapping 55cf16c05000 filename
Error (criu/proc_parse.c:641): Can't open 134's mapfile link 55cf16c05000: Operation not permitted
Error (criu/cr-dump.c:1517): Collect mappings (pid: 134) failed with -1
Error (criu/cr-dump.c:2048): Dumping FAILED.
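
The same log can also be copied out of the stopped container without committing it first. This is a minimal sketch, assuming the container name used in the run command above; the helper name fetch_checkpoint_log and the default destination file are illustrative and not part of Open Liberty or podman.

```shell
#!/bin/sh
# Copy the CRIU checkpoint log out of a (possibly stopped) container.
# fetch_checkpoint_log is a hypothetical helper name; the log path is the one
# reported in this issue.
fetch_checkpoint_log() {
  container="$1"
  dest="${2:-./checkpoint.log}"
  podman cp "${container}:/logs/checkpoint/checkpoint.log" "$dest"
}
```

For example: fetch_checkpoint_log olcr-checkpoint-container && cat checkpoint.log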

On my host machine I am running Manjaro Linux (similar to Arch). I have podman and podman-docker installed, and I started a podman service (as needed by k8s-maven-plugin; see the DOCKER_HOST environment variable).

I hope I can help fix this problem before InstantOn gets out of beta.
Thanks!

Dockerfile: https://github.com/bmarwell/openliberty-content-negotiation-example/blob/54faec642e27d0f12c976f12f68ccc292b0632cd/app/openliberty/docker/Dockerfile
server.xml: https://github.com/bmarwell/openliberty-content-negotiation-example/blob/54faec642e27d0f12c976f12f68ccc292b0632cd/app/openliberty/docker/src/main/docker/config/server.xml

@leochr
Member

leochr commented Oct 27, 2022

@tjwatson @mbroz2 could you please look into this? Thanks

@tjwatson
Member

Performing checkpoint --at=applications

WARNING: Unknown module: jdk.management.agent specified to --add-exports
WARNING: Unknown module: jdk.attach specified to --add-exports

You can safely ignore these; they should be fixed in the next beta.

Quick question. Are you doing this as a non-root user?

@bmarwell
Author

Yes, as a non-root user. My guess is that this is not an issue with the OL image; I did find some threads online pointing to kernel changes.
I mostly followed the Arch Linux guidelines (https://wiki.archlinux.org/title/Podman) and installed the latest CRIU package from the AUR. I also applied the rootless settings from the wiki.
Every other command I executed as user=1000 is written down in the branch's readme: https://github.com/bmarwell/openliberty-content-negotiation-example/tree/docker_with_criu#enable-open-liberty-instanton

@tjwatson
Member

You should not need CRIU installed on the host system to use Liberty InstantOn with our container images.

I expect the issue is that rootless podman cannot successfully grant the running container the necessary capabilities. It would be useful to know whether this also fails for you when not using rootless podman.
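
One way to confirm which mode podman is running in is to ask podman itself. This is a minimal sketch, assuming the installed podman exposes the Host.Security.Rootless field in its podman info output; the helper name podman_mode is illustrative.

```shell
#!/bin/sh
# Report whether podman is running in rootless mode.
# podman_mode is a hypothetical helper name.
podman_mode() {
  if [ "$(podman info --format '{{.Host.Security.Rootless}}')" = "true" ]; then
    echo "rootless"
  else
    echo "rootful"
  fi
}
```

Running podman_mode as the regular user and again as root shows whether the two invocations really differ.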

@bmarwell
Author

Hi Thomas!

I can confirm that it works flawlessly when:

  1. Building without k8s-maven-plugin
  2. Then building the image as root using podman build --build-arg JKUBE_DEFAULT_ASSEMBLY=. . -t openlibertycontentrenegotiation/olcr-app-ol-docker:latest
  3. Run as root: podman run --name olcr-checkpoint-container --privileged --env WLP_CHECKPOINT=applications openlibertycontentrenegotiation/olcr-app-ol-docker:latest
  4. commit as root using podman commit olcr-checkpoint-container oclr-instanton
  5. start as root using podman run --privileged oclr-instanton:latest (add port if needed).
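
The build, checkpoint and commit steps above can be sketched as one script. This is only an outline, assuming it is run as root from the directory that contains the Dockerfile; the helper name instanton_image is illustrative, and the image, container and tag names are the ones used in this thread.

```shell
#!/bin/sh
# Sketch of the rootful workflow: build, checkpoint, commit. Run it as root.
# instanton_image is a hypothetical helper name.
instanton_image() {
  app_image="openlibertycontentrenegotiation/olcr-app-ol-docker:latest"
  checkpoint_container="olcr-checkpoint-container"
  instanton_tag="oclr-instanton"

  # Build the application image from the current directory.
  podman build --build-arg JKUBE_DEFAULT_ASSEMBLY=. . -t "$app_image" || return 1

  # Start the server once so that it checkpoints after the applications start.
  # The server stops when the checkpoint completes.
  podman run --name "$checkpoint_container" --privileged \
    --env WLP_CHECKPOINT=applications "$app_image" || return 1

  # Persist the stopped container, including the checkpoint, as a new image.
  podman commit "$checkpoint_container" "$instanton_tag" || return 1
}
```

After instanton_image succeeds, the restored server can be started with podman run --privileged oclr-instanton:latest.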

Result:

{"type":"liberty_message","host":"62e56eec1932","ibm_userDir":"\/opt\/ol\/wlp\/usr\/","ibm_serverName":"defaultServer","message":"CWWKZ0001I: Application olcr-web-restv1-1.0.0-SNAPSHOT started in 0.179 seconds.","ibm_threadId":"00000031","ibm_datetime":"2022-10-29T19:19:49.942+0000","ibm_messageId":"CWWKZ0001I","module":"com.ibm.ws.app.manager.AppMessageHelper","loglevel":"AUDIT","ibm_sequence":"1667071189942_0000000000014","ext_thread":"Default Executor-thread-1"}
{"type":"liberty_message","host":"62e56eec1932","ibm_userDir":"\/opt\/ol\/wlp\/usr\/","ibm_serverName":"defaultServer","message":"CWWKC0452I: The Liberty server process resumed operation from a checkpoint in 0.190 seconds.","ibm_threadId":"00000031","ibm_datetime":"2022-10-29T19:19:49.954+0000","ibm_messageId":"CWWKC0452I","module":"io.openliberty.checkpoint.internal.CheckpointImpl","loglevel":"AUDIT","ibm_sequence":"1667071189954_0000000000015","ext_thread":"Default Executor-thread-1"}
{"type":"liberty_message","host":"62e56eec1932","ibm_userDir":"\/opt\/ol\/wlp\/usr\/","ibm_serverName":"defaultServer","message":"CWWKF0012I: The server installed the following features: [cdi-2.0, checkpoint-1.0, jaxrs-2.1, jaxrsClient-2.1, jndi-1.0, jsonb-1.0, jsonp-1.1, monitor-1.0, servlet-4.0].","ibm_threadId":"00000031","ibm_datetime":"2022-10-29T19:19:49.969+0000","ibm_messageId":"CWWKF0012I","module":"com.ibm.ws.kernel.feature.internal.FeatureManager","loglevel":"AUDIT","ibm_sequence":"1667071189969_0000000000017","ext_thread":"Default Executor-thread-1"}
{"type":"liberty_message","host":"62e56eec1932","ibm_userDir":"\/opt\/ol\/wlp\/usr\/","ibm_serverName":"defaultServer","message":"CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 0.207 seconds.","ibm_threadId":"00000031","ibm_datetime":"2022-10-29T19:19:49.971+0000","ibm_messageId":"CWWKF0011I","module":"com.ibm.ws.kernel.feature.internal.FeatureManager","loglevel":"AUDIT","ibm_sequence":"1667071189971_0000000000019","ext_thread":"Default Executor-thread-1"}

Thanks!
I think this issue can be closed then. non-root support will probably be added later, I guess?

@tjwatson
Member

5. start as root using podman run --privileged oclr-instanton:latest (add port if needed).

If you are on a kernel that has the clone3 system call, you should be able to start (as root) without --privileged with something like this:

podman run \
  --cap-add=CHECKPOINT_RESTORE \
  --cap-add=NET_ADMIN \
  --cap-add=SYS_PTRACE \
  ...

If your system doesn't have clone3, then you need to mount ns_last_pid from proc:

podman run \
  --cap-add=CHECKPOINT_RESTORE \
  --cap-add=NET_ADMIN \
  --cap-add=SYS_PTRACE \
  -v /proc/sys/kernel/ns_last_pid:/proc/sys/kernel/ns_last_pid \
  ...
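
The choice between the two flag sets can be sketched as a small helper. This is a heuristic only: it assumes clone3 with the set_tid field is available from Linux 5.5 onwards, which is an assumption not stated in this thread, and vendor kernels may backport or omit it. The helper name restore_flags is illustrative.

```shell
#!/bin/sh
# Print the podman flags suggested in this thread for restoring without --privileged.
# restore_flags is a hypothetical helper. The 5.5 threshold is an assumption
# (clone3 with set_tid); vendor kernels may differ.
restore_flags() {
  release="${1:-$(uname -r)}"
  major="${release%%.*}"          # text before the first dot
  rest="${release#*.}"            # text after the first dot
  minor="${rest%%[!0-9]*}"        # leading digits of the remainder
  flags="--cap-add=CHECKPOINT_RESTORE --cap-add=NET_ADMIN --cap-add=SYS_PTRACE"
  if [ "$major" -lt 5 ] || { [ "$major" -eq 5 ] && [ "${minor:-0}" -lt 5 ]; }; then
    # Older kernel: also mount ns_last_pid from proc.
    flags="$flags -v /proc/sys/kernel/ns_last_pid:/proc/sys/kernel/ns_last_pid"
  fi
  echo "$flags"
}
```

It could be used as podman run $(restore_flags) oclr-instanton:latest, with the substitution left unquoted on purpose so that the flags are split into separate arguments.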

On RHEL, podman seems to grant (by default) all the system calls CRIU needs. If you have trouble with the above, it may be that your system does not. In that case you need to use the --security-opt seccomp= option and pass a file that grants all the system calls required by CRIU to restore. See the last section of https://openliberty.io/blog/2022/09/29/instant-on-beta.html, titled "Running with an unprivileged container with confined security".
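
A minimal sketch of passing such a seccomp file, assuming a profile has already been written as described in the blog post. The helper name run_confined is illustrative, and the profile path is supplied by the caller; the contents of the profile are not shown here.

```shell
#!/bin/sh
# Start a restored container with the capabilities from this thread plus a
# custom seccomp profile. run_confined is a hypothetical helper name.
run_confined() {
  image="$1"
  profile="$2"
  # Fail early if the profile file is missing or unreadable.
  if [ ! -r "$profile" ]; then
    echo "seccomp profile not readable: $profile" >&2
    return 1
  fi
  podman run \
    --cap-add=CHECKPOINT_RESTORE \
    --cap-add=NET_ADMIN \
    --cap-add=SYS_PTRACE \
    --security-opt "seccomp=$profile" \
    "$image"
}
```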

Note that even though you are using podman as root, the restored Java process running in the container is not running as root.

I think this issue can be closed then. non-root support will probably be added later, I guess?

We can leave this open while we investigate whether rootless podman is possible. Somehow containers will need to be granted the above capabilities successfully when launched this way. We are looking to remove the NET_ADMIN requirement, but CHECKPOINT_RESTORE and SYS_PTRACE are unavoidable for restoring the process.
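
Whether a container was actually granted these capabilities can be checked from inside it by decoding the effective capability mask. This is a minimal sketch, assuming a Linux /proc filesystem and the standard capability numbers (NET_ADMIN 12, SYS_PTRACE 19, CHECKPOINT_RESTORE 40); the helper names has_cap and report_caps are illustrative.

```shell
#!/bin/sh
# has_cap MASK BIT: succeed if bit BIT is set in the hexadecimal capability mask.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Report the capabilities discussed in this thread for a given mask
# (defaults to the CapEff mask of the current process).
report_caps() {
  mask="${1:-$(awk '/^CapEff:/ {print $2}' /proc/self/status)}"
  for entry in NET_ADMIN:12 SYS_PTRACE:19 CHECKPOINT_RESTORE:40; do
    name="${entry%%:*}"
    bit="${entry##*:}"
    if has_cap "$mask" "$bit"; then
      echo "$name yes"
    else
      echo "$name no"
    fi
  done
}
```

Running report_caps with no argument inside the container prints one line per capability.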

@tjwatson
Member

tjwatson commented Nov 1, 2022

Given how containers/podman#7866 was handled I do not expect us to be able to use rootless podman to restore in-container.

@bmarwell
Author

Understood! Will revert to docker then. Thanks for the information.
I will close this issue, but feel free to reopen it if you would like to use it for documentation.

@bmarwell
Author

bmarwell commented Mar 11, 2023

@tjwatson
Member

The OpenJ9 team worked on getting support for cap_checkpoint_restore into CRIU (see checkpoint-restore/criu#1930). That is what we use to restore the process in-container without needing a privileged container. The issue is that rootless podman does not allow such capabilities to be granted to a container by a non-root user.

With the latest Docker, this is allowed when using the Docker daemon. The latest Docker release supports passing the cap_checkpoint_restore capability when running a container, so that is an option for non-root usage.
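
A minimal sketch of what that could look like, assuming a Docker release that recognizes the CHECKPOINT_RESTORE capability and a user who is allowed to talk to the Docker daemon. The capability list mirrors the podman example earlier in this thread; the helper name run_restored_docker is illustrative, and Docker's default seccomp profile may still need the --security-opt seccomp= treatment described above.

```shell
#!/bin/sh
# Start a restored InstantOn image through the Docker daemon without --privileged.
# run_restored_docker is a hypothetical helper name.
run_restored_docker() {
  image="$1"
  docker run \
    --cap-add=CHECKPOINT_RESTORE \
    --cap-add=NET_ADMIN \
    --cap-add=SYS_PTRACE \
    "$image"
}
```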
