Compatibility with JVM checkpoint restore (OpenJDK's Project CRaC) #29921

Closed
jhoeller opened this issue Feb 3, 2023 · 2 comments


jhoeller commented Feb 3, 2023

Project CRaC introduces a mechanism for taking a JVM checkpoint snapshot (typically after startup) and then restoring from that checkpoint image for further deployment purposes, reducing the startup time.

Spring Boot on Tomcat is a target scenario for CRaC already. Spring applications are natural candidates for checkpoints after startup (plus some warming up through initial requests).

A couple of specific requirements need to be addressed: in particular the closing of file handles and network connections at checkpoint time plus subsequent restoring of those handles, as well as the refreshing of cached host metadata in a restored JVM. CRaC provides a Resource API for registering corresponding beforeCheckpoint/afterRestore callbacks.
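
The close-and-reopen requirement can be sketched roughly as follows. This is a minimal illustration only: `CheckpointResource` is a hypothetical stand-in for `org.crac.Resource` (simplified here to no-argument callbacks, whereas the real callbacks also receive a context argument), and `ReopenableLog` is an invented example component, not part of Spring or CRaC.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for org.crac.Resource, simplified to no-arg callbacks.
interface CheckpointResource {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Hypothetical component that owns a file handle.
class ReopenableLog implements CheckpointResource {
    private final Path path;
    private FileChannel channel;

    ReopenableLog(Path path) throws IOException {
        this.path = path;
        this.channel = open(path);
    }

    private static FileChannel open(Path path) throws IOException {
        return FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    synchronized void write(String line) throws IOException {
        ByteBuffer buffer = ByteBuffer.wrap((line + "\n").getBytes(StandardCharsets.UTF_8));
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
    }

    synchronized boolean isOpen() {
        return channel.isOpen();
    }

    @Override
    public synchronized void beforeCheckpoint() throws IOException {
        channel.close(); // release the file handle at checkpoint time
    }

    @Override
    public synchronized void afterRestore() throws IOException {
        channel = open(path); // reacquire the handle in the restored JVM
    }
}
```

With the actual API, such an object would be registered with CRaC's global context so that the callbacks are invoked around the checkpoint; the registration step is omitted here to keep the sketch free of external dependencies.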

From the Spring Framework side, we intend to revisit our Lifecycle contract where the existing stop/start mechanism implies the suspension of application-internal async processing and messaging resources already. We could narrow those semantics so that stop/start becomes a good citizen in a checkpoint/restore scenario, implying CRaC-compatible handling of resources in Spring-managed beans. This can then be triggered through a single ConfigurableApplicationContext.stop/start call which propagates to all contained beans, e.g. as part of a central CRaC Resource adapter in Spring Boot.
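
A rough sketch of that single-adapter idea, under stated assumptions: `CracResource`, `StartStop` and `MiniContext` are hypothetical stand-ins for `org.crac.Resource`, the start/stop part of Spring's `Lifecycle`, and `ConfigurableApplicationContext` respectively, so that the example runs without external dependencies. It is not the eventual Spring implementation, and the reverse-order stop is a simplification chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for org.crac.Resource, simplified to no-arg callbacks.
interface CracResource {
    void beforeCheckpoint() throws Exception;
    void afterRestore() throws Exception;
}

// Hypothetical stand-in for the start/stop part of Spring's Lifecycle contract.
interface StartStop {
    void start();
    void stop();
}

// Hypothetical stand-in for an application context that propagates
// start/stop to every bean it contains.
class MiniContext implements StartStop {
    private final List<StartStop> beans = new ArrayList<>();

    void register(StartStop bean) {
        beans.add(bean);
    }

    @Override
    public void start() {
        for (StartStop bean : beans) {
            bean.start();
        }
    }

    @Override
    public void stop() {
        // Stop in reverse registration order (a simplification for this sketch).
        for (int i = beans.size() - 1; i >= 0; i--) {
            beans.get(i).stop();
        }
    }
}

// The central adapter: one checkpoint resource drives the whole context.
class ContextCheckpointAdapter implements CracResource {
    private final StartStop context;

    ContextCheckpointAdapter(StartStop context) {
        this.context = context;
    }

    @Override
    public void beforeCheckpoint() {
        context.stop(); // beans release their resources before the snapshot
    }

    @Override
    public void afterRestore() {
        context.start(); // beans reacquire their resources in the restored JVM
    }
}

// Example bean that records the lifecycle calls it receives.
class RecordingBean implements StartStop {
    private final String name;
    private final List<String> events;

    RecordingBean(String name, List<String> events) {
        this.name = name;
        this.events = events;
    }

    @Override
    public void start() {
        events.add(name + ":start");
    }

    @Override
    public void stop() {
        events.add(name + ":stop");
    }
}
```

The point of the design is that individual beans only need to honor stop/start; they do not have to know about CRaC at all.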

@jhoeller jhoeller added in: core Issues in core modules (aop, beans, core, context, expression) type: enhancement A general enhancement labels Feb 3, 2023
@jhoeller jhoeller added this to the 6.1.x milestone Feb 3, 2023
@jhoeller jhoeller self-assigned this Feb 3, 2023
@jhoeller jhoeller modified the milestones: 6.1.x, 6.1.0-M1 Mar 30, 2023

tzolov commented Apr 13, 2023

I've been testing CRaC in the context of Spring and Spring Integration.

For the tests I've put together a generic CRaCAdapter auto-configuration that internally leverages ConfigurableApplicationContext.stop/start, and built a CRaC container image with Ubuntu 22.04 and the latest CRaC JVM preinstalled.
A pre-built version of the image is also available at: tzolov/java_17_crac:latest.

Then I tried the CRaCAdapter with a few existing Spring Integration (SI) samples:

  • file-split-ftp

    The run instructions show how to run the application, create a checkpoint and then re-run from the restored checkpoint.

    It appears to work as expected. Apart from the embedded Tomcat issue, it works fine when Tomcat is replaced by Jetty.

  • kafka-dsl
    Repeating the same test with this long-running application reveals an important limitation of the current CRaC implementation: it does not provide any mechanism to coordinate multiple threads.
    As a result, when restoring from a checkpoint, CRaC will start the main thread before the Resource afterRestore methods have completed.
    For the kafka-dsl sample, the restored application starts trying to send Kafka messages before afterRestore has completed, i.e. while the Spring context hasn't started yet and the Kafka connections haven't been reestablished.
    As expected, this fails.

    I started a related discussion on the CRaC mailing list (here is a sample crac-demo to illustrate the issue).
    Radim's and Dan's responses are very interesting, though a bit beyond my depth.

    Also, as a result of the discussion, this PR has been submitted: RCU Lock - RW lock with very lightweight read- and heavyweight write-locking openjdk/crac#58
    The RCULock is an option for ensuring safe checkpoint creation and restoration, but it still requires application modifications and is not without performance cost.
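
To make the coordination problem concrete, here is a minimal sketch of the read/write-lock idea. It does not use the RCULock from the PR; it substitutes the JDK's `java.util.concurrent.locks.StampedLock` as a plain read/write lock, and `RestoreGate` is a hypothetical helper invented for illustration. Worker threads wrap each unit of work (e.g. a Kafka send) in the read side, while the checkpoint callbacks hold the write side from beforeCheckpoint until the restart work in afterRestore has finished.

```java
import java.util.concurrent.locks.StampedLock;

// Hypothetical gate that keeps application threads out while a
// checkpoint/restore cycle is in progress.
class RestoreGate {
    private final StampedLock lock = new StampedLock();
    private volatile long writeStamp;

    // Called by application threads around each unit of work.
    void whileOpen(Runnable work) {
        long stamp = lock.readLock(); // blocks while the gate is closed
        try {
            work.run();
        } finally {
            lock.unlockRead(stamp);
        }
    }

    // Called from the checkpoint callback: waits for in-flight work, then closes the gate.
    void beforeCheckpoint() {
        writeStamp = lock.writeLock();
    }

    // Called from the restore callback: reopens the gate only after the restart work is done.
    void afterRestore(Runnable restart) {
        try {
            restart.run();
        } finally {
            lock.unlockWrite(writeStamp);
        }
    }

    boolean isClosed() {
        return lock.isWriteLocked();
    }
}
```

This shows both drawbacks mentioned above: application code has to be changed to route its work through the gate, and every unit of work pays for a read-lock acquisition.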

@rishiraj88

Thanks, @tzolov , for the descriptive comment. It's quite comprehensive and useful when perused.

@jhoeller jhoeller changed the title Compatibility with JVM snapshots (OpenJDK's Project CRaC) Compatibility with JVM checkpoint restore (OpenJDK's Project CRaC) May 11, 2023
sdeleuze added a commit to sdeleuze/spring-framework that referenced this issue May 12, 2023
This commit:
 - Refine the wording used in logs
 - Avoid calling awaitPreventShutdownBarrier() in afterRestore()
 - Add logs to print the restart duration

See spring-projectsgh-29921
sdeleuze added a commit to sdeleuze/spring-framework that referenced this issue May 12, 2023
This commit:
 - Refine the wording used in logs and Javadoc
 - Avoid calling awaitPreventShutdownBarrier() in afterRestore()
 - Add logs to print the restart duration

See spring-projectsgh-29921

4 participants