Subprocess report generation #286
Conversation
Force-pushed from ac90a8d to fd21a4d
The generator can connect back to its own container when running in OpenShift, but this still isn't working when using … Still needs some cleanup work; then 1) the subprocess max heap size should be determined by some runtime heuristics from the parent ContainerJFR, and 2) the subprocess needs some way to be informed of and load the set of report transformers to apply.
Part 1 probably can't be properly addressed until we can switch to a new base image with JDK 11.0.9+: https://bugs.openjdk.java.net/browse/JDK-8226575 . Once we are on that base, the OperatingSystemMXBean can tell us how much headroom we have for report generation at the time of the request.
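To illustrate the heuristic described above, here is a minimal sketch of sizing the subprocess heap from free physical memory via `OperatingSystemMXBean`. The class name, the reserve figure, and the minimum floor are all assumptions for illustration, not ContainerJFR's actual values; on a container-aware JDK (11.0.9+), these figures reflect the cgroup memory limit.

```java
import java.lang.management.ManagementFactory;

public class SubprocessHeapSizer {
    // Hypothetical headroom reserved for the parent ContainerJFR process.
    private static final long PARENT_RESERVE_MB = 100;
    // Hypothetical floor below which report generation is unlikely to succeed anyway.
    private static final long MIN_HEAP_MB = 50;

    static long subprocessMaxHeapMb() {
        // The com.sun.management subinterface exposes physical memory figures
        // that the base java.lang.management interface does not.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();
        long freeMb = os.getFreePhysicalMemorySize() / (1024 * 1024);
        return Math.max(MIN_HEAP_MB, freeMb - PARENT_RESERVE_MB);
    }

    public static void main(String[] args) {
        // Emit a -Xmx flag suitable for the subprocess command line.
        System.out.println("-Xmx" + subprocessMaxHeapMb() + "M");
    }
}
```

The key point is that the measurement happens at request time, so the cap tracks whatever memory is actually available when the report is generated.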
Force-pushed from 9b2bd69 to 476957c
Report transformation handling is now implemented. The parent process serializes the set of transformers (as a newline-delimited list of fully-qualified transformer class names) into a file; the subprocess reads this file and recreates the transformers by loading each class from the classpath and invoking its default no-args constructor. The subprocess then truncates and reuses the same file to write out the report contents before exiting.
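The hand-off described above can be sketched roughly as follows. This is an illustrative reconstruction, not ContainerJFR's actual classes; the file-name and method names are assumptions, but the mechanism matches the description: newline-delimited FQCNs written by the parent, read back and instantiated reflectively by the subprocess.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class TransformerFile {

    // Parent side: serialize the transformer set as newline-delimited
    // fully-qualified class names.
    static void writeTransformers(Path file, List<Class<?>> transformers) throws IOException {
        List<String> names = new ArrayList<>();
        for (Class<?> c : transformers) {
            names.add(c.getName());
        }
        Files.write(file, names);
    }

    // Subprocess side: read the class names back, load each class from the
    // classpath, and instantiate it via the default no-args constructor.
    static List<Object> readTransformers(Path file) throws Exception {
        List<Object> result = new ArrayList<>();
        for (String name : Files.readAllLines(file)) {
            if (name.isBlank()) {
                continue;
            }
            result.add(Class.forName(name).getDeclaredConstructor().newInstance());
        }
        // The subprocess can now truncate and reuse this same file to write
        // out the generated report before exiting.
        return result;
    }
}
```

Reusing the file both ways keeps the parent-subprocess contract down to a single path passed on the command line.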
Force-pushed from 8d2d64c to 23f9596
Force-pushed from 23f9596 to 9f7f939
Just noticed that this fails integration tests: … I'm looking into it.
Bisection shows the first bad commit appears to be …
Here are the relevant ContainerJFR logs: …
Figured it out. The handler for archived recordings extends the handler for active recordings, so it ends up creating a …
Force-pushed from dfa32dc to fdc3878
Since I've added the …
Force-pushed from 1aeff5e to fb8252a
Is there a good way to test the scenario where the report is too large and causes the process to terminate? Does something like this work?
In this case, I can see the exception in the logs and in the response body, but the response status code is a 200. Also, is it a problem that the handler doesn't work for me when I run cjfr as a local process (but it works fine if I just use run.sh)?
Yea, you can do that. The other alternative would be to set the …
The 200 response code is intentional, at least for now. The code will be 200, but the response body should be some error message rather than an HTML report. This is because of how the caching system works: if I throw an exception from the method that produces the report string, then that result is not cached, and subsequent requests will retry the same generation. Reports tend to increase in size with time, or at least stay the same size, so if a previous report generation has failed, a retry is likely to fail as well; the failure should therefore be cached.
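The caching rationale above can be sketched as follows. This is a minimal illustrative cache, not ContainerJFR's actual implementation; the class and method names are assumptions. The essential behaviour is that the cached entry is populated whether generation succeeds or fails, so a doomed regeneration is not retried on every request.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class ReportCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Returns the cached report (or cached error message) for this recording,
    // generating it at most once.
    String get(String recordingId, Supplier<String> generator) {
        return cache.computeIfAbsent(recordingId, id -> {
            try {
                return generator.get();
            } catch (RuntimeException e) {
                // Cache the failure text instead of rethrowing: reports only
                // grow over time, so a retry would almost certainly fail too.
                return "Report generation failed: " + e.getMessage();
            }
        });
    }
}
```

With this shape, serving the error body with a 200 falls out naturally: the cached string is returned the same way regardless of whether it holds HTML or a failure message.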
I don't think that really matters. We're targeting OpenShift/Kubernetes first, then general containerized workloads, and the local JVM process case isn't really an intended use case. We're working toward packaging and distribution of an Operator and container images but nothing like an RPM where the local process scenario would need to be supported.
As I documented above, however, even when using …
This may be used in cases where a ConnectionDescriptor is used to represent contextual Credentials from an HTTP request for a non-target-oriented action or resource
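The "allow null targetIds" change described above suggests a shape roughly like the following. This is a hypothetical reconstruction for illustration only; ContainerJFR's real ConnectionDescriptor almost certainly differs in its fields and API.

```java
import java.util.Optional;

// Illustrative ConnectionDescriptor whose targetId is optional, so the same
// type can carry only Credentials for a non-target-oriented request.
public class ConnectionDescriptor {
    private final Optional<String> targetId;
    private final Optional<Credentials> credentials;

    ConnectionDescriptor(String targetId, Credentials credentials) {
        // Either argument may be null; both are normalized to Optional.
        this.targetId = Optional.ofNullable(targetId);
        this.credentials = Optional.ofNullable(credentials);
    }

    Optional<String> getTargetId() { return targetId; }

    Optional<Credentials> getCredentials() { return credentials; }

    // Placeholder credentials holder for the sketch.
    static class Credentials {
        final String username;
        final String password;

        Credentials(String username, String password) {
            this.username = username;
            this.password = password;
        }
    }
}
```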
This reverts commit fb8252a. If the subprocess does not exit due to exceeding the specified heap size then the exit status will not be set as expected, and the parent process will not know why the subprocess failed
Force-pushed from 948343e to 09566b7
Open target connection and copy recording stream to file, then close target connection before proceeding with report generation. This holds the target connection open for the least time possible
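The "copy first, then generate" step above can be sketched as below. The method and file names are illustrative stand-ins for the real target-connection API; the point is that closing the stream (and with it the target connection) happens before the slow report-generation step begins.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class RecordingCopier {

    // Copy the recording stream to a local temp file, closing the stream
    // (and thus the target connection) as soon as the copy completes.
    static Path copyToTempFile(InputStream recordingStream) throws Exception {
        Path tmp = Files.createTempFile("recording", ".jfr");
        try (InputStream in = recordingStream) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } // connection released here, before report generation starts
        return tmp; // the subprocess can now read the recording offline
    }
}
```

This also means a slow or OOM-killed report subprocess can no longer pin a JMX connection open for the duration of the generation.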
Okay, I'm done tinkering with this now. Any further changes beyond review adjustments will be made in separate follow-up PRs.
This isn't only on the recent commits, but I noticed that for archived recordings, the first time I request the report I see "Not found" in the body, while on all subsequent requests I don't see anything in the body. Is this an issue? Also, in general, after the first time I request an archived recording's report, I don't see a log for the …
Hmm. I'm not able to reproduce the … For the missing request logs, I think this comes from the …
To run cjfr, I did … Then, …
In the screenshot, at the end, I also tried using the …
Ah, this might be specifically to do with the OOM handling then - I missed that part. Let me try it again with that.
Should be fixed now. I'm still not too happy with the fact that these errors are reported along with a …
Force-pushed from 2b802f0 to 6273823
Write error message to cached file and serve this to clients. This brings the behaviour in line with the failure handling for active recordings.
Force-pushed from 6273823 to 01c3807
* Proof-of-concept: Subprocess report generation so that if a report requires too much memory, the subprocess can be killed rather than allowing the whole container to be taken down. This can be done by the parent process setting a smaller maximum heap size for the subprocess (not yet configurable). The subprocess is not able to connect back to container-jfr itself currently.
* Remove redundant parent process work performed by subprocess
* Limit maximum subprocess generation time
* Use Logger rather than raw System.out/err
* Spotless format
* Subprocess doesn't need to listen for incoming JMX connections
* fixup! Use Logger rather than raw System.out/err
* Better handling of subprocess timeout
* Update TODOS
* Write transformer set to file used for writing report
* Apply spotless formatting
* Refactor to use Builder with configurable env
* Pass optional credentials via env vars
* Simplify jvm arg list creation
* Use EpsilonGC for subprocess
* Add TODO
* Refactor cleanup
* Process archived recording reports in subprocess
* Fix tests
* Get truststore values from env, not filesystem
* Make /truststore dir configurable
* Use existing FileSystem local var
* Return Future and clean up clones: Return Future to allow better async handling of long-running subprocess report generation. Close recording streams when subprocess successfully generates reports to avoid dangling clone recordings, and if subprocess exits due to OOM, clean up dangling clones from parent process
* Refactor, add nonnull assertions, add tests
* Set subprocess OOM score adjustment if supported
* Allow null targetIds: This may be used in cases where a ConnectionDescriptor is used to represent contextual Credentials from an HTTP request for a non-target-oriented action or resource
* Address spotbugs violations
* Report target connection failure more accurately
* Remove subprocess max heap size
* Log elapsed subprocess run time
* Revert "Remove subprocess max heap size": This reverts commit fb8252a. If the subprocess does not exit due to exceeding the specified heap size then the exit status will not be set as expected, and the parent process will not know why the subprocess failed
* Ensure subprocess is terminated
* Add env var to configure subprocess max heap
* Document possible error message response on report generation
* Document CONTAINER_JFR_REPORT_GENERATION_MAX_HEAP
* Set subprocess timeout equal to HTTP timeout
* Copy recording to file temporarily: Open target connection and copy recording stream to file, then close target connection before proceeding with report generation. This holds the target connection open for the least time possible
* Improve archived report generation failure handling: Write error message to cached file and serve this to clients. This brings the behaviour in line with the failure handling for active recordings.
Fixes #8
Subprocess report generation so that if report requires too much memory, the subprocess can be killed rather than allowing the whole container to be taken down. This can be done by the parent process setting a smaller maximum heap size for the subprocess - currently it just assumes a value of 200M, but some heuristic can be applied here to vary the size according to the container limits. The subprocess is not yet able to connect back to container-jfr itself currently, probably due to some kind of JMX/RMI/SSL misconfiguration.
This definitely increases latency of report generation, but with greatly reduced risk of crashing the whole ContainerJFR instance, so it seems to be a worthwhile tradeoff. Especially since we already have the report caching layer in front of this, so only the first load will be slower due to the subprocess fork/await.
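The fork-with-capped-heap approach described in this PR can be sketched roughly like this. The class names, the fixed 200M figure, and the command-line shape are illustrative assumptions (the PR itself notes the heap size should eventually come from runtime heuristics and an env var), but the mechanism is the one described: launch a child JVM with a small `-Xmx` so an oversized report OOMs the child rather than the whole container.

```java
import java.io.IOException;
import java.util.List;

public class ReportSubprocessLauncher {

    // Build the child JVM command line. The 200M cap and the generator
    // main-class/recording-file arguments are placeholders for illustration.
    static List<String> command(String generatorMainClass, String recordingFile) {
        String javaBin = System.getProperty("java.home") + "/bin/java";
        return List.of(
                javaBin,
                "-Xmx200M",                    // cap the child's heap, not the parent's
                "-XX:+ExitOnOutOfMemoryError", // exit promptly on OOM with a nonzero status
                "-cp", System.getProperty("java.class.path"),
                generatorMainClass,
                recordingFile);
    }

    static Process launch(String generatorMainClass, String recordingFile) throws IOException {
        return new ProcessBuilder(command(generatorMainClass, recordingFile))
                .inheritIO() // forward child stdout/stderr to the parent's logs
                .start();
    }
}
```

The parent then waits on the returned Process (with a timeout, per the commit list) and inspects the exit status to decide whether to serve the report or cache a failure message.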