feat(report): add granular CPU/mem configurability #7
Conversation
I packaged this branch and ran a smoketest instance of https://github.com/cryostatio/cryostat/pull/779, report generation worked when uploading the ~10MB file Janelle linked (https://github.com/cryostatio/cryostat/pull/779#discussion_r766944173) that was previously failing for her.
Any chance you could link the ~150MB file you mentioned in the corresponding issue #5, perhaps through Google Drive? I'd like to test out the PR using that as well.
I'll email you a link to it.
I spoke too soon, Janelle's 10MB one and the ~150MB one you sent me are still exhibiting the above behaviour. I originally just tested the dropdown report for the 10MB recording and not View Report in a new tab. As for the ~480MB file, both View Report and the dropdown don't work, unfortunately. My workflow was to upload the files to archives and for each one do: View Report --> Dropdown --> View Report. I did this in order of smallest to largest. For the 10MB and 150MB files the result was: Failure --> Success --> Success. For the 480MB file the result was: Failure --> Failure --> Failure. See:
I've been trying to figure out where all the memory usage is going, because it does seem like the sidecar -reports container should be able to handle larger recording files than it currently does, but I haven't had much success yet. That cryostatio/cryostat#779 PR is set up so that -reports has JDP and JMX enabled, so you can interact with it as any other target and use Cryostat to collect a Profiling JFR recording from the reports sidecar. You can try increasing the CPU and memory allocation given to the container by editing those parameters in the

You can also use the
I tried opening the 150MB .jfr file using JMC, while also running a Profiling recording on JMC itself, and found that the heap memory usage there is also dominated by allocations that stem from the report generation. It ends up looking very similar to the -reports allocations and heap usage. So, I think the 150MB file or any larger ones just really do take quite a lot of memory to process. Opening it in JMC normally makes it seem pretty quick and easy, but that's because JMC as a desktop application is able to use all of your workstation's CPU cores and can allocate a lot of heap space (probably about 1/4 of your workstation's RAM) by default, and these resource allocations completely dwarf the limits we're testing -reports with.
$ cd workspace/jmc
$ $EDITOR ./target/products/org.openjdk.jmc/linux/gtk/x86_64/jmc.ini
# add "-Xms512M", "-Xmx512M", "-XX:ActiveProcessorCount=1" to the flags
$ ./target/products/org.openjdk.jmc/linux/gtk/x86_64/jmc

I tried this out to launch the JMC desktop application with similar constraints to those we apply to cryostat-reports (512M memory, 1 CPU core), then tried opening the 150MB .jfr with it. JMC also failed to open it, although JMC's failure handling is nice and graceful: it seems to recognize that the recording is too large, and then presents a range selector to allow you to choose a narrower time window to analyze instead. After a couple of trials it seems the width of the bar corresponds to how much memory is currently available to JMC, so there appears to be some heuristic that JMC applies here to estimate how large a recording it can handle given its memory constraints. If there is such a utility somewhere in the JMC code doing this approximation then it would be very neat if we could reuse it and expose it in the cryostat-core report generator. Then we could reject requests for recordings that we estimate to be too large, rather than wasting resources trying to process the recording and ending in an OOM or TimeoutException.
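Such an admission check could be sketched by comparing the recording's on-disk size, scaled by a configurable expansion factor, against the JVM's free heap. This is an illustrative sketch, not the actual JMC or cryostat-core logic; the factor value and the class/method names are hypothetical and a real value would need benchmarking:

```java
// Sketch of a size-based admission check: refuse recordings whose
// estimated in-memory footprint would exceed the currently free heap.
public class RecordingSizeGate {
    // Hypothetical expansion ratio: bytes of heap assumed to be needed
    // per byte of .jfr file on disk.
    private final long memoryFactor;

    public RecordingSizeGate(long memoryFactor) {
        this.memoryFactor = memoryFactor;
    }

    /** Pure form, convenient for testing: all sizes passed explicitly. */
    public boolean canProcess(long recordingSizeBytes, long maxHeapBytes, long usedHeapBytes) {
        long freeHeap = maxHeapBytes - usedHeapBytes;
        return recordingSizeBytes <= freeHeap / memoryFactor;
    }

    /** Convenience form that reads the running JVM's own heap figures. */
    public boolean canProcess(long recordingSizeBytes) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return canProcess(recordingSizeBytes, rt.maxMemory(), used);
    }
}
```

With a factor of 10 and a 512MB heap of which 100MB is used, a 10MB recording would be accepted but a 150MB one rejected, which roughly matches the behaviour observed above.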
That makes sense, I like the idea of applying a rejection heuristic based on recording size. I've been playing around with

Do you have an idea as to why the reports container can smoothly process all three recordings over a simple HTTP setup, but for some reason when clicking View Report with the full Cryostat setup it doesn't work? I guess that's more of an issue with the Cryostat backend/frontend, since this container does its job well in isolation.
That sounds to me like you're running into the HTTP timeout. Cryostat will only wait up to 29 seconds for a response from the report sidecar. Any longer and it cancels the request and reports a failure status back to the requesting client. The difference in behaviour between the dropdown and View Report still sounds a bit strange and I'll have to take a deeper look into that again, but that's going to be on the main Cryostat backend side of things.
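The bounded-wait behaviour described here can be sketched with a `Future` that is cancelled on timeout. The 29-second figure comes from the comment above; the class name and the stand-in report call are hypothetical, and this is only an illustration of the pattern, not Cryostat's actual implementation:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedReportRequest {
    // Wait at most `timeoutSeconds` for the sidecar's response; if it takes
    // longer, cancel the pending work and surface the timeout (analogous to
    // Cryostat reporting a failure status back to the requesting client).
    public static String requestReport(Callable<String> reportCall, long timeoutSeconds)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> pending = pool.submit(reportCall);
            try {
                return pending.get(timeoutSeconds, TimeUnit.SECONDS);
            } catch (TimeoutException te) {
                pending.cancel(true); // give up on the in-flight request
                throw te;
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```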
Here, roughly, is the entrypoint into where JMC displays that little range selection window after it determines that the given recording is too large. Looks like the logic used for determining whether the recording is too large is here: |
I've added some logic to apply a configurable memory factor heuristic so that the service can bail out early if it thinks the provided recording is too large to process, so for that reason I have also updated this PR to close #6 - it's still possible to OOM, especially if the memory factor configuration is set to a low number, but I think this is about the best we can do. I have also added some logic to check if the uploaded file was compressed, and if so, to first decompress it to disk before trying to process it. This should keep compressed recordings from exerting additional memory pressure, since the decompression doesn't need to be done streaming in-memory while the contents of the recording are also being processed. Finally, the elapsed time taken to decompress the file (if any) is checked, and if this takes too long then a
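The compression check plus decompress-to-disk step can be sketched by sniffing the GZIP magic bytes and streaming the payload into a temporary file, so that decompression never competes with rule evaluation for heap. This is an illustrative sketch assuming GZIP is the only format handled; the class and method names are hypothetical and the real branch may differ:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

public class UploadStaging {
    // GZIP files start with the two bytes 0x1f 0x8b (little-endian 0x8b1f).
    private static final int GZIP_MAGIC = 0x8b1f;

    /** Peek at the first two bytes without consuming them. */
    static boolean isGzipped(InputStream in) throws IOException {
        in.mark(2);
        int b1 = in.read();
        int b2 = in.read();
        in.reset();
        return ((b2 << 8) | b1) == GZIP_MAGIC;
    }

    /**
     * Stage an uploaded recording on disk, decompressing first if needed,
     * so report generation reads a plain .jfr file from disk rather than
     * holding a decompression stream in memory alongside the parsed events.
     */
    public static Path stage(InputStream upload) throws IOException {
        InputStream in = upload.markSupported() ? upload : new BufferedInputStream(upload);
        if (isGzipped(in)) {
            in = new GZIPInputStream(in);
        }
        Path staged = Files.createTempFile("recording", ".jfr");
        Files.copy(in, staged, StandardCopyOption.REPLACE_EXISTING);
        return staged;
    }
}
```

Measuring the elapsed time around the `Files.copy` call is then enough to enforce the decompression time limit mentioned above.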
To create a compressed recording I simply did, e.g.
reduce memory pressure for compressed recordings by not handling both decompression and report generation at once in memory
respond with HTTP 413 (Request Entity Too Large) if the decompressed filesize exceeds available memory divided by a configurable memory factor
You should be able to generate a
I tried that configuration out with a 10 MB recording and unfortunately, it still isn't enough to trigger a

It seems the heuristic is doing its job well, but I wonder under what conditions a
Oh sorry, I read/wrote

Another way to cause a 504 could be to allocate a large amount of memory but a very small amount of CPU time. I just tested that out by modifying

I think in real practical terms this is when we would most likely see 504s manifesting - when the -reports deployment has essentially overprovisioned memory for the amount of CPU available. It's also possible they could occur when the available memory is just barely enough to process the requested recording, but the JVM gets stuck with very little free allocation room between rules evaluations and ends up spending significant time on GC cycles (GC thrashing). This might be possible for specific combinations of smaller memory and recording sizes with a single CPU core (even a whole one), particularly if the JVM is running with the Serial GC.
Anyway, that minor unit test addition was the last last-minute change I thought of for this PR. I won't make any further alterations unless something comes up again in review.
Nice, that CPU configuration gave me a 504 during report generation. I checked over the code and it looks good to go.
> Nice, that CPU configuration gave me a 504 during report generation.
Same here. The 413 is working well too.
Tests also look good.
Fixes #5
Fixes #6