
Smoketest setup can OOM #869

Closed
andrewazores opened this issue Mar 18, 2022 · 9 comments · Fixed by #888
Assignees: jan-law
Labels: chore (Refactor, rename, cleanup, etc.), good first issue (Good for newcomers), question (Further information is requested)

Comments

andrewazores (Member) commented Mar 18, 2022

I created some recordings with the web UI, then tested some queries with curl. The backend keeps terminating after the second or third graphql query. Are there any other logs I can check to find out what happened?

POST /api/beta/graphql HTTP/1.1
Accept: application/json, */*;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 621
Content-Type: application/json
Host: localhost:8181
User-Agent: HTTPie/3.0.2

{
    "query": "query {\n    environmentNodes(filter: { name: \"JDP\" }) {\n        name\n        nodeType\n        descendantTargets {\n            recordings {\n                active(filter: { continuous: true, templateLabel: \"template=Cryostat,type=TARGET\" }) {\n                    name\n                    state\n                    metadata {\n                        labels\n                    }\n                }\n                archived {\n                    name\n                    metadata {\n                        labels\n                    }\n                }\n            }\n        }\n    }\n}\n"
}



http: error: ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) while doing a POST request to URL: http://localhost:8181/api/beta/graphql
INFO: (10.0.2.100:51816): GET /api/v1/targets/service%3Ajmx%3Armi%3A%2F%2F%2Fjndi%2Frmi%3A%2F%2Fcryostat%3A9091%2Fjmxrmi/recordings 200 21ms
Mar 18, 2022 5:47:30 PM io.cryostat.core.log.Logger info
INFO: (10.0.2.100:51810): GET /api/v1/recordings 200 28ms
Mar 18, 2022 5:47:40 PM io.cryostat.core.log.Logger info
INFO: Removing cached connection for service:jmx:rmi:///jndi/rmi://cryostat:9091/jmxrmi: EXPIRED
Mar 18, 2022 5:47:40 PM io.cryostat.core.log.Logger info
INFO: Connection for service:jmx:rmi:///jndi/rmi://cryostat:9091/jmxrmi closed
Mar 18, 2022 5:48:45 PM io.cryostat.core.log.Logger info
INFO: Creating connection for service:jmx:rmi:///jndi/rmi://cryostat:10000/jmxrmi
+ cleanup
+ podman pod kill cryostat
cab957e3c614d9e0bafd13451c105f400dc7ffc19cf78aec08744ce478ec993a
+ podman pod rm cryostat
cab957e3c614d9e0bafd13451c105f400dc7ffc19cf78aec08744ce478ec993a

Originally posted by @jan-law in https://github.com/cryostatio/cryostat/pull/868#discussion_r830231817

andrewazores added the chore and good first issue labels and removed the chore label on Mar 18, 2022
andrewazores (Member, Author) commented:

  1. It would be nice to do some investigation and see what's actually taking up the memory within the Cryostat container - whether there are resources we're holding open too long, caches we aren't clearing that leak memory, etc. Cryostat itself can be used for this, since we can get all kinds of data about it through JFR profiling.
  2. The 512MB limit is quite small for a JVM anyway, so it's probably reasonable to bump the limit in run.sh from 512MB to 768MB or 1GB.
  3. The pod is named cryostat and the container within it just gets a generated name. It would be clearer and more convenient if they were cryostat-pod and cryostat, or some other fixed names (a sketch covering this and the limit bump from point 2 follows this list).
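For reference, a minimal sketch of what points 2 and 3 might look like in the smoketest script (the image tag, published port, and exact flags here are illustrative, not the current contents of run.sh):

# Give the pod an explicit name rather than relying on a generated one.
podman pod create --name cryostat-pod --publish 8181:8181

# Run the Cryostat container inside that pod with a fixed name and a larger
# memory limit (768MB rather than 512MB).
podman run \
    --pod cryostat-pod \
    --name cryostat \
    --memory 768m \
    --rm -d quay.io/cryostat/cryostat:latest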

andrewazores added the question and chore labels on Mar 18, 2022
andrewazores (Member, Author) commented:

  1. I did some profiling investigation myself and there is no immediately visible cause for high memory usage - no Old Object Samples indicating things being held onto, no unusually large or frequent allocations. I have been unable to reproduce the OOM, or whatever failure it is, that @jan-law is experiencing.
  2. At initial startup, when all we have done is fire up the JVM and Vert.x, we're already sitting at ~340MB. After a couple of warmup requests and a GC cycle apparently occurring, I observe it sitting closer to ~430MB, but often (as expected of the GC behaviour) it appears to sit just barely below the 536MB (base10 vs base2) memory limit and only reclaim what memory is needed as it goes. No major GC was recorded in the JFR profiling, and no big STW pause. Regardless, a 512MiB container limit seems to leave fairly little headroom (one way to capture this kind of profiling data is sketched after this list).
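For anyone repeating that check, a rough sketch of how such a recording could be captured. The container name is a placeholder, and this assumes jcmd is available in the image with the JVM running as PID 1:

CRYOSTAT_CONTAINER=changeme   # placeholder for the pod's generated container name

# Start a JFR recording inside the running Cryostat container.
podman exec "$CRYOSTAT_CONTAINER" jcmd 1 JFR.start name=profiling settings=profile duration=30m filename=/tmp/profiling.jfr

# After exercising the API, dump the recording and copy it out for analysis.
podman exec "$CRYOSTAT_CONTAINER" jcmd 1 JFR.dump name=profiling filename=/tmp/profiling.jfr
podman cp "$CRYOSTAT_CONTAINER":/tmp/profiling.jfr ./profiling.jfr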

jan-law self-assigned this Mar 30, 2022
jan-law (Contributor) commented Apr 1, 2022

> it appears to be sitting just barely below the 536MB (base10 vs base2) memory limit and only reclaiming what memory is needed as it goes.

Revisiting this, I also haven't been able to reproduce another OOM since I opened this issue. The memory usage is still very close to the 536MB limit though, so I can increase the limit to 768MB.

After I started two batches of recordings and archived them at ~4:43:30, my JFR recording shows Cryostat's heap slowly increasing even though I'm not making any more API calls. Is this expected behaviour due to the additional recording information being stored in Cryostat?

[screenshot: heap usage chart showing the slow, steady increase described above]

andrewazores (Member, Author) commented:

That shouldn't really cause any real increase in memory consumption. The graph looks a little suspect with such a smooth, consistent upward slope, and there is no data on the other charts either - is there just missing data overall between those points?

Either way, the increased memory usage over time could be allocations from background tasks that Cryostat is processing, like periodic rule archivers or JDP discovery. If there is still enough free memory available to the process, you won't see that heap usage drop and the memory reclaimed until the garbage collector decides to run.
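One quick way to tell whether that slow growth is just uncollected garbage rather than a leak (same assumptions as before: placeholder container name, jcmd available in the image, JVM as PID 1) is to force a collection and compare heap occupancy before and after:

CRYOSTAT_CONTAINER=changeme   # placeholder for the pod's generated container name

# Heap occupancy before forcing a collection.
podman exec "$CRYOSTAT_CONTAINER" jcmd 1 GC.heap_info

# Ask the JVM for a full GC, then look again; if the upward slope was mostly
# floating garbage, the live heap should drop well below the container limit.
podman exec "$CRYOSTAT_CONTAINER" jcmd 1 GC.run
podman exec "$CRYOSTAT_CONTAINER" jcmd 1 GC.heap_info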

jan-law (Contributor) commented Apr 1, 2022

> is there just missing data overall between those points?

> Either way, the increased memory usage over time could be allocations from background tasks that Cryostat is processing

Yeah, there are no other data points on that slope. The rest of the Memory charts show normal behaviour, too.

andrewazores (Member, Author) commented:

Okay. Sounds like something for us to keep an eye on at least. It would also be good if we can spend a bit more time doing some heap analysis before release and make sure we aren't leaking memory or doing anything silly that generates a lot of garbage.
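A heap dump from a warmed-up instance is a cheap starting point for that kind of analysis; a sketch under the same assumptions (placeholder container name, jcmd available in the image, JVM as PID 1):

CRYOSTAT_CONTAINER=changeme   # placeholder for the pod's generated container name

# Dump the heap (live objects only, by default) and copy it out for analysis
# in a tool such as Eclipse MAT or VisualVM.
podman exec "$CRYOSTAT_CONTAINER" jcmd 1 GC.heap_dump /tmp/cryostat-heap.hprof
podman cp "$CRYOSTAT_CONTAINER":/tmp/cryostat-heap.hprof ./cryostat-heap.hprof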

jan-law (Contributor) commented Apr 7, 2022

I started a profiling recording for ~1 hr after making a few requests with GraphQL and auto rules. Besides the spikes from my requests, the heap usage stays at ~48 MiB. Heap usage looks normal to me.

Most of the allocations are coming from byte[], Object[] and char[]. Stack traces show that most of these were caused by the vert.x-eventloop-thread-3 reading data from JMX connections. Allocations were also caused by parsing the XML template when starting a recording, loading classes when opening the web client, and starting up GraphQL. None of the Old Object Samples show any objects that might be alive for longer than they should be.
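For anyone reproducing this, the relevant events can be pulled out of the recording with the jfr tool that ships with recent JDKs (the recording file name is whatever you saved it as):

# Summarise which event types dominate the recording.
jfr summary profiling.jfr

# Print the TLAB allocation events (the byte[], Object[] and char[] samples
# mentioned above) with their stack traces, and the Old Object Samples used
# to look for objects living longer than expected.
jfr print --events jdk.ObjectAllocationInNewTLAB profiling.jfr
jfr print --events jdk.OldObjectSample profiling.jfr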

When comparing memory usage with ParallelGC vs G1, I’m seeing similar CPU and memory usage, with similar GC frequency and pause times. Not sure if there are any other incentives for switching to ParallelGC.
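For context, a sketch of the flags such a comparison implies (how JVM options are passed into the container depends on the image, so the variable name here is illustrative):

# G1 is the default collector on recent JDKs; make it explicit for one run,
# with GC logging enabled so pause times and frequency can be compared.
JVM_OPTS="-XX:+UseG1GC -Xlog:gc*:file=/tmp/gc-g1.log"

# Second run with the parallel collector.
JVM_OPTS="-XX:+UseParallelGC -Xlog:gc*:file=/tmp/gc-parallel.log"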

Sometimes I get this automated analysis result for TLABs. Is there anything we can do about this?
The thread performing the most allocation is likely 'vert.x-eventloop-thread-3'. This is the most common allocation path for that class: null Many allocations performed by the same thread might indicate a problem in a multi-threaded program. Look at the aggregated stack traces for the thread with the highest allocation rate. See if the allocation rate can be brought down, or balanced among the active threads.

jan-law (Contributor) commented Apr 12, 2022

Revisiting this, I think you can trigger an OOM if you open the web client on localhost:9000, start an automated rule that periodically archives recordings, then use a GraphQL query to start a recording on all targets at the moment when the auto rule is in the middle of archiving several recordings. Is it worth increasing the container memory limit to 768M for this scenario?
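When trying to reproduce that, watching memory usage against the limit from the host makes an OOM kill easy to spot (the pod name below matches the smoketest's current cryostat pod; adjust if it changes):

# Continuously print memory usage vs. limit for every container in the pod
# while the auto rule and the GraphQL query run.
podman pod stats cryostat

# Or take a one-shot snapshot.
podman pod stats --no-stream cryostat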

andrewazores (Member, Author) commented:

Probably. That isn't a very far-fetched scenario in a real deployment that actually sees some use. If that sequence can take it down, then I'm sure there are many other similar sequences that can too, at similarly low traffic levels.
