Elasticsearch cluster OOM caused by JVM heap size #325
Comments
Found another instance where the lack of a static JVM heap resulted in bad behavior. Compacting the segments on an index results in OOM, which I never encountered with the lower static max heap.
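For reference, the segment compaction described above would presumably have been triggered via Elasticsearch's force-merge API, along these lines (the index name is a placeholder):

```bash
# Force-merge (compact) the segments of a hypothetical index down to a single
# segment; this is the kind of operation that ran the node out of heap here.
curl -XPOST 'http://localhost:9200/my-index/_forcemerge?max_num_segments=1'
```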
We're going to look into whether we can enable automatic cgroup detection on the JVM in order to support this. That should ensure the JVM pins the heap size to the total available memory in the cgroup. /cc @kragniz
Looks like this is automatic in version 10 of the JVM. I wonder when tools will support that.
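For context, a rough sketch of the relevant flags, assuming an Elasticsearch image whose JDK supports them: on JDK 8u131+/9 cgroup-aware heap sizing sits behind an experimental flag, while JDK 10+ enables container detection by default and adds percentage-based heap sizing.

```bash
# JDK 8u131+/9: experimental cgroup-aware heap sizing (assumes the image's JDK
# ships these flags).
export ES_JAVA_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2"

# JDK 10+: -XX:+UseContainerSupport is on by default, so the heap can be sized
# as a percentage of the cgroup memory limit instead of a fixed -Xmx.
export ES_JAVA_OPTS="-XX:MaxRAMPercentage=75.0"
```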
@munnerz how does this play with the memory that pilot is using? Doesn't there need to be some buffer of memory to allow for the GC profile of pilot in the same container as the JVM?
@cehoffman we probably should reserve a small amount of memory for the pilot itself. At present it should have a pretty minimal memory footprint, but that could become an issue when/if the JVM locks all available memory.
Pilot as-is can consume quite a bit of memory. We have switched to a customized Elasticsearch container that wraps the elasticsearch binary and computes the heap size based on the container limits. Right now we set aside a static 800 MiB for pilot, and we occasionally encounter OOM on the containers.
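A simplified sketch of what such a wrapper entrypoint can look like (paths and the elasticsearch binary location are assumptions; the 800 MiB pilot reserve is the relevant part):

```bash
#!/bin/sh
# Compute the Elasticsearch heap from the cgroup v1 memory limit, reserving a
# static 800 MiB for the pilot process running in the same container.
LIMIT_BYTES=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
PILOT_RESERVE_MIB=800

LIMIT_MIB=$((LIMIT_BYTES / 1024 / 1024))
HEAP_MIB=$((LIMIT_MIB - PILOT_RESERVE_MIB))

# Pin min and max heap to the same value so the allocation happens up front.
export ES_JAVA_OPTS="-Xms${HEAP_MIB}m -Xmx${HEAP_MIB}m"
exec /usr/share/elasticsearch/bin/elasticsearch "$@"
```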
Dug into this a bit more, since I didn't quite think pilot should be the culprit. The metrics we have for a 9-member cluster show pilot using about 12 MiB on non-leaders. The reason we started getting OOM is a combination of trying to use the instance's allocated memory more fully and dynamically, i.e. computing the heap size on start, and changes in Java 10/11 that always detect when running in a container and adjust to use the full resources of the container. Taking another 200 MiB away from the JVM heap configuration seems to have alleviated our problem for the moment on a 1 TB cluster.

Going forward I think the architecture of pilot residing with the DB needs to change. The JVM should behave well when it has full control of the container resources. With the shared PID namespace feature in 1.10, I think the only valid argument for pilot being in the same container disappears. Startup ordering does not need to be handled at the instance level and could be done from the navigator controller instead.
/kind bug
What happened:
Elasticsearch cluster OOM due to excessive time spent in GC, caused by the JVM heap not being preallocated up front.
What you expected to happen:
Elasticsearch cluster continued to operate through an influx of data.
How to reproduce it (as minimally and precisely as possible):
This is basically the same cluster that has been running with a static and preallocated block of memory, just through navigator and pilot instead of the standalone pilot. Upon cycling the data producers, a sharp influx of data is generated. This sharp increase can be handled on a freshly restarted cluster, but this one had been running under navigator for about a week and had grown the JVM heap to max out the container. On receiving the influx of data, the data nodes started to stall due to excessive time spent in JVM GC.
The same configuration, but with a static heap and actually less total memory allocated to the JVM, does not encounter this problem. It is crucial to allocate the JVM heap up front.
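For comparison, the static configuration amounts to pinning equal initial and maximum heap sizes and pre-touching the pages, as the stock Elasticsearch jvm.options does (the 4g value is illustrative):

```bash
# Static heap: equal -Xms/-Xmx so the full heap is committed at startup, plus
# AlwaysPreTouch so the pages are actually touched before serving traffic.
export ES_JAVA_OPTS="-Xms4g -Xmx4g -XX:+AlwaysPreTouch"
```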
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version):