Bookie restarted with exception "java.lang.OutOfMemoryError: Java heap space" #141
@vedanthh, could you describe the resources requested for each of the services?
Attaching logs.
@vedanthh please use the latest operator version. You can use the resource values set in the example file here: https://github.com/pravega/pravega-operator/blob/master/example/cr-detailed.yaml Your PKS should be able to handle it. EDIT: I was looking at the wrong logs. @vedanthh is actually setting high resource limits in the manifest.
@vedanthh @adrianmo I see the following:
It is expected that, with only 2 out of 4 Bookies working, the Segment Store eventually crashes because the write quorum (3 by default) is not met. The real question here is: why are the Bookies failing? We may need the BK and ZK logs to understand that.
I've seen the following error log when Bookies start:
It looks like JVM options are not being set. This can explain why @vedanthh is hitting OOM issues despite setting high resource limits in the manifest. I'll investigate it and get back to this issue.
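For context, the upstream BookKeeper container scripts read JVM memory settings from environment variables such as `BOOKIE_MEM`. A minimal sketch of passing them through a pod spec follows; the image tag and how the Pravega operator actually wires these variables are assumptions, not taken from this thread:

```yaml
# Hypothetical sketch: hand JVM memory flags to the Bookie process via
# the BOOKIE_MEM environment variable used by the upstream BookKeeper
# container scripts. The operator's actual wiring may differ.
apiVersion: v1
kind: Pod
metadata:
  name: bookie-example
spec:
  containers:
    - name: bookie
      image: pravega/bookkeeper:latest   # illustrative tag
      env:
        - name: BOOKIE_MEM
          value: "-Xms1g -Xmx2g -XX:MaxDirectMemorySize=2g"
```

If such a variable is never set, the JVM falls back to its own default heap sizing, which ignores the pod's resource limits.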
Attaching Bookie, Controller, and ZK logs.
@adrianmo I created a cluster with
ISSUE - 141.zip
@vedanthh based on Bookie pod descriptions, I see that you haven't set proper resource requests and limits.
Please set custom resource requests and limits as in the following example manifest: https://github.com/pravega/pravega-operator/blob/master/example/cr-detailed.yaml Thanks.
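An illustrative fragment of what such a manifest might look like; the field layout follows the general shape of the linked `cr-detailed.yaml` example, but the exact values and field names here are placeholders, not copied from it:

```yaml
# Illustrative PravegaCluster fragment with explicit Bookie resource
# requests and limits. Values are placeholders; consult the linked
# cr-detailed.yaml for the project's recommended settings.
apiVersion: pravega.pravega.io/v1alpha1
kind: PravegaCluster
metadata:
  name: example
spec:
  bookkeeper:
    replicas: 4
    resources:
      requests:
        memory: "2Gi"
        cpu: "1000m"
      limits:
        memory: "4Gi"
        cpu: "2000m"
```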
@adrianmo Observing Bookie restart during longevity run with
@vedanthh the initial BK restarts are expected: they all come up at the same time and all try to initialize the BK cluster, but only one of them succeeds. However, that OOMKilled is something we should never see. It means that Kubernetes has killed the container for consuming too much memory. @Tristan1900 found out something very interesting in this regard (see: #145 (comment)) and I made the modifications in the PR. I've built a new image that contains the latest modifications with @Tristan1900's suggestions, which should prevent Kubernetes from OOMKilling the containers. Image is
Thanks!
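The thread does not show the exact flags from PR #145, but a common mitigation on Java 8 at the time was enabling cgroup-aware heap sizing so the JVM respects the container memory limit; a sketch under that assumption (both the variable name and the flags are illustrative, not confirmed to match the PR):

```yaml
# Sketch of a common Java 8 mitigation: cgroup-aware heap sizing, so
# the JVM caps its heap relative to the container memory limit instead
# of the node's physical RAM. Whether PR #145 used exactly these flags
# is an assumption.
env:
  - name: BOOKIE_EXTRA_OPTS
    value: "-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2"
```

The point either way is the same: heap plus direct memory must stay comfortably below the container limit, or Kubernetes will OOMKill the pod regardless of how high the limit is set.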
@adrianmo I created a cluster with adrianmo/pravega-operator:pr-145-2 and started a longevity run on the same; I will share my observations as they come.
@adrianmo, regarding the initial BK restarts: could we implement a pod readiness check, like the Segment Store logic, and sequence the initialization of the Bookies, i.e., start the first Bookie, wait until it is healthy and running, then initialize the second, to avoid these restarts? Any restart doesn't look good, and we didn't see them earlier. You implemented #102; would something similar help here?
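A readiness check along the lines suggested above could be as simple as a TCP probe on the Bookie port; this is a hypothetical sketch (port 3181 is the BookKeeper default, and the probe timings are arbitrary, not taken from the operator):

```yaml
# Hypothetical readiness probe for a Bookie container: mark the pod
# ready only once the Bookie port answers. Port 3181 is the BookKeeper
# default bookie port; timings are illustrative.
readinessProbe:
  tcpSocket:
    port: 3181
  initialDelaySeconds: 20
  periodSeconds: 10
  failureThreshold: 9
```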
@adrianmo Longevity has been going fine for ~14 hours with
But I'm observing unhealthy "Readiness probe failed" events and "ERROR Closing ledger 23272 due to NotEnoughBookiesException: Not enough non-faulty bookies available" errors in BookKeeper; hence I have reported issue #146.
@sumit-bm that would work, but the reason we want to start all Bookies at the same time is that the Segment Stores wait until all Bookies are ready; starting them sequentially would increase the deployment time. I'll explore other options to prevent Bookies from restarting at the beginning. @vedanthh good news: I'll merge PR #145, close this issue, and release a new operator. Please reopen the issue if this OOM error pops up again. Thanks!
Fixed in #145 |
While running a moderate IO workload in a longevity run (total: 7 readers, 14 writers, 2,500 events/sec, ~20 MB/s IO), we observed multiple Bookies (bookie 1, bookie 2, and bookie 3) restarting with the exception java.lang.OutOfMemoryError: Java heap space on a PKS environment.
Environment details: PKS / K8s with a medium cluster:
bookie_controller_log.zip