Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Observing unhealthy "Readiness probe failed" and error "ERROR Closing ledger 23272 due to NotEnoughBookiesException: Not enough non-faulty bookies available" events in Bookkeeper #146

Closed
vedanthh opened this issue Mar 28, 2019 · 3 comments
Assignees
Labels
kind/bug Something isn't working priority/P2 Slight inconvenience or annoyance to applications, system continues to function

Comments

@vedanthh
Copy link

vedanthh commented Mar 28, 2019

Observing unhealthy "Readiness probe failed" and error "ERROR Closing ledger 23272 due to NotEnoughBookiesException: Not enough non-faulty bookies available" events in Bookkeeper during Longevity run with moderate workload (Total - 7 readers, 14 writers, 2500 events/sec, ~ 20 MB/s IO)

Environment details: PKS / K8 with medium cluster:

3 master: xlarge: 4 CPU, 16 GB Ram, 32 GB Disk
5 worker: 2xlarge: 8 CPU, 32 GB Ram, 64 GB Disk
Tier-1 storage is from VSAN datastore
Tier-2 storage curved on NFS Client provisioner using Isilon as backend
Pravega version: zk-closed-client-issue-0.5.0-2162.0bbfa42
Zookeeper Operator : pravega/zookeeper-operator:0.2.1
Pravega Operator: adrianmo/pravega-operator:pr-145-2

Bookie_logs_describe_events.zip

# cat bookie-0-describe.log
Name:               pravega-bookie-0
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               d12123f8-1a60-449a-8aec-bc410f298b60/172.23.8.7
Start Time:         Wed, 27 Mar 2019 06:59:07 -0400
Labels:             app=pravega-cluster
                    component=bookie
                    controller-revision-hash=pravega-bookie-77fcdf8877
                    pravega_cluster=pravega
                    statefulset.kubernetes.io/pod-name=pravega-bookie-0
Annotations:        <none>
Status:             Running
IP:                 172.24.106.14
Controlled By:      StatefulSet/pravega-bookie
Containers:
  bookie:
    Container ID:   docker://37515387c88329af308a477933efe8ef1c940f075812d04d854b2ef0440989fb
    Image:          devops-repo.isus.emc.com:8116/nautilus/bookkeeper:zk-closed-client-issue-0.5.0-2162.0bbfa42
    Image ID:       docker-pullable://devops-repo.isus.emc.com:8116/nautilus/bookkeeper@sha256:c1e6087edba67e8b889349b8dd465cd512ee85a0c5517e876744137a94649cf3
    Port:           3181/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 27 Mar 2019 06:59:14 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:      1
      memory:   2Gi
    Liveness:   exec [/bin/sh -c netstat -ltn 2> /dev/null | grep 3181 || ss -ltn 2> /dev/null | grep 3181] delay=60s timeout=1s period=15s #success=1 #failure=4
    Readiness:  exec [/bin/sh -c /opt/bookkeeper/bin/bookkeeper shell bookiesanity] delay=20s timeout=1s period=10s #success=1 #failure=9
    Environment Variables from:
      pravega-bookie  ConfigMap  Optional: false
    Environment:      <none>
    Mounts:
      /bk/index from index (rw)
      /bk/journal from ledger (rw)
      /bk/ledgers from journal (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-q8r2x (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  journal:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  journal-pravega-bookie-0
    ReadOnly:   false
  ledger:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  ledger-pravega-bookie-0
    ReadOnly:   false
  index:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  index-pravega-bookie-0
    ReadOnly:   false
  default-token-q8r2x:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-q8r2x
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age   From                                           Message
  ----     ------     ----  ----                                           -------
  Warning  Unhealthy  19m   kubelet, d12123f8-1a60-449a-8aec-bc410f298b60  Readiness probe failed: 02:27:42,983 INFO  Created ledger 23272
02:27:45,183 ERROR Could not get additional bookie to remake ensemble, closing ledger: 23272
02:27:45,183 ERROR Closing ledger 23272 due to NotEnoughBookiesException: Not enough non-faulty bookies available
02:27:45,184 ERROR Write of ledger entry to quorum failed: L23272 E0
02:27:45,189 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
  at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
  at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
  at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
  at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
  at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
  at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
  at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
  at java.lang.Thread.run(Thread.java:748)
02:27:45,204 INFO  Deleted ledger 23272
  Warning  Unhealthy  9s  kubelet, d12123f8-1a60-449a-8aec-bc410f298b60  Readiness probe failed: 02:47:03,022 INFO  Created ledger 23765
02:47:05,976 ERROR Could not get additional bookie to remake ensemble, closing ledger: 23765
02:47:05,982 ERROR Closing ledger 23765 due to NotEnoughBookiesException: Not enough non-faulty bookies available
02:47:05,986 ERROR Write of ledger entry to quorum failed: L23765 E7
02:47:05,988 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
  at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
  at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
  at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
  at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
  at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
  at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
  at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
  at java.lang.Thread.run(Thread.java:748)
02:47:06,071 ERROR Error update ledger metadata for ledger 23765 : NoSuchLedgerExistsException: No such ledger exists
02:47:06,098 INFO  Deleted ledger 23765
@Tristan1900 Tristan1900 added kind/bug Something isn't working priority/P2 Slight inconvenience or annoyance to applications, system continues to function labels Mar 29, 2019
@Tristan1900
Copy link
Member

Tristan1900 commented Mar 29, 2019

Thanks for the report!
According to the result of kubectl descirbe, all Bookies have experienced Readiness probe failed starting from 02:26:07 to 02:55:32. The reason is Not enough non-faulty bookies available when writing entries to the ledger, which means the local Bookie is not available. Every Bookie has such error at most twice, however, they keep working despite the Readiness probe failed and they continue to work for the next three hours.
Here is a snippet of log when Readiness probe succeeds

2019-03-28 02:46:43,100 - INFO  - [bookie-io-1-2:AuthHandler$ServerSideHandler$AuthHandshakeCompleteCallback@211] - Authentication success on server side
2019-03-28 02:46:43,100 - INFO  - [bookie-io-1-2:BookieRequestHandler@50] - Channel connected  [id: 0x39ace201, L:/172.24.106.14:3181 - R:/172.24.106.14:48170]
2019-03-28 02:46:43,459 - INFO  - [BookieHighPriorityThread-3181-OrderedExecutor-4-0:ReadEntryProcessorV3@206] - Ledger fence request received for ledger: 23756 from address: /172.24.106.14:48170
2019-03-28 02:46:43,567 - INFO  - [bookie-io-1-2:BookieRequestHandler@61] - Channels disconnected: [id: 0x39ace201, L:/172.24.106.14:3181 ! R:/172.24.106.14:48170]

And here is the log when Readiness probe fails

2019-03-28 02:47:03,271 - INFO  - [bookie-io-1-4:AuthHandler$ServerSideHandler$AuthHandshakeCompleteCallback@211] - Authentication success on server side
2019-03-28 02:47:03,271 - INFO  - [bookie-io-1-4:BookieRequestHandler@50] - Channel connected  [id: 0x76ebcbce, L:/172.24.106.14:3181 - R:/172.24.106.14:48214]
2019-03-28 02:47:06,283 - INFO  - [bookie-io-1-4:BookieRequestHandler@61] - Channels disconnected: [id: 0x76ebcbce, L:/172.24.106.14:3181 ! R:/172.24.106.14:48214]

As we can see there is no error on Bookie side. I feel that this error is probably negligible when comparing to the number of successful Readiness probe.

@Tristan1900
Copy link
Member

To complement the above.
The Pravega operator is using a readiness probe to determine whether the BK is ready to serve traffic and this readiness probe fires every 10 seconds. What the probe does is basically creating a ledger and write/read entries on the local bookie, see official doc.
From the provided log we can see that the readiness probe fails because of this line

02:27:45,184 ERROR Write of ledger entry to quorum failed: L23272 E0

which means that the probe fails to write an entry to the ledger.
We don't know what causes this write failure from what we have right now, it could be caused by temporary network failure or a hardware problem or a Bookkeeper bug.
Personally, I don't think this issue needs to be fixed because

  1. this probe failure is not caused by or causing any other Bookkeeper exceptions
  2. and it happened only twice compared to 2000 successful readiness probings during the test
  3. and it is very likely that this is not a bug on Pravega or Pravega operator side

I think it would be a problem if the readiness probe fails frequently in a short period or it fails constantly during the test. Right now I feel we could close this issue.
What do you think @adrianmo ? Thanks!

@adrianmo
Copy link
Contributor

Right, @Tristan1900, it looks like this is a very sporadic issue caused by an external factor and it's not causing any interruption of the service.

I'll close the issue. Please open a new one if the issue is manifested frequently or is causing any disruption.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working priority/P2 Slight inconvenience or annoyance to applications, system continues to function
Projects
None yet
Development

No branches or pull requests

3 participants