High volume of PendingAddOp write failures due to timeouts #2716

devinbost · 2021-05-20T23:11:30Z

I noticed that we are getting a large volume of these messages:

DEBUG org.apache.bookkeeper.client.LedgerHandle - pending add not completed: PendingAddOp(lid:3510117, eid:27, completed:false)
It seemed like they were getting caused by timeouts, so I checked the log for how many timeouts we were seeing:

$ cat /var/log/messages | grep 'Bookie operation timeout' | wc -l
504184

That's over half a million timeouts over 2 hours on a test cluster, and I don't think I even sent that many messages to Pulsar that would have resulted in these PendingAddOps.

I wanted to see the full history of actions on an entryID to get an understanding of the behavior, so I did a grep for it in the log.
Here's what appeared:

DEBUG org.apache.bookkeeper.client.PendingAddOp - Unsetting success for ledger: 3510077 entry: 12098 bookie index: 1
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Could not write Add request to bookie 10.20.69.29/10.20.69.29:3181 for ledger 3510077, entry 12098
DEBUG org.apache.bookkeeper.client.PendingAddOp - Write did not succeed: 3510077, 12098. But we have already fixed it.
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Could not write Add request to bookie 10.20.69.37/10.20.69.37:3181 for ledger 3510077, entry 12098
WARN  org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (3510077, 12098): Bookie operation timeout
. . .
[ Repeats a ton of times ]

It looks like we're returning a success callback in LedgerHandle
but it seems like the future is failing, and then bookkeeper seems to be repeatedly trying to resolve the issue and timing out without reporting the problem to the client.

This is related to apache/pulsar#6054
(In that case, it appears that these PendingAddOp timeouts are causing the broker to not ack the messages.)

I was using BookKeeper 4.12.0, but this may have been an issue in prior versions in our environment.

The text was updated successfully, but these errors were encountered:

devinbost added the type/bug label May 20, 2021

sijie mentioned this issue May 21, 2021

ISSUE-2716: High volume of PendingAddOp write failures due to timeouts streamnative/bookkeeper-achieved#368

Open

devinbost mentioned this issue May 26, 2021

Catastrophic frequent random subscription freezes, especially on high-traffic topics. apache/pulsar#6054

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

High volume of PendingAddOp write failures due to timeouts #2716

High volume of PendingAddOp write failures due to timeouts #2716

devinbost commented May 20, 2021 •

edited

Loading

High volume of PendingAddOp write failures due to timeouts #2716

High volume of PendingAddOp write failures due to timeouts #2716

Comments

devinbost commented May 20, 2021 • edited Loading

devinbost commented May 20, 2021 •

edited

Loading