Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High volume of PendingAddOp write failures due to timeouts #2716

Open
devinbost opened this issue May 20, 2021 · 0 comments
Open

High volume of PendingAddOp write failures due to timeouts #2716

devinbost opened this issue May 20, 2021 · 0 comments
Labels

Comments

@devinbost
Copy link

devinbost commented May 20, 2021

I noticed that we are getting a large volume of these messages:

DEBUG org.apache.bookkeeper.client.LedgerHandle - pending add not completed: PendingAddOp(lid:3510117, eid:27, completed:false)
It seemed like they were getting caused by timeouts, so I checked the log for how many timeouts we were seeing:

$ cat /var/log/messages | grep 'Bookie operation timeout' | wc -l
504184

That's over half a million timeouts over 2 hours on a test cluster, and I don't think I even sent that many messages to Pulsar that would have resulted in these PendingAddOps.

I wanted to see the full history of actions on an entryID to get an understanding of the behavior, so I did a grep for it in the log.
Here's what appeared:

DEBUG org.apache.bookkeeper.client.PendingAddOp - Unsetting success for ledger: 3510077 entry: 12098 bookie index: 1
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Could not write Add request to bookie 10.20.69.29/10.20.69.29:3181 for ledger 3510077, entry 12098
DEBUG org.apache.bookkeeper.client.PendingAddOp - Write did not succeed: 3510077, 12098. But we have already fixed it.
DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Could not write Add request to bookie 10.20.69.37/10.20.69.37:3181 for ledger 3510077, entry 12098
WARN  org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (3510077, 12098): Bookie operation timeout
. . .
[ Repeats a ton of times ]

It looks like we're returning a success callback in LedgerHandle
but it seems like the future is failing, and then bookkeeper seems to be repeatedly trying to resolve the issue and timing out without reporting the problem to the client.

This is related to apache/pulsar#6054
(In that case, it appears that these PendingAddOp timeouts are causing the broker to not ack the messages.)

I was using BookKeeper 4.12.0, but this may have been an issue in prior versions in our environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant