ListOffsets loop of failed requests on leader epoch change until timeout happens #4620
Comments
@davidblewett no, the two issues seem unrelated. We're fixing this one though.
While we're evaluating a fix for the other one, this one isn't occurring: after a leader epoch update, a metadata refresh is requested, the ListOffsets request is migrated to the new leader, and a new ListOffsets request is made. Some extracted logs from when it's returning "Not leader for partition":
Even when we allow listing offsets on the follower with ReplicaId -2, the error becomes
Full logs after allowing ListOffsets on a follower:
@emasab does this only affect consumers using fetch-from-follower? For the record, we're seeing something similar with regular consumers: a single broker is restarted, which causes partition leader changes, with the other two brokers that are not restarted witnessing
We use confluent-kafka-go/librdkafka 2.5.0.
This bug was closed because what was reported was the suspected behaviour, but tests then showed that it's actually prevented. As the description says, it's related to the leader epoch change.
Our (Go) consumers are getting stuck and we are not using Fetch From Follower. We've had to downgrade to 2.3.0.
This issue was not based on an actual incident, so please create a different issue with debug logs, using the latest version, so we can check what happened and give it a correct description or link it to a different known bug. From your description you could be hitting this one, but only with logs can we be sure.
Description
ListOffsets requests done for partitions with no committed offsets can be retried indefinitely if that partition's leader epoch has changed, because the request buffer is retried without being recreated with the new CurrentLeaderEpoch received from the Metadata refresh call.
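A minimal sketch of the retry behaviour described above, modeled in Go. Every name here (listOffsetsRequest, broker, send) is hypothetical; librdkafka is implemented in C and its real structures differ. The point is only to show why retrying the same buffer with a stale epoch loops until timeout:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical model of a ListOffsets buffer; not librdkafka's actual types.
type listOffsetsRequest struct {
	topic              string
	partition          int32
	currentLeaderEpoch int32 // epoch cached when the buffer was built
}

type broker struct{ leaderEpoch int32 }

var errFencedLeaderEpoch = errors.New("FENCED_LEADER_EPOCH")

// send fences any request carrying an epoch older than the broker's.
func (b broker) send(r listOffsetsRequest) error {
	if r.currentLeaderEpoch < b.leaderEpoch {
		return errFencedLeaderEpoch
	}
	return nil
}

func main() {
	b := broker{leaderEpoch: 7} // leader changed, epoch was bumped
	req := listOffsetsRequest{topic: "t", partition: 0, currentLeaderEpoch: 6}

	for attempt := 1; attempt <= 3; attempt++ {
		if err := b.send(req); err == nil {
			fmt.Println("ListOffsets succeeded")
			return
		} else {
			fmt.Printf("attempt %d: %v\n", attempt, err)
		}
		// Reported bug: the same buffer is retried unchanged, so the stale
		// epoch is fenced again on every attempt until the timeout.
		//
		// Expected behaviour: rebuild the request with the epoch obtained
		// from the metadata refresh, e.g.:
		//   req.currentLeaderEpoch = b.leaderEpoch
	}
	fmt.Println("timed out: every retry was fenced")
}
```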
How to reproduce
Start consuming partitions that have no committed offset, or seek to the latest offset. A partition leader change should then happen that raises the current leader epoch above the cached one. The ListOffsets request gets a FENCED_LEADER_EPOCH error and then a Metadata refresh is triggered, but the buffer is retried with the same (stale) CurrentLeaderEpoch, leading to a loop of failed requests. A repro sketch follows.
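A minimal repro sketch using confluent-kafka-go (the binding mentioned earlier in the thread). The broker address, topic name, group.id, and debug selection are placeholders. With a fresh group.id (no committed offsets) and auto.offset.reset=latest, librdkafka issues ListOffsets on assignment; restarting the partition leader while this runs bumps the leader epoch and should exercise the path described above:

```go
package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092",       // placeholder
		"group.id":          "fresh-group-no-commits", // no committed offsets
		"auto.offset.reset": "latest",
		"debug":             "broker,topic,metadata", // capture the failing requests
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	if err := c.SubscribeTopics([]string{"test-topic"}, nil); err != nil {
		panic(err)
	}

	// Poll while a broker restart forces a partition leader change.
	for {
		msg, err := c.ReadMessage(-1)
		if err != nil {
			fmt.Println("consumer error:", err)
			continue
		}
		fmt.Printf("got message at %v\n", msg.TopicPartition)
	}
}
```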
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
- librdkafka version (release number or git tag): 2.1.0+
- Apache Kafka version: any
- librdkafka client configuration: any
- Operating system: any
- Provide logs (with debug=.. as necessary) from librdkafka