Kafka consumer gets stuck after exceeding max.poll.interval.ms #344
Comments
Unless you are using the channel consumer (which you shouldn't use), you need to call Poll() or ReadMessage() at least every max.poll.interval.ms-1. |
See the "max.poll.interval.ms is enforced" chapter here: https://github.com/edenhill/librdkafka/releases/v1.0.0 |
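For reference, a minimal sketch of the kind of poll loop this implies, assuming the v1 confluent-kafka-go client; the broker address, topic, and group id are placeholders:

```go
package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":    "localhost:9092", // placeholder
		"group.id":             "example-group",  // placeholder
		"max.poll.interval.ms": 300000,           // default; the loop below must iterate faster than this
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	if err := c.SubscribeTopics([]string{"example-topic"}, nil); err != nil {
		panic(err)
	}

	for {
		// Poll with a short timeout so the loop spins well within max.poll.interval.ms.
		switch e := c.Poll(100).(type) {
		case *kafka.Message:
			fmt.Printf("message on %s\n", e.TopicPartition)
			// Long-running processing here would starve the poll loop and
			// eventually trigger the MAXPOLL "leaving group" warning.
		case kafka.Error:
			fmt.Printf("consumer error: %v\n", e)
		case nil:
			// Poll timed out; nothing to do.
		}
	}
}
```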
Hello @edenhill, I’m running into a similar issue as the original poster; I’m using a […].
My producer stopped writing messages for a few minutes and I logged this: […]
Subsequently my producer was back up, but the consumer seemed to be hanging on […]. According to the doc you linked to: […]
I’d expect that my consumer did indeed leave the group, but the subsequent call to […] appears to hang. Is this a configuration issue? Would using a shorter timeout for […] help? Or is this a manifestation of confluentinc/librdkafka#2266?
librdkafka version: […] |
This looks like confluentinc/librdkafka@80e9b1e which is fixed in librdkafka 1.1.0. |
I am having this issue too; how do I fix it? |
I am having this issue with librdkafka 1.5.0, exactly as keyan said. Can anyone help? |
Having this issue as well with […] |
My code: […]
|
To clarify, are you all seeing that your consumer won't rejoin on the subsequent […]? FWIW, after upgrading to the v1.1.0 client and also changing from a […] |
Using v1.5.2. Also calling […] |
I'm using […].
Restarting the worker fixes the problem. It seems to be a […] |
What means of consuming messages are you using? Channel consumer? ReadMessage? Poll? |
|
And you are calling ReadMessage() more often than the max.poll.interval.ms (30s) ? |
yes:
|
@edenhill Why is this issue closed? I'm seeing the same thing as everyone else, also calling […]. I personally see this issue associated with a broker rebalance. On another thread, prior to the stuck-consumer symptom everyone else reports, I see this: […]
|
I am having the same issue as well |
There is something weird in the code. |
Same issue here. If for some reason […]. @edenhill, this seems like a bug? It's contrary to what the "max.poll.interval.ms is enforced" chapter here suggests: https://github.com/edenhill/librdkafka/releases/v1.0.0
|
I am having this issue too; how do I fix it?
"%4|1627976105.775|MAXPOLL|rdkafka#consumer-1| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 29ms (adjust max.poll.interval.ms for long-running message processing): leaving group" |
+1 |
+1 |
confluent-kafka-go: 1.5.2 |
+1 |
Please try to reproduce this on the latest release (1.8.2) with debug=cgrp enabled. |
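For anyone attempting that reproduction, a minimal sketch of what such a consumer config might look like in the Go client; the broker address and group id are placeholders:

```go
package repro

import "github.com/confluentinc/confluent-kafka-go/kafka"

// newDebugConsumer builds a consumer with consumer-group debugging turned on,
// so the MAXPOLL / leave / rejoin sequence shows up in the client logs.
func newDebugConsumer() (*kafka.Consumer, error) {
	return kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092", // placeholder
		"group.id":          "repro-group",    // placeholder
		"debug":             "cgrp",
	})
}
```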
I got this same issue on 1.8.2. Since it happened in production, I didn't have debug=cgrp enabled.
%4|1639508773.124|MAXPOLL|rdkafka#consumer-4| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 196ms (adjust max.poll.interval.ms for long-running message processing): leaving group |
Encountering this too ... it randomly appears even after a poll of 0.1 ms:
2021-12-27 04:53:05,259 - some error in message: <cimpl.Message object at 0x000001B7B16D9140> code: -147 errorstr: Application maximum poll interval (300000ms) exceeded by 307ms |
I encounter this error often as well, on librdkafka 1.7.0. I just hope I can detect this error and restart the K8s pod, though. |
Have also found this issue happening in production on […]. Also, when it does get stuck, it seems like there are […] |
It would be great with a reproduce with debug=cgrp enabled so we can figure out what is going on. |
|
Add us to the set of people who are definitely seeing this problem despite calling Poll far more frequently than max.poll.interval.ms. In our case, we implemented a heartbeat that gets emitted from the loop that contains Poll() and have a separate thread that will alert if the heartbeat stops. We are seeing max poll interval exceeded and getting kicked out of the consumer group even though the heartbeat is continuous. Additionally, we are also checking for Kafka errors in the poll results, specifically looking for the […] error.
One hypothesis we are about to test is that this is caused by linking dynamically to librdkafka when doing a musl build in an Alpine container, which might explain why no one at Confluent seems able to reproduce this behaviour when so many of us are seeing it. |
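For reference, a minimal sketch of that heartbeat pattern, assuming the Go client; the 10-second threshold and the log-based "alert" are illustrative stand-ins, not the commenter's actual code:

```go
package watchdog

import (
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// RunWithWatchdog polls in a tight loop and, from a separate goroutine, alerts
// if the loop stops emitting heartbeats.
func RunWithWatchdog(c *kafka.Consumer) {
	heartbeat := make(chan struct{}, 1)

	go func() {
		for {
			select {
			case <-heartbeat:
				// Poll loop is alive.
			case <-time.After(10 * time.Second):
				log.Println("ALERT: poll loop heartbeat stopped")
			}
		}
	}()

	for {
		ev := c.Poll(100)

		// Non-blocking heartbeat so a slow watchdog can never stall polling.
		select {
		case heartbeat <- struct{}{}:
		default:
		}

		if kerr, ok := ev.(kafka.Error); ok && kerr.Code() == kafka.ErrMaxPollExceeded {
			log.Printf("ErrMaxPollExceeded despite continuous polling: %v", kerr)
		}
	}
}
```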
We've run into this issue as well - the consumer gets hung when it's working with a topic/partition with a huge backlog. We could work around this by handling revoked partitions in a rebalance callback and resubscribing:
var rebalanceCb func(c *kafka.Consumer, e kafka.Event) error
rebalanceCb = func(c *kafka.Consumer, e kafka.Event) error {
zap.L().Info("Got kafka partition rebalance event: ", zap.String("topic", topic), zap.String("consumer", c.String()), zap.String("event", e.String()))
switch e.(type) {
case kafka.RevokedPartitions:
// Resubscribe to the topic to get new partitions assigned.
err := c.SubscribeTopics([]string{topic}, rebalanceCb)
if err != nil {
return err
}
}
return nil
}
And FWIW, we also run the consumer on arm64 machines with dynamically linked librdkafka. |
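For context, a sketch of how such a callback would be wired up on the initial subscription; this fragment assumes the `c`, `topic`, and `rebalanceCb` from the snippet above plus the standard `log` package:

```go
// Register the rebalance callback when first subscribing, so that revocations
// (including those that follow a MAXPOLL kick) reach the handler above.
if err := c.SubscribeTopics([]string{topic}, rebalanceCb); err != nil {
	log.Fatalf("subscribe failed: %v", err)
}
```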
We are also facing the same issue; using @bothra90's solution, we can make the application work properly. |
Alright, we are seeing this issue as well, and have tried all of the available solutions in this issue. We consistently get this error even if we don't do anything at all with the message after polling and have a tight loop calling […] many hundreds of times a second. Clearly something is wrong in the underlying library interaction here between go <> librdkafka 🤔 FWIW we are running on x86_64 machines inside of Debian-based containers.
We have metrics emitted every time we call Poll, so we know exactly how often we are calling the method, and also have a histogram for tracking how long we wait in the poll. As mentioned above, we are calling it hundreds of times a second and are seeing single-digit ms latency for the call at the p99 quantile. We also have the […].
The rebalance fix above "works" in that we re-subscribe and start collecting messages again; however, this causes a ton of thrashing in our cluster, so it's not ideal. |
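For illustration, a minimal sketch of that kind of instrumentation around Poll; the Prometheus metric names and the `InstrumentedPoll` wrapper are hypothetical, not taken from the commenter's code:

```go
package metrics

import (
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical metrics; the names are illustrative only.
var (
	pollCount = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "kafka_poll_total", Help: "Number of Poll() calls.",
	})
	pollLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "kafka_poll_duration_seconds", Help: "Time spent inside Poll().",
	})
)

func init() {
	prometheus.MustRegister(pollCount, pollLatency)
}

// InstrumentedPoll wraps a single Poll() call with a call counter and a
// latency histogram, the kind of instrumentation described above.
func InstrumentedPoll(c *kafka.Consumer) kafka.Event {
	start := time.Now()
	ev := c.Poll(100)
	pollLatency.Observe(time.Since(start).Seconds())
	pollCount.Inc()
	return ev
}
```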
Do you happen to adjust the number of bytes you fetch in your consumers? There is an idea that it could be related, see this comment: zendesk/racecar#288 (comment) |
I found that upgrading to v2.1.1 solves this; it fixed #980. |
We are seeing this same issue in 2.4.0. @edenhill, we poll from Kafka in a busy loop, so we should see the error in the Poll return, right? We don't get any rebalance revoke event on this either. |
Description
When the consumer does not receive a message for 5 minutes (the default value of max.poll.interval.ms, 300000 ms), the consumer comes to a halt without exiting the program. The consumer process hangs and does not consume any more messages. The following error message gets logged: […]
I see that ErrMaxPollExceeded is defined here but I am unable to find where it is getting raised. If any such error is raised, why does the program not exit?
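For what it's worth, a minimal sketch of how an application could detect this condition itself and fail fast instead of hanging, assuming the Go client; the `process` callback and the exit policy are illustrative:

```go
package consumer

import (
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// ConsumeLoop reads messages and fails fast if the client reports that
// max.poll.interval.ms was exceeded, instead of silently hanging.
func ConsumeLoop(c *kafka.Consumer, process func(*kafka.Message)) {
	for {
		msg, err := c.ReadMessage(time.Second)
		if err != nil {
			if kerr, ok := err.(kafka.Error); ok && kerr.Code() == kafka.ErrMaxPollExceeded {
				// The consumer has left the group; exit so a supervisor
				// (systemd, Kubernetes, ...) can restart a fresh instance.
				log.Fatalf("consumer left group: %v", kerr)
			}
			continue // e.g. a timeout because no message arrived within 1s
		}
		process(msg)
	}
}
```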
Checklist
Please provide the following information:
- confluent_kafka.version(): master, confluent_kafka.libversion(): 1.0.0
- Client configuration: { "bootstrap.servers": "my.kafka.host", "group.id": "my.group.id", "auto.offset.reset": "earliest", "enable.auto.commit": false }
- Client logs ('debug': '..' as necessary)