Consumer gets stuck after temporary network connection drops #2363
Comments
As a start, let socket.timeout.ms be at least max(fetch.wait.max.ms, session.timeout.ms) + 1000 (preferably more).
Secondly, I'd need some more debugging; the current debug level seems insufficient — could you capture logs with debug set to 'consumer,cgrp,topic,fetch,broker'?
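For illustration, a minimal configuration sketch that follows this advice (assuming the confluent-kafka Python client; the broker address, group id, and session.timeout.ms value are placeholders, not from this issue):

```python
from confluent_kafka import Consumer

# Placeholder values for illustration only: socket.timeout.ms is kept at
# least 1000 ms above max(fetch.wait.max.ms, session.timeout.ms).
conf = {
    'bootstrap.servers': 'broker1:9092',            # placeholder
    'group.id': 'example-group',                    # placeholder
    'fetch.wait.max.ms': 2500,
    'session.timeout.ms': 10000,
    'socket.timeout.ms': 11000,                     # >= max(2500, 10000) + 1000
    'debug': 'consumer,cgrp,topic,fetch,broker',    # debug contexts requested above
}
consumer = Consumer(conf)
```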
Thanks for your response! We rolled out a new configuration to all clients yesterday, reducing fetch.wait.max.ms from 5000 to 2500. I'm very much hoping the issue was caused by this! Yes, we logged using 'consumer,cgrp,topic,fetch'. If the problem still occurs, I will replace the attached log entries with an extract captured using 'consumer,cgrp,topic,fetch,broker'. So far it hasn't happened again, but it's too early to say; by the end of the day it should be clearer. We always had at least 2 occurrences per day.
Unfortunately it happened again today; 3 clients were affected. Logging was not enabled. We will enable it using 'consumer,cgrp,topic,fetch,broker' and hope to catch an event soon.
Alright, it happened again on a client that had logging activated. At 2019-06-21 09:32:37 the client was restarted and then successfully received messages again. @edenhill, is there something you can see in the logs? I tried, but I'm not that experienced at interpreting them.
This issue is really hitting us badly... I'm thinking of implementing a workaround, like triggering an automatic reconnect. But when would be a suitable time to reconnect, without causing additional delay once the connection recovers? Maybe I could use the Admin Client to periodically check whether brokers are gone/available and then have the consumer reconnect. Or is there a better way to detect whether the consumer connection is up or down? Perhaps the cgrp part of the librdkafka statistics could be used to detect whether partitions are assigned?
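A rough sketch of that statistics-based idea (assuming the confluent-kafka Python client; the topic name, broker address, 5-minute threshold, and recreate-the-consumer recovery action are all arbitrary placeholders, not something from this issue; assignment_size is the partition count reported in the cgrp section of librdkafka's statistics):

```python
import json
import time
from confluent_kafka import Consumer

last_assigned = time.time()

def stats_cb(stats_json):
    """Track when the consumer group last had partitions assigned."""
    global last_assigned
    stats = json.loads(stats_json)
    # 'assignment_size' in the 'cgrp' section is the current number of
    # assigned partitions (see librdkafka's STATISTICS.md).
    if stats.get('cgrp', {}).get('assignment_size', 0) > 0:
        last_assigned = time.time()

conf = {
    'bootstrap.servers': 'broker1:9092',   # placeholder
    'group.id': 'example-group',           # placeholder
    'statistics.interval.ms': 10000,
    'stats_cb': stats_cb,
}

consumer = Consumer(conf)
consumer.subscribe(['important-topic'])    # placeholder topic

while True:
    msg = consumer.poll(1.0)
    if msg is not None and msg.error() is None:
        pass  # process msg.value() here
    # If no partitions have been assigned for 5 minutes (arbitrary threshold),
    # tear the consumer down and recreate it as a crude recovery.
    if time.time() - last_assigned > 300:
        consumer.close()
        consumer = Consumer(conf)
        consumer.subscribe(['important-topic'])
        last_assigned = time.time()
```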
Those symptoms sound exactly like the problem we are seeing. We are using librdkafka (v1.0.1) on Windows (built with OpenSSL) and connecting via SSL with a client certificate to the Kafka cluster. We have 1000+ clients delivering log data, and many of them stop delivering any data to the Kafka cluster. We don't know yet what triggers the problem. Unfortunately we couldn't activate verbose logging on the application so far. However, I was able to profile some of the affected applications and noticed that about 1-4 of the librdkafka threads run in a more or less tight loop, eating up about one CPU core each. Once we have debug logging for one of the problematic applications I will add the information here. We are also rolling out 1.1.0 right now. This problem is very critical for us as well.
I didn't realize that 1.1.0 had been released in the meantime. We might consider updating from 1.0.1 to 1.1.0 as well; not sure whether that change might be related to our problem. @haegele-tv we're not using SSL. CPU looked good, including during the time the problem occurred (we log it); we never go over 50%. Producing is not an issue in our case, but the consumer stops receiving messages.
We see a similar issue where the consumer cannot consume from a partition when the client-to-broker connection drops or when there is a leader change at the broker. TestNode is a compacted topic. After a restart, the consumer can start consuming again. Can you help me understand this?
@edenhill any suggestions on what is going wrong here?
Any update on this? I suspect we have hit this issue as well. We are using node-rdkafka v2.7.4, which uses librdkafka v1.2.2.
The issue is that even though it looks up the proper offset, it doesn't actually use it for the next fetch, but rather OFFSET_INVALID. This has been fixed in da7a0a0#diff-ede56b68a6b39a69e16aea71fd6952b0R1958, which was included in librdkafka v1.3.0.
I suspect I encountered a similar problem on the producer side. We use Vector, which depends on librdkafka 1.2.1.
There was a performance regression in v1.2.1 that was fixed in v1.2.2. |
We only have 10+ messages/second; the producer got stuck in an endless loop on a broken TCP connection without reconnecting.
Hi @tarvip. How are you reconnecting to the server on disconnect? You might be interested in a post I published today with a workaround at the app layer: It is only a problem if you are not already using a callback in the connect() method. A nasty and rare one, but it hit us the other day. I am about to file another issue here with another reconnection scenario, but I am in the middle of gathering the logs for the report. I had to write a script which periodically enables/disables the firewall port to Kafka and DNS to simulate a network outage, but it paid off, as I managed to reproduce the condition earlier today outside of production. Unfortunately I don't have a workaround for this one, as there appear to be no errors at the app layer that I can hook into for recovery. I will reference the issue number here once I have posted it.
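For reference, a sketch of that outage-simulation idea (assuming a Linux host with iptables, a Kafka broker on port 9092, and root privileges; the DNS blocking mentioned above is omitted, and all timings are arbitrary):

```python
import subprocess
import time

# iptables rule that drops outgoing traffic to the (assumed) broker port 9092.
RULE = ['OUTPUT', '-p', 'tcp', '--dport', '9092', '-j', 'DROP']

def block_kafka():
    # Add the DROP rule to simulate a network outage toward Kafka.
    subprocess.run(['iptables', '-A'] + RULE, check=True)

def unblock_kafka():
    # Remove the rule again so the connection can recover.
    subprocess.run(['iptables', '-D'] + RULE, check=True)

while True:
    block_kafka()
    time.sleep(60)     # outage window (arbitrary)
    unblock_kafka()
    time.sleep(300)    # recovery window (arbitrary)
```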
Issue posted here:
Description
We have about 70 clients that connect to a 3-broker Kafka cluster via WiFi. The clients are mobile, so the network connection can drop for a short time (access point roaming) or a longer time (when moved out of range). All clients have the same configuration. They subscribe to two topics: the important one has a single partition, the other has 3 partitions.
The problem is that sometimes a client does not recover its connection. The topics do not get assigned anymore, even when the connection is restored and stable for a long time. Only an application restart helps; after the restart, all queued messages are successfully received from the broker. Unfortunately we are hitting this issue in production!
After looking into the logs, it seems that the metadata request fails, so the assignment is not triggered. I was hoping we were running into #2266, but the problem still persists after updating to version 1.0.1.
I noticed the log entry complaining about fetch.wait.max.ms not being at least 1000 ms shorter than socket.timeout.ms. We changed fetch.wait.max.ms to 2500 ms and rolled it out to a couple of clients, but I'm not sure this is the cause of the issue. I would expect individual requests to fail, but not the consumer to get stuck without recovering.
A log extract is included below. At 2019-06-14 08:43:30 the client was restarted, and the messages were received successfully after that.
How to reproduce
I was not able to reliably reproduce the issue. It seems to be related to bad network quality, but does not always occur. Over the course of a day, about 3-5 out of ~70 random clients encounter this issue.
Checklist
- librdkafka version: 1.0.1
- Apache Kafka version: 2.0.0
- librdkafka client configuration: fetch.wait.max.ms=5000, socket.timeout.ms=5000, bootstrap.servers=xxx, group.id=136-Group, log.connection.close=False, auto.offset.reset=latest, enable.auto.offset.store=False, enable.auto.commit=True, socket.keepalive.enable=True
- Operating system: Ubuntu 18.04 x64
- Logs (with debug=.. as necessary) from librdkafka: log-kafka.txt