KafkaConsumer stuck in infinite loop on connection error #1306
Comments
Are you running …? Also, I doubt it matters, but just in case: what broker version? |
We're on … |
Apparently I'm having the same problem. It usually (but not always) appears if Kafka goes down (therefore disconnecting clients) and then comes up again. Sometimes clients will reconnect alright, sometimes they will be stuck in this loop. Here's a log extract with a traceback:
This is 1.3.5 on Python 3.5.1 on Ubuntu 16.04 (in a Docker container, with Kafka in another container). |
It also happens on 1.3.4.1, but not on 1.3.3. With 1.3.3 the behavior isn't deterministic either: sometimes it reconnects, sometimes it raises socket.gaierror (likewise, 1.3.5 sometimes reconnects and sometimes goes into the endless loop). |
I hit this earlier today in one of our development environments with a producer. The Kafka cluster had completely crashed; after the brokers were restarted, the producer never recovered. It just continually spouted these logs:
Given that it's a producer, I watched the TCP stream and verified that nothing was being sent over port 9092, so it's not even trying to talk to Kafka. Once I restarted the producer it worked fine. |
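For context, a hedged sketch of how producer-side failures are normally surfaced in kafka-python, via the future returned by send(); in the stuck loop described above, none of these paths ever reported anything. The bootstrap server, topic name, and timeout values are illustrative, not taken from the report.

```python
from kafka import KafkaProducer
from kafka.errors import KafkaError

# Bound how long the client will block internally instead of retrying forever.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',  # illustrative address
    max_block_ms=10000,                  # limit time spent waiting for metadata
    request_timeout_ms=10000,            # limit time per in-flight request
)

future = producer.send('test-topic', b'payload')  # illustrative topic
try:
    # get() raises instead of hanging forever if the broker never answers.
    record_metadata = future.get(timeout=15)
    print(record_metadata.topic, record_metadata.partition, record_metadata.offset)
except KafkaError as exc:
    print('send failed:', exc)
```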
@zackdever do you know if the broker had gone away when this was triggered? |
Ah I wish I knew for sure or had the logs to check, but I can't confirm. I think a broker had gone down when this happened. |
I believe the error here is caused by cached DNS results that are not cleared on failure. My hunch is that dynamic DNS entries are involved: a failure at the broker level then also becomes a DNS failure, because the stale cached address is reused forever. I put up a PR with a unit test that should fix it (if my diagnosis is correct). |
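To make the diagnosis above concrete, here is a minimal, self-contained sketch (not kafka-python's actual reconnect code) of the difference between reusing a cached address and re-resolving the hostname on every attempt; host, port, and timeout values are illustrative.

```python
import socket

# If the address resolved at startup is cached and reused forever, a broker
# whose IP changed after a restart can never be reached again.
def connect_with_stale_cache(cached_addr):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(5)
    try:
        sock.connect(cached_addr)   # keeps failing if the IP is stale
        return sock
    except OSError:
        sock.close()
        raise

# Re-resolving the hostname on every reconnect attempt picks up the new IP.
def connect_with_fresh_lookup(host, port):
    last_err = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(5)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err or OSError('no addresses resolved for %s:%s' % (host, port))
```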
That may be true. I suspect the reason Kafka crashed in our case is that the underlying VM was shut down and moved, so the IP address would have changed. |
This might explain what I'm seeing. I'm actually using a group of Docker containers for an integration test, and I'm testing resilience by shutting down the Kafka container and bringing it up again. I'm not certain what Docker does in this case, but if it sometimes gives the restarted container the same (internal) IP address and sometimes a different one, that could be why I sometimes get different behavior. |
Thanks @aptiko. Also, it sounds like you're the only person here with a reproducible setup.
|
I actually think it is pretty easy to reproduce at the unit-test level. The log entry that gives it away is …
|
I'm creating a minimal setup that reproduces the problem and I think the problem occurs when the Kafka container is stopped, not when it is restarted. I'll let you know more when it's ready. |
It is as I said; the problem occurs when the Kafka container is stopped. Here's the demo: https://github.com/aptiko/kafka-python-1306-demo. Easy to run if you have Docker and docker-compose experience, harder if you don't. I may be a bit slow to respond to requests, but eventually I will. |
Sorry, but I don't follow the solution. In my Python code I use KafkaProducer and KafkaConsumer; when Kafka goes down I never get control back in my code to handle exceptions and trigger my own logic. With kafka-python 1.3.3 I was able to do what I just described; now it seems I don't get that control. Let's assume I am consuming messages in this way:
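The original snippet is not preserved in this thread; as a hypothetical stand-in, the sketch below shows the kind of consume loop being described, with the exception handler that, per this report, never gets a chance to run on the affected versions. Topic, group, and bootstrap values are illustrative.

```python
from kafka import KafkaConsumer
from kafka.errors import KafkaError

# Hypothetical stand-in for the missing snippet: a plain consume loop with
# exception handling around it.
consumer = KafkaConsumer(
    'my-topic',                          # illustrative topic
    bootstrap_servers='localhost:9092',  # illustrative address
    group_id='my-group',
    auto_offset_reset='earliest',
)

try:
    for message in consumer:             # blocks while the client retries internally
        print(message.topic, message.partition, message.offset, message.value)
except KafkaError as exc:
    # With the buggy versions described above, control reportedly never
    # reaches here; the consumer just loops on reconnect attempts.
    print('Kafka failure, running recovery logic:', exc)
finally:
    consumer.close()
```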
How can I fix the problem?
|
@darkprisco if you are hit by this bug, use 1.3.3 (or master); there is no release yet that contains the fix. |
Apologies for not having this released yet. There are a few other lingering issues that I feel need to be addressed before pushing a new release. I have added a milestone and tagged those issues. If anyone has opinions about release priorities, though, I'm all ears! |
@dpkp Was the fix for this bug released in version …?
And is it because of the …? Some logs:
|
I have observed the same issue. I am using kafka-python 1.4.6. Please reopen this issue. |
same error on kafka-python==2.0.2 |
I hit the same issue with kafka-python==2.0.1.
|
I had a similar connection error when a Python Kafka app (the client) was running in Docker and the Kafka broker was running directly on my machine (not in Docker or any container), listening on listeners=PLAINTEXT://localhost:9092 with nothing set for advertised.listeners. In that setup, once a client (consumer or producer) connects to the broker on the right hostname and port, the broker advertises localhost:9092 back to it; inside a container that address does not reach the broker, so the client keeps failing to connect. The fix is to give the broker a separate listener for Docker clients:
listeners=PLAINTEXT://:9092,DOCKER://:19092
advertised.listeners=PLAINTEXT://localhost:9092,DOCKER://host.docker.internal:19092
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,DOCKER:PLAINTEXT
Once this config is in place and the broker is restarted, you can connect to it from any Docker app using host.docker.internal:19092. For a better understanding, please read this excellent blog post: https://www.confluent.io/blog/kafka-listeners-explained/ |
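For illustration, a minimal kafka-python sketch of a client running inside a container and using the DOCKER listener from the config above. The topic name and timeout are assumptions, and host.docker.internal only resolves to the host on Docker Desktop or with an equivalent host mapping (e.g. --add-host on Linux).

```python
from kafka import KafkaConsumer, KafkaProducer

# Clients inside Docker use the DOCKER listener advertised as
# host.docker.internal:19092; clients on the host keep using localhost:9092.
producer = KafkaProducer(bootstrap_servers='host.docker.internal:19092')
producer.send('test-topic', b'hello from inside a container')  # illustrative topic
producer.flush()

consumer = KafkaConsumer(
    'test-topic',
    bootstrap_servers='host.docker.internal:19092',
    auto_offset_reset='earliest',
    consumer_timeout_ms=5000,   # stop iterating if no message arrives for 5 s
)
for message in consumer:
    print(message.value)
consumer.close()
```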
It seems to be stuck in this loop: kafka-python/kafka/conn.py, line 294 (commit 34dc9dd).
The consumer filled up ~1 TB of logs over the course of 3 days, but did not throw an exception. Example logs: