Producer thread stuck in infinite loop #1397
Comments
Thank you, will try to reproduce.
Correct :) thank you
Btw, I may be stating the obvious here, but just to clarify - I'm not as concerned about the busy loop as I am about the fact that messages are not being sent. Adding a sleep or something like that will not solve the root problem.
@edenhill, did you get a chance to check it out? Thanks!
Some more info here - I added some monitoring to the system, and yesterday it happened again. The monitoring reported errors on a couple of Kafka brokers (we have 4 in total) a few minutes before the issue started. Kafka broker 1:
Kafka broker 2:
Producer:
Hope it will help figure this one out.
It would be great if you could reproduce this on the latest librdkafka master with debug enabled.
Ok, pulled master and enabled debug, will update this thread when it happens, thanks!
@edenhill, I made some changes to Kafka today -
After I did that, the producers got stuck in a similar fashion to what happened before. I'm not entirely sure it's the same issue, since in the previous cases I reported, I didn't make any change to Kafka. Thanks!
Thank you, there's unfortunately nothing in these logs that gives any indication why the CPU usage would be so high.
@edenhill We've experienced a bug similar to this one. It seems to occur only in the face of a less-than-perfect (or at least high-latency) network, as we have been unable to reproduce it in testing. The "protocol" debug produces a prohibitively large amount of logs for us to enable in production; would debug=broker,topic,metadata provide sufficient insight to pursue this issue further?
@dacjames Yeah, that would be a good start. Thanks
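For reference, a minimal sketch (assuming the standard librdkafka C API and a placeholder broker address) of enabling those debug contexts programmatically:

```c
/* Minimal sketch: enable debug=broker,topic,metadata via librdkafka's C API.
 * The broker address is an illustrative placeholder. */
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();

    /* Equivalent to setting debug=broker,topic,metadata in the config file */
    if (rd_kafka_conf_set(conf, "debug", "broker,topic,metadata",
                          errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
        fprintf(stderr, "debug conf failed: %s\n", errstr);
        return 1;
    }
    rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                      errstr, sizeof(errstr));

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
    if (!rk) {
        fprintf(stderr, "producer creation failed: %s\n", errstr);
        return 1;
    }
    /* ... produce as usual; debug output goes to the configured logger
     * (stderr by default) ... */
    rd_kafka_destroy(rk);
    return 0;
}
```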
@edenhill Is there any additional information that would be helpful to gather? We can deploy custom builds with any additional instrumentation or logging if you have a guess as to where the problem might be.
With a zero rkb_blocking_max_ms the ua_idle() function would not serve broker queues and IOs resulting in the broker thread practically hanging and busy-looping. This fix makes sure we serve it at least once per spin. The previous fix makes sure rkb_blocking_max_ms is not 0.
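A simplified illustration of the failure mode the commit message describes; the types and helper below are stand-ins, not the actual librdkafka source:

```c
/* Sketch only: NOT the real librdkafka code. It mimics how a zero blocking
 * timeout turns the broker thread's idle loop into a busy loop. */
#include <stdbool.h>

struct broker {
    int state;            /* e.g. a CONNECT-like state */
    int blocking_max_ms;  /* 0 is the problematic value */
};

/* ua_idle-style helper: serves broker op queues and socket IO for up to
 * timeout_ms. With timeout_ms == 0 it returns immediately, so nothing
 * is served. */
static void ua_idle(struct broker *rkb, int timeout_ms) {
    (void)rkb;
    if (timeout_ms <= 0)
        return;
    /* ... poll the op queue and IO for up to timeout_ms ... */
}

/* Broker-thread main loop (simplified): with blocking_max_ms == 0 each
 * iteration does no useful work, so the thread spins at 100% CPU and no
 * messages make progress. The fix serves the queues at least once per
 * spin and ensures blocking_max_ms is never 0. */
static void broker_thread_main(struct broker *rkb, volatile bool *run) {
    while (*run)
        ua_idle(rkb, rkb->blocking_max_ms);
}

int main(void) {
    struct broker rkb = { .state = 1, .blocking_max_ms = 0 };
    volatile bool run = false;   /* set to true to observe the spin */
    broker_thread_main(&rkb, &run);
    return 0;
}
```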
That's great! Thanks @edenhill!
@edenhill Will test, thanks! We're still trying to get those logs, too; we should have them if this fix does not work.
@edenhill, do we now close the connection if the producer is stuck due to some network glitches?
The network connection will be closed as soon as a request times out.
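A brief sketch of the two timeouts at play, using librdkafka's socket.timeout.ms and message.timeout.ms properties; the values shown are the defaults and are illustrative only:

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

static rd_kafka_conf_t *make_conf(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();
    rd_kafka_topic_conf_t *tconf = rd_kafka_topic_conf_new();

    /* How long a request may stay in flight before the broker connection
     * is considered failed and closed (illustrative: the default). */
    rd_kafka_conf_set(conf, "socket.timeout.ms", "60000", errstr, sizeof(errstr));

    /* Topic-level limit on how long a produced message may wait for
     * delivery (queueing + retries) before it fails with
     * "Local: Message timed out" (illustrative: the default). */
    rd_kafka_topic_conf_set(tconf, "message.timeout.ms", "300000",
                            errstr, sizeof(errstr));
    rd_kafka_conf_set_default_topic_conf(conf, tconf);

    return conf;
}

int main(void) {
    char errstr[512];
    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, make_conf(),
                                  errstr, sizeof(errstr));
    if (!rk) {
        fprintf(stderr, "producer creation failed: %s\n", errstr);
        return 1;
    }
    rd_kafka_destroy(rk);
    return 0;
}
```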
#1575 Abhi |
Hi,
Description
After running in production for some time, I'm seeing the librdkafka thread at 100% CPU.
Call stack is:
From what I saw, rkb->rkb_state is RD_KAFKA_BROKER_STATE_CONNECT and rkb->rkb_blocking_max_ms is 0. So, rd_kafka_broker_ua_idle returns immediately, and rd_kafka_broker_thread_main doesn't really do anything in this state.
If I understand correctly, this means that librdkafka issued some connect request that got stuck for some reason, and it's polling for it indefinitely. If that is correct - is it possible to configure some timeout on connect?
Other than that, the log is full of
kafka message delivery failed: Local: Message timed out
and I'm getting hundreds of these per second. I'm leaving the process in this state for now, in case there is any further info you want me to extract from it.
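A minimal sketch (assuming the standard librdkafka C API; the logging is illustrative) of catching these per-message timeouts in a delivery report callback rather than scraping the error log:

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

/* Delivery report callback: invoked from rd_kafka_poll() for every message. */
static void dr_msg_cb(rd_kafka_t *rk, const rd_kafka_message_t *rkmessage,
                      void *opaque) {
    (void)rk; (void)opaque;
    if (rkmessage->err == RD_KAFKA_RESP_ERR__MSG_TIMED_OUT)
        fprintf(stderr, "message timed out waiting in the producer queue\n");
    else if (rkmessage->err)
        fprintf(stderr, "delivery failed: %s\n", rd_kafka_err2str(rkmessage->err));
}

int main(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();

    rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                      errstr, sizeof(errstr));
    rd_kafka_conf_set_dr_msg_cb(conf, dr_msg_cb);  /* register before rd_kafka_new() */

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
    if (!rk) {
        fprintf(stderr, "producer creation failed: %s\n", errstr);
        return 1;
    }

    /* ... rd_kafka_produce()/rd_kafka_producev() as usual ... */

    /* The callback only fires from poll calls, so poll regularly. */
    rd_kafka_poll(rk, 100);
    rd_kafka_flush(rk, 10000);
    rd_kafka_destroy(rk);
    return 0;
}
```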
How to reproduce
Unfortunately, I have no way to reproduce this; it happens only after running for a long time in production.
Checklist
Please provide the following information:
Logs (with debug=.. as necessary) from librdkafka: as mentioned above, getting hundreds of these per second -
kafka message delivery failed: Local: Message timed out
Thank you
Eran