
Producer thread stuck in infinite loop #1397

Closed

erankor opened this issue Aug 28, 2017 · 18 comments
@erankor

erankor commented Aug 28, 2017

Hi,

Description

After running in production for some time, I'm seeing the librdkafka thread at 100% CPU.
The call stack is:

#0  rd_kafka_broker_ua_idle (rkb=rkb@entry=0x7f20fc008290, timeout_ms=0, timeout_ms@entry=-1) at rdkafka_broker.c:3316
#1  0x00007f21137b7afd in rd_kafka_broker_thread_main (arg=arg@entry=0x7f20fc008290) at rdkafka_broker.c:4731
#2  0x00007f21137fdc77 in _thrd_wrapper_function (aArg=<optimized out>) at tinycthread.c:624
#3  0x00007f2113c8f6ba in start_thread (arg=0x7f210de3e700) at pthread_create.c:333
#4  0x00007f21130433dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

From what I saw, rkb->rkb_state is RD_KAFKA_BROKER_STATE_CONNECT and rkb->rkb_blocking_max_ms is 0. So, rd_kafka_broker_ua_idle returns immediately, and rd_kafka_broker_thread_main doesn't really do anything in this state.
If I understand correctly, this means that librdkafka issued some connect request that got stuck for some reason, and it's polling for it indefinitely. If that is correct - is it possible to configure some timeout on connect?
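
To illustrate what I think is happening, here is a rough sketch based on the stack trace above (simplified, not the actual librdkafka source):

    /* Rough sketch of the suspected busy loop; not the actual librdkafka code. */
    while (rkb->rkb_state == RD_KAFKA_BROKER_STATE_CONNECT) {
            /* With rkb_blocking_max_ms == 0 this returns immediately, without
             * waiting for or serving anything... */
            rd_kafka_broker_ua_idle(rkb, 0);
            /* ...so the loop spins again right away: 100% CPU, no progress. */
    }
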
Other than that, the log is full of "kafka message delivery failed: Local: Message timed out" errors; I'm getting hundreds of these per second.
I'm leaving the process in this state for now, in case there is any further info you want me to extract from it.

How to reproduce

Unfortunately, I have no way to reproduce this; it happens only after running for a long time in production.

Checklist

Please provide the following information:

  • librdkafka version (release number or git tag): I took master a couple of months ago; I'm on commit 2adde00
  • Apache Kafka version: kafka_2.11-0.10.1.1
  • librdkafka client configuration: setting these parameters (a sketch of how they map onto the C API follows this checklist)
bootstrap.servers = ...
request.required.acks = 0
compression.codec = snappy
log_level = 6
message.send.max.retries = 0
queue.buffering.max.messages = 100000
retry.backoff.ms = 10
  • Operating system: Ubuntu 16.04.2 LTS
  • Using the legacy Consumer
  • Using the high-level KafkaConsumer
  • Provide logs (with debug=.. as necessary) from librdkafka
    As mentioned above, I'm getting hundreds of these per second: kafka message delivery failed: Local: Message timed out
  • Provide broker log excerpts
  • Critical issue
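
For reference, here is a minimal sketch of how the configuration above maps onto the librdkafka C API. This is not my actual producer code; the broker addresses are placeholders, and checks on the conf-set results are omitted. Note that request.required.acks, compression.codec, message.send.max.retries and retry.backoff.ms are topic-level properties, while the rest are global:

    #include <stdio.h>
    #include <librdkafka/rdkafka.h>

    int main(void) {
            char errstr[512];
            rd_kafka_conf_t *conf = rd_kafka_conf_new();
            rd_kafka_topic_conf_t *tconf = rd_kafka_topic_conf_new();

            /* Global producer properties */
            rd_kafka_conf_set(conf, "bootstrap.servers", "broker1:9092,broker2:9092",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "queue.buffering.max.messages", "100000",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "log_level", "6", errstr, sizeof(errstr));

            /* Topic-level properties */
            rd_kafka_topic_conf_set(tconf, "request.required.acks", "0", errstr, sizeof(errstr));
            rd_kafka_topic_conf_set(tconf, "compression.codec", "snappy", errstr, sizeof(errstr));
            rd_kafka_topic_conf_set(tconf, "message.send.max.retries", "0", errstr, sizeof(errstr));
            rd_kafka_topic_conf_set(tconf, "retry.backoff.ms", "10", errstr, sizeof(errstr));

            /* rd_kafka_new() takes ownership of conf on success */
            rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
            if (!rk) {
                    fprintf(stderr, "failed to create producer: %s\n", errstr);
                    return 1;
            }

            /* rd_kafka_topic_new() takes ownership of tconf on success */
            rd_kafka_topic_t *rkt = rd_kafka_topic_new(rk, "player-events", tconf);

            /* ... produce with rd_kafka_produce() and serve callbacks with rd_kafka_poll() ... */

            if (rkt)
                    rd_kafka_topic_destroy(rkt);
            rd_kafka_destroy(rk);
            return 0;
    }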

Thank you

Eran

@edenhill
Contributor

Thank you, will try to reproduce.
I don't believe it is an infinite loop, though it is a busy loop; it will terminate when the instance is closed.

@erankor
Author

erankor commented Aug 28, 2017

Correct :) thank you

@erankor
Author

erankor commented Aug 30, 2017

Btw, I may be stating the obvious here, but just to clarify: I'm not as concerned about the busy loop as I am about the fact that messages are not being sent. Adding a sleep or something like that will not solve the root problem.
IMHO the way to approach this is to add some timeout on connect. When the timeout expires, the library would close the socket and issue another connect.

@erankor
Author

erankor commented Sep 10, 2017

@edenhill, did you get a chance to check it out?
I ended up restarting the process, but the problem has reoccurred a couple of times since I opened this issue. I have 2 servers, and in the last occurrence it happened on only one of them, so I think it's caused by some temporary network glitch.
Do you think my suggestion of adding a timeout on connect makes sense? If it does, I'll see if I can create a pull request for it.

Thanks!

@erankor
Author

erankor commented Sep 15, 2017

Some more info here: I added some monitoring to the system, and yesterday it happened again. The monitoring reported errors on a couple of Kafka brokers (we have 4 in total) a few minutes before the issue started.

Kafka broker 1:

[2017-09-14 22:12:40,302] ERROR [ReplicaFetcherThread-0-1001], Error for partition [__consumer_offsets,38] to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-09-14 22:12:40,302] ERROR [ReplicaFetcherThread-0-1001], Error for partition [__consumer_offsets,2] to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-09-14 22:12:40,302] ERROR [ReplicaFetcherThread-0-1001], Error for partition [__consumer_offsets,14] to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-09-14 22:12:40,302] ERROR [ReplicaFetcherThread-0-1001], Error for partition [__consumer_offsets,26] to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-09-14 22:13:41,764] ERROR [ReplicaFetcherThread-0-1003], Error for partition [__consumer_offsets,14] to broker 1003:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

Kafka broker 2:

[2017-09-14 22:13:41,767] ERROR [ReplicaFetcherThread-0-1001], Error for partition [player-events,13] to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

Producer:

2017/09/14 22:19:42 [error] 5445#0: kafka message delivery failed: Local: Message timed out
(many of these...)

Hope this helps in figuring this one out.

@edenhill
Contributor

It would be great if you could reproduce this on the latest librdkafka master with debug=broker,topic,metadata,protocol

@erankor
Author

erankor commented Sep 15, 2017

Ok, I pulled master and enabled debug; I will update this thread when it happens again, thanks!

@erankor
Author

erankor commented Sep 19, 2017

@edenhill, I made some changes to Kafka today -

  1. Reduced the retention time on some of the topics with commands such as:
    /opt/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic heartbeat --config retention.ms=14400000
  2. Restarted all Kafka brokers, in order to change Kafka's GC logging settings

After I did that, the producers got stuck in a similar fashion to what happened before. I'm not entirely sure it's the same issue, since in the previous cases I reported, I didn't make any change to Kafka.
I'm pasting the log below; note that there are 4 producer processes, and the process id appears in the fourth column (e.g. 24958#0:). Unfortunately, the process id doesn't appear on the debug messages written by rdkafka, so we can't really correlate them, but I don't think it matters much in this case.

Thanks!

2017/09/19 06:30:24 [error] 24958#0: kafka error: broker1:9092/1004: Receive failed: Connection reset by peer
2017/09/19 06:30:24 [error] 24958#0: kafka error: broker1:9092/1004: Connection closed
2017/09/19 06:30:24 [error] 24958#0: kafka error: broker1:9092/1004: Receive failed: Connection reset by peer
2017/09/19 06:30:24 [error] 24958#0: kafka error: broker1:9092/1004: Connection closed
2017/09/19 06:30:24 [error] 24956#0: kafka error: broker1:9092/1004: Receive failed: Connection reset by peer
2017/09/19 06:30:24 [error] 24956#0: kafka error: broker1:9092/1004: Connection closed
2017/09/19 06:30:25 [error] 24956#0: kafka error: broker1:9092/bootstrap: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:25 [error] 24958#0: kafka error: broker1:9092/bootstrap: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:25 [error] 24958#0: kafka error: broker1:9092/1004: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:25 [error] 24956#0: kafka error: broker1:9092/1004: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:26 [error] 24957#0: kafka error: broker1:9092/1004: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:26 [error] 24957#0: kafka error: broker1:9092/bootstrap: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:33 [error] 24956#0: kafka error: broker2:9092/1001: Receive failed: Connection reset by peer
2017/09/19 06:30:33 [error] 24956#0: kafka error: broker2:9092/1001: Connection closed
2017/09/19 06:30:34 [error] 24956#0: kafka error: broker3:9092/1002: Receive failed: Connection reset by peer
2017/09/19 06:30:34 [error] 24956#0: kafka error: broker3:9092/1002: Connection closed
2017/09/19 06:30:38 [error] 24955#0: kafka error: broker1:9092/bootstrap: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:38 [error] 24955#0: kafka error: broker1:9092/1004: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:34:33 [error] 24958#0: kafka error: broker4:9092/1003: Receive failed: Connection reset by peer
2017/09/19 06:34:33 [error] 24958#0: kafka error: broker4:9092/1003: Connection closed
2017/09/19 06:34:33 [error] 24956#0: kafka error: broker4:9092/1003: Receive failed: Connection reset by peer
2017/09/19 06:34:33 [error] 24956#0: kafka error: broker4:9092/1003: Connection closed
2017/09/19 06:34:34 [error] 24958#0: kafka error: broker4:9092/bootstrap: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:34 [error] 24956#0: kafka error: broker4:9092/bootstrap: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:34 [error] 24957#0: kafka error: broker4:9092/bootstrap: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:34 [error] 24958#0: kafka error: broker4:9092/1003: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:34 [error] 24956#0: kafka error: broker4:9092/1003: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:34 [error] 24957#0: kafka error: broker4:9092/1003: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:47 [error] 24955#0: kafka error: broker4:9092/bootstrap: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:47 [error] 24955#0: kafka error: broker4:9092/1003: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:36:29 [error] 24956#0: kafka error: broker2:9092/1001: Receive failed: Connection reset by peer
2017/09/19 06:36:29 [error] 24956#0: kafka error: broker2:9092/1001: Connection closed
2017/09/19 06:36:29 [error] 24956#0: kafka error: broker2:9092/1001: Receive failed: Connection reset by peer
2017/09/19 06:36:29 [error] 24956#0: kafka error: broker2:9092/1001: Connection closed
2017/09/19 06:36:30 [error] 24956#0: kafka error: broker2:9092/1001: Receive failed: Connection reset by peer
2017/09/19 06:36:30 [error] 24956#0: kafka error: broker2:9092/1001: Connection closed
%4|1505803006.749|METADATA|nginx#producer-1| [thrd:main]: broker2:9092/1001: Metadata request failed: Local: Broker transport failure (1576447684ms)
%4|1505803006.757|METADATA|nginx#producer-1| [thrd:main]: broker2:9092/bootstrap: Metadata request failed: Local: Broker transport failure (1576447684ms)
2017/09/19 06:36:46 [error] 24956#0: kafka error: broker2:9092/bootstrap: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:46 [error] 24956#0: kafka error: broker2:9092/1001: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:46 [error] 24957#0: kafka error: broker2:9092/bootstrap: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:46 [error] 24957#0: kafka error: broker2:9092/1001: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:47 [error] 24958#0: kafka error: broker2:9092/1001: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:47 [error] 24958#0: kafka error: broker2:9092/bootstrap: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:47 [error] 24955#0: kafka error: broker2:9092/bootstrap: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:47 [error] 24955#0: kafka error: broker2:9092/1001: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:37:12 [error] 24958#0: kafka error: broker3:9092/1002: Receive failed: Connection reset by peer
2017/09/19 06:37:12 [error] 24958#0: kafka error: broker3:9092/1002: Connection closed
2017/09/19 06:37:12 [error] 24956#0: kafka error: broker3:9092/1002: Receive failed: Connection reset by peer
2017/09/19 06:37:12 [error] 24956#0: kafka error: broker3:9092/1002: Connection closed
2017/09/19 06:37:13 [error] 24958#0: kafka error: broker3:9092/1002: Receive failed: Connection reset by peer
2017/09/19 06:37:13 [error] 24958#0: kafka error: broker3:9092/1002: Connection closed
2017/09/19 06:37:13 [error] 24956#0: kafka error: broker3:9092/1002: Receive failed: Connection reset by peer
2017/09/19 06:37:13 [error] 24956#0: kafka error: broker3:9092/1002: Connection closed
2017/09/19 06:37:13 [error] 24958#0: kafka error: broker3:9092/1002: Receive failed: Connection reset by peer
2017/09/19 06:37:13 [error] 24958#0: kafka error: broker3:9092/1002: Connection closed
%4|1505803050.037|METADATA|nginx#producer-1| [thrd:main]: broker3:9092/1002: Metadata request failed: Local: Broker transport failure (1576490967ms)
%4|1505803050.038|METADATA|nginx#producer-1| [thrd:main]: broker3:9092/1002: Metadata request failed: Local: Broker transport failure (1576490973ms)
2017/09/19 06:37:30 [error] 24958#0: kafka error: broker3:9092/bootstrap: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:37:30 [error] 24956#0: kafka error: broker3:9092/1002: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:37:30 [error] 24958#0: kafka error: broker3:9092/1002: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:37:30 [error] 24956#0: kafka error: broker3:9092/bootstrap: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:37:30 [error] 24957#0: kafka error: broker3:9092/bootstrap: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:37:30 [error] 24957#0: kafka error: broker3:9092/1002: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:38:51 [error] 24955#0: kafka error: broker3:9092/bootstrap: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:38:51 [error] 24955#0: kafka error: broker3:9092/1002: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:40:01 [error] 24958#0: *1350031 failed to produce to topic player-events partition -1: Local: Queue full
... many of these
2017/09/19 06:41:47 [error] 24956#0: kafka message delivery failed: Local: Message timed out
... many of these

@edenhill
Contributor

Thank you, there's unfortunately nothing in these logs that gives any indication why the CPU usage would be so high.

@dacjames
Contributor

@edenhill We've experienced a bug similar to this one. It seems to occur only in the face of a less-than-perfect (or at least high-latency) network, because we have been unable to reproduce it in testing.

The "protocol" debug produces a prohibitively large amount of logs for us to enable in production; would debug=broker,topic,metadata provide sufficient insight to pursue this issue further?

@edenhill
Contributor

@dacjames Yeah, that would be a good start. Thanks

@dacjames
Contributor

@edenhill Is there any additional information that would be helpful to gather? We can deploy custom builds with additional instrumentation or logging if you have a guess as to where the problem might be.

edenhill added a commit that referenced this issue Nov 6, 2017
With a zero rkb_blocking_max_ms the ua_idle() function would
not serve broker queues and IOs resulting in the broker thread practically
hanging and busy-looping.

This fix makes sure we serve it at least once per spin.
The previous fix makes sure rkb_blocking_max_ms is not 0.
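
In outline, the change works roughly like this (a pseudocode sketch of the idea described in the commit message; serve_queues_and_io is a placeholder name, not an actual librdkafka function):

    while (rkb->rkb_state == RD_KAFKA_BROKER_STATE_CONNECT) {
            /* Before the fix: with rkb_blocking_max_ms == 0, nothing was served
             * here, so the thread spun without making progress.
             * After the fix: queues and IO are served at least once per spin,
             * and an earlier fix guarantees rkb_blocking_max_ms is never 0. */
            serve_queues_and_io(rkb, rkb->rkb_blocking_max_ms);
    }
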
@edenhill
Contributor

edenhill commented Nov 6, 2017

PR #1505 should fix this.

Big thanks for your input, @erankor and @dacjames

@erankor
Author

erankor commented Nov 6, 2017

That's great! Thanks, @edenhill!

dacjames pushed a commit to Viasat/librdkafka that referenced this issue Nov 6, 2017
dacjames pushed a commit to Viasat/librdkafka that referenced this issue Nov 6, 2017
@dacjames
Contributor

dacjames commented Nov 6, 2017

@edenhill Will test, thanks! We're still trying to get those logs, too; we should have them if this fix does not work.

edenhill added a commit that referenced this issue Nov 6, 2017
@edenhill edenhill closed this as completed Nov 7, 2017
dacjames pushed a commit to Viasat/librdkafka that referenced this issue Nov 8, 2017
dacjames pushed a commit to Viasat/librdkafka that referenced this issue Nov 8, 2017
@edenhill edenhill removed the wait-info label Nov 8, 2017
akhi3030 pushed a commit to akhi3030/librdkafka that referenced this issue Nov 13, 2017
akhi3030 pushed a commit to akhi3030/librdkafka that referenced this issue Nov 13, 2017
@adityanahan

@edenhill, do we now close the connection if the producer is stuck due to some network glitch?
I have faced this issue several times as well.

@edenhill
Contributor

The network connection will be closed as soon as a request times out.
For ProduceRequests this will be message.timeout.ms (by default 5 minutes).
We're looking into changing the ProduceRequest timeout to socket.timeout.ms to separate the total message timeout (including queuing and retries) from the per-request/retry timeout (socket.timeout.ms).
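
As an illustration of the current behaviour, here is a minimal sketch (the 60000 value is just an example, not a recommendation) of lowering the per-message timeout so a stuck ProduceRequest, and with it the connection, is abandoned sooner than the default 5 minutes; message.timeout.ms is a topic-level property:

    #include <stdio.h>
    #include <librdkafka/rdkafka.h>

    int main(void) {
            char errstr[512];
            rd_kafka_topic_conf_t *tconf = rd_kafka_topic_conf_new();

            /* Fail stuck ProduceRequests (and close the connection) after 60s
             * instead of the default 300000 ms (5 minutes). */
            if (rd_kafka_topic_conf_set(tconf, "message.timeout.ms", "60000",
                                        errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
                    fprintf(stderr, "config error: %s\n", errstr);

            /* tconf would then be passed to rd_kafka_topic_new() when creating the topic handle. */
            rd_kafka_topic_conf_destroy(tconf);
            return 0;
    }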

@abhit011

abhit011 commented Dec 4, 2017

#1575
Sorry to hijack the conversation; could you please check whether this is the same issue?

Abhi
