
Producer thread stuck in infinite loop #1397

Closed

erankor opened this issue Aug 28, 2017 · 18 comments
@erankor

erankor commented Aug 28, 2017

Hi,

Description

After running in production for some time, I'm seeing the librdkafka thread at 100% CPU.
The call stack is:

#0  rd_kafka_broker_ua_idle (rkb=rkb@entry=0x7f20fc008290, timeout_ms=0, timeout_ms@entry=-1) at rdkafka_broker.c:3316
#1  0x00007f21137b7afd in rd_kafka_broker_thread_main (arg=arg@entry=0x7f20fc008290) at rdkafka_broker.c:4731
#2  0x00007f21137fdc77 in _thrd_wrapper_function (aArg=<optimized out>) at tinycthread.c:624
#3  0x00007f2113c8f6ba in start_thread (arg=0x7f210de3e700) at pthread_create.c:333
#4  0x00007f21130433dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

From what I saw, rkb->rkb_state is RD_KAFKA_BROKER_STATE_CONNECT and rkb->rkb_blocking_max_ms is 0. So, rd_kafka_broker_ua_idle returns immediately, and rd_kafka_broker_thread_main doesn't really do anything in this state.
If I understand correctly, this means that librdkafka issued some connect request that got stuck for some reason, and it's polling for it indefinitely. If that is correct - is it possible to configure some timeout on connect?
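
To illustrate what I think is happening, here is a rough sketch based on the stack trace above (simplified, not the actual librdkafka source):

    /* Rough sketch of the suspected busy loop; not the actual librdkafka code. */
    while (rkb->rkb_state == RD_KAFKA_BROKER_STATE_CONNECT) {
            /* With rkb_blocking_max_ms == 0 this returns immediately, without
             * waiting for or serving anything... */
            rd_kafka_broker_ua_idle(rkb, 0);
            /* ...so the loop spins again right away: 100% CPU, no progress. */
    }
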
Other than that, the log is full of "kafka message delivery failed: Local: Message timed out" errors; I'm getting hundreds of these per second.
I'm leaving the process in this state for now, in case there is any further info you want me to extract from it.

How to reproduce

Unfortunately, I have no way to reproduce this; it happens only after running for a long time in production.

Checklist

Please provide the following information:

  • librdkafka version (release number or git tag): I took master a couple of months ago; I'm on commit 2adde00
  • Apache Kafka version: kafka_2.11-0.10.1.1
  • librdkafka client configuration: setting these parameters (a sketch of how they map onto the C API follows this checklist)
bootstrap.servers = ...
request.required.acks = 0
compression.codec = snappy
log_level = 6
message.send.max.retries = 0
queue.buffering.max.messages = 100000
retry.backoff.ms = 10
  • Operating system: Ubuntu 16.04.2 LTS
  • Using the legacy Consumer
  • Using the high-level KafkaConsumer
  • Provide logs (with debug=.. as necessary) from librdkafka
    As mentioned above, I'm getting hundreds of these per second: kafka message delivery failed: Local: Message timed out
  • Provide broker log excerpts
  • Critical issue
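
For reference, here is a minimal sketch of how the configuration above maps onto the librdkafka C API. This is not my actual producer code; the broker addresses are placeholders, and checks on the conf-set results are omitted. Note that request.required.acks, compression.codec, message.send.max.retries and retry.backoff.ms are topic-level properties, while the rest are global:

    #include <stdio.h>
    #include <librdkafka/rdkafka.h>

    int main(void) {
            char errstr[512];
            rd_kafka_conf_t *conf = rd_kafka_conf_new();
            rd_kafka_topic_conf_t *tconf = rd_kafka_topic_conf_new();

            /* Global producer properties */
            rd_kafka_conf_set(conf, "bootstrap.servers", "broker1:9092,broker2:9092",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "queue.buffering.max.messages", "100000",
                              errstr, sizeof(errstr));
            rd_kafka_conf_set(conf, "log_level", "6", errstr, sizeof(errstr));

            /* Topic-level properties */
            rd_kafka_topic_conf_set(tconf, "request.required.acks", "0", errstr, sizeof(errstr));
            rd_kafka_topic_conf_set(tconf, "compression.codec", "snappy", errstr, sizeof(errstr));
            rd_kafka_topic_conf_set(tconf, "message.send.max.retries", "0", errstr, sizeof(errstr));
            rd_kafka_topic_conf_set(tconf, "retry.backoff.ms", "10", errstr, sizeof(errstr));

            /* rd_kafka_new() takes ownership of conf on success */
            rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
            if (!rk) {
                    fprintf(stderr, "failed to create producer: %s\n", errstr);
                    return 1;
            }

            /* rd_kafka_topic_new() takes ownership of tconf on success */
            rd_kafka_topic_t *rkt = rd_kafka_topic_new(rk, "player-events", tconf);

            /* ... produce with rd_kafka_produce() and serve callbacks with rd_kafka_poll() ... */

            if (rkt)
                    rd_kafka_topic_destroy(rkt);
            rd_kafka_destroy(rk);
            return 0;
    }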

Thank you

Eran

@edenhill
Contributor

Thank you, will try to reproduce.
I don't believe it is an infinite loop, though it is a busy loop; it will terminate when the instance is closed.

@erankor
Author

erankor commented Aug 28, 2017

Correct :) thank you

@erankor
Author

erankor commented Aug 30, 2017

Btw, I may be stating the obvious here, but just to clarify: I'm not as concerned about the busy loop as I am about the fact that messages are not being sent. Adding a sleep or something like that will not solve the root problem.
IMHO the way to approach this is to add some timeout on connect. When the timeout expires, the library would close the socket and issue another connect.

@erankor
Author

erankor commented Sep 10, 2017

@edenhill, did you get a chance to check it out?
I ended up restarting the process, but the problem has reoccurred a couple of times since I opened this issue. I have 2 servers, and in the last occurrence it happened on only one of them, so I think it's caused by some temporary network glitch.
Do you think my suggestion of adding a timeout on connect makes sense? If it does, I'll see if I can create a pull request for it.

Thanks!

@erankor
Author

erankor commented Sep 15, 2017

Some more info here: I added some monitoring to the system, and yesterday it happened again. The monitoring reported errors on a couple of Kafka brokers (we have 4 in total) a few minutes before the issue started.

Kafka broker 1:

[2017-09-14 22:12:40,302] ERROR [ReplicaFetcherThread-0-1001], Error for partition [__consumer_offsets,38] to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-09-14 22:12:40,302] ERROR [ReplicaFetcherThread-0-1001], Error for partition [__consumer_offsets,2] to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-09-14 22:12:40,302] ERROR [ReplicaFetcherThread-0-1001], Error for partition [__consumer_offsets,14] to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-09-14 22:12:40,302] ERROR [ReplicaFetcherThread-0-1001], Error for partition [__consumer_offsets,26] to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-09-14 22:13:41,764] ERROR [ReplicaFetcherThread-0-1003], Error for partition [__consumer_offsets,14] to broker 1003:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

Kafka broker 2:

[2017-09-14 22:13:41,767] ERROR [ReplicaFetcherThread-0-1001], Error for partition [player-events,13] to broker 1001:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

Producer:

2017/09/14 22:19:42 [error] 5445#0: kafka message delivery failed: Local: Message timed out
(many of these...)

Hope this helps in figuring this one out.

@edenhill
Contributor

It would be great if you could reproduce this on the latest librdkafka master with debug=broker,topic,metadata,protocol

@erankor
Author

erankor commented Sep 15, 2017

Ok, I pulled master and enabled debug; I will update this thread when it happens again, thanks!

@erankor
Author

erankor commented Sep 19, 2017

@edenhill, I made some changes to Kafka today -

  1. Reduced the retention time on some of the topics with commands such as:
    /opt/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic heartbeat --config retention.ms=14400000
  2. Restarted all Kafka brokers, in order to change Kafka's GC logging settings

After I did that, the producers got stuck in a similar fashion to what happened before. I'm not entirely sure it's the same issue, since in the previous cases I reported, I didn't make any change to Kafka.
I'm pasting the log below; note that there are 4 producer processes, and the process id appears in the fourth column (e.g. 24958#0:). Unfortunately, the process id doesn't appear on the debug messages written by rdkafka, so we can't really correlate them, but I don't think it matters much in this case.

Thanks!

2017/09/19 06:30:24 [error] 24958#0: kafka error: broker1:9092/1004: Receive failed: Connection reset by peer
2017/09/19 06:30:24 [error] 24958#0: kafka error: broker1:9092/1004: Connection closed
2017/09/19 06:30:24 [error] 24958#0: kafka error: broker1:9092/1004: Receive failed: Connection reset by peer
2017/09/19 06:30:24 [error] 24958#0: kafka error: broker1:9092/1004: Connection closed
2017/09/19 06:30:24 [error] 24956#0: kafka error: broker1:9092/1004: Receive failed: Connection reset by peer
2017/09/19 06:30:24 [error] 24956#0: kafka error: broker1:9092/1004: Connection closed
2017/09/19 06:30:25 [error] 24956#0: kafka error: broker1:9092/bootstrap: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:25 [error] 24958#0: kafka error: broker1:9092/bootstrap: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:25 [error] 24958#0: kafka error: broker1:9092/1004: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:25 [error] 24956#0: kafka error: broker1:9092/1004: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:26 [error] 24957#0: kafka error: broker1:9092/1004: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:26 [error] 24957#0: kafka error: broker1:9092/bootstrap: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:33 [error] 24956#0: kafka error: broker2:9092/1001: Receive failed: Connection reset by peer
2017/09/19 06:30:33 [error] 24956#0: kafka error: broker2:9092/1001: Connection closed
2017/09/19 06:30:34 [error] 24956#0: kafka error: broker3:9092/1002: Receive failed: Connection reset by peer
2017/09/19 06:30:34 [error] 24956#0: kafka error: broker3:9092/1002: Connection closed
2017/09/19 06:30:38 [error] 24955#0: kafka error: broker1:9092/bootstrap: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:30:38 [error] 24955#0: kafka error: broker1:9092/1004: Connect to ipv4#broker1:9092 failed: Connection refused
2017/09/19 06:34:33 [error] 24958#0: kafka error: broker4:9092/1003: Receive failed: Connection reset by peer
2017/09/19 06:34:33 [error] 24958#0: kafka error: broker4:9092/1003: Connection closed
2017/09/19 06:34:33 [error] 24956#0: kafka error: broker4:9092/1003: Receive failed: Connection reset by peer
2017/09/19 06:34:33 [error] 24956#0: kafka error: broker4:9092/1003: Connection closed
2017/09/19 06:34:34 [error] 24958#0: kafka error: broker4:9092/bootstrap: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:34 [error] 24956#0: kafka error: broker4:9092/bootstrap: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:34 [error] 24957#0: kafka error: broker4:9092/bootstrap: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:34 [error] 24958#0: kafka error: broker4:9092/1003: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:34 [error] 24956#0: kafka error: broker4:9092/1003: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:34 [error] 24957#0: kafka error: broker4:9092/1003: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:47 [error] 24955#0: kafka error: broker4:9092/bootstrap: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:34:47 [error] 24955#0: kafka error: broker4:9092/1003: Connect to ipv4#broker4:9092 failed: Connection refused
2017/09/19 06:36:29 [error] 24956#0: kafka error: broker2:9092/1001: Receive failed: Connection reset by peer
2017/09/19 06:36:29 [error] 24956#0: kafka error: broker2:9092/1001: Connection closed
2017/09/19 06:36:29 [error] 24956#0: kafka error: broker2:9092/1001: Receive failed: Connection reset by peer
2017/09/19 06:36:29 [error] 24956#0: kafka error: broker2:9092/1001: Connection closed
2017/09/19 06:36:30 [error] 24956#0: kafka error: broker2:9092/1001: Receive failed: Connection reset by peer
2017/09/19 06:36:30 [error] 24956#0: kafka error: broker2:9092/1001: Connection closed
%4|1505803006.749|METADATA|nginx#producer-1| [thrd:main]: broker2:9092/1001: Metadata request failed: Local: Broker transport failure (1576447684ms)
%4|1505803006.757|METADATA|nginx#producer-1| [thrd:main]: broker2:9092/bootstrap: Metadata request failed: Local: Broker transport failure (1576447684ms)
2017/09/19 06:36:46 [error] 24956#0: kafka error: broker2:9092/bootstrap: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:46 [error] 24956#0: kafka error: broker2:9092/1001: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:46 [error] 24957#0: kafka error: broker2:9092/bootstrap: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:46 [error] 24957#0: kafka error: broker2:9092/1001: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:47 [error] 24958#0: kafka error: broker2:9092/1001: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:47 [error] 24958#0: kafka error: broker2:9092/bootstrap: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:47 [error] 24955#0: kafka error: broker2:9092/bootstrap: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:36:47 [error] 24955#0: kafka error: broker2:9092/1001: Connect to ipv4#broker2:9092 failed: Connection refused
2017/09/19 06:37:12 [error] 24958#0: kafka error: broker3:9092/1002: Receive failed: Connection reset by peer
2017/09/19 06:37:12 [error] 24958#0: kafka error: broker3:9092/1002: Connection closed
2017/09/19 06:37:12 [error] 24956#0: kafka error: broker3:9092/1002: Receive failed: Connection reset by peer
2017/09/19 06:37:12 [error] 24956#0: kafka error: broker3:9092/1002: Connection closed
2017/09/19 06:37:13 [error] 24958#0: kafka error: broker3:9092/1002: Receive failed: Connection reset by peer
2017/09/19 06:37:13 [error] 24958#0: kafka error: broker3:9092/1002: Connection closed
2017/09/19 06:37:13 [error] 24956#0: kafka error: broker3:9092/1002: Receive failed: Connection reset by peer
2017/09/19 06:37:13 [error] 24956#0: kafka error: broker3:9092/1002: Connection closed
2017/09/19 06:37:13 [error] 24958#0: kafka error: broker3:9092/1002: Receive failed: Connection reset by peer
2017/09/19 06:37:13 [error] 24958#0: kafka error: broker3:9092/1002: Connection closed
%4|1505803050.037|METADATA|nginx#producer-1| [thrd:main]: broker3:9092/1002: Metadata request failed: Local: Broker transport failure (1576490967ms)
%4|1505803050.038|METADATA|nginx#producer-1| [thrd:main]: broker3:9092/1002: Metadata request failed: Local: Broker transport failure (1576490973ms)
2017/09/19 06:37:30 [error] 24958#0: kafka error: broker3:9092/bootstrap: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:37:30 [error] 24956#0: kafka error: broker3:9092/1002: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:37:30 [error] 24958#0: kafka error: broker3:9092/1002: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:37:30 [error] 24956#0: kafka error: broker3:9092/bootstrap: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:37:30 [error] 24957#0: kafka error: broker3:9092/bootstrap: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:37:30 [error] 24957#0: kafka error: broker3:9092/1002: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:38:51 [error] 24955#0: kafka error: broker3:9092/bootstrap: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:38:51 [error] 24955#0: kafka error: broker3:9092/1002: Connect to ipv4#broker3:9092 failed: Connection refused
2017/09/19 06:40:01 [error] 24958#0: *1350031 failed to produce to topic player-events partition -1: Local: Queue full
... many of these
2017/09/19 06:41:47 [error] 24956#0: kafka message delivery failed: Local: Message timed out
... many of these

@edenhill
Contributor

Thank you, there's unfortunately nothing in these logs that gives any indication why the CPU usage would be so high.

@dacjames
Contributor

@edenhill We've experienced a bug similar to this one. It seems to occur only in the face of a less-than-perfect (or at least high-latency) network, because we have been unable to reproduce it in testing.

The "protocol" debug produces a prohibitively large amount of logs for us to enable in production; would debug=broker,topic,metadata provide sufficient insight to pursue this issue further?

@edenhill
Contributor

@dacjames Yeah, that would be a good start. Thanks

@dacjames
Contributor

@edenhill Is there any additional information that would be helpful to gather? We can deploy custom builds with additional instrumentation or logging if you have a guess as to where the problem might be.

edenhill added a commit that referenced this issue Nov 6, 2017
With a zero rkb_blocking_max_ms the ua_idle() function would
not serve broker queues and IOs resulting in the broker thread practically
hanging and busy-looping.

This fix makes sure we serve it at least once per spin.
The previous fix makes sure rkb_blocking_max_ms is not 0.
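
In outline, the change works roughly like this (a pseudocode sketch of the idea described in the commit message; serve_queues_and_io is a placeholder name, not an actual librdkafka function):

    while (rkb->rkb_state == RD_KAFKA_BROKER_STATE_CONNECT) {
            /* Before the fix: with rkb_blocking_max_ms == 0, nothing was served
             * here, so the thread spun without making progress.
             * After the fix: queues and IO are served at least once per spin,
             * and an earlier fix guarantees rkb_blocking_max_ms is never 0. */
            serve_queues_and_io(rkb, rkb->rkb_blocking_max_ms);
    }
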
@edenhill
Contributor

edenhill commented Nov 6, 2017

PR #1505 should fix this.

Big thanks for your input, @erankor and @dacjames

@erankor
Author

erankor commented Nov 6, 2017

That's great! Thanks, @edenhill!

dacjames pushed a commit to Viasat/librdkafka that referenced this issue Nov 6, 2017
dacjames pushed a commit to Viasat/librdkafka that referenced this issue Nov 6, 2017
@dacjames
Contributor

dacjames commented Nov 6, 2017

@edenhill Will test, thanks! We're still trying to get those logs, too; we should have them if this fix does not work.

edenhill added a commit that referenced this issue Nov 6, 2017
@edenhill edenhill closed this as completed Nov 7, 2017
dacjames pushed a commit to Viasat/librdkafka that referenced this issue Nov 8, 2017
dacjames pushed a commit to Viasat/librdkafka that referenced this issue Nov 8, 2017
@edenhill edenhill removed the wait-info label Nov 8, 2017
akhi3030 pushed a commit to akhi3030/librdkafka that referenced this issue Nov 13, 2017
akhi3030 pushed a commit to akhi3030/librdkafka that referenced this issue Nov 13, 2017
@adityanahan

@edenhill, do we now close the connection if the producer is stuck due to some network glitch?
I have faced this issue several times as well.

@edenhill
Contributor

The network connection will be closed as soon as a request times out.
For ProduceRequests this will be message.timeout.ms (by default 5 minutes).
We're looking into changing the ProduceRequest timeout to socket.timeout.ms to separate the total message timeout (including queuing and retries) from the per-request/retry timeout (socket.timeout.ms).
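
As an illustration of the current behaviour, here is a minimal sketch (the 60000 value is just an example, not a recommendation) of lowering the per-message timeout so a stuck ProduceRequest, and with it the connection, is abandoned sooner than the default 5 minutes; message.timeout.ms is a topic-level property:

    #include <stdio.h>
    #include <librdkafka/rdkafka.h>

    int main(void) {
            char errstr[512];
            rd_kafka_topic_conf_t *tconf = rd_kafka_topic_conf_new();

            /* Fail stuck ProduceRequests (and close the connection) after 60s
             * instead of the default 300000 ms (5 minutes). */
            if (rd_kafka_topic_conf_set(tconf, "message.timeout.ms", "60000",
                                        errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
                    fprintf(stderr, "config error: %s\n", errstr);

            /* tconf would then be passed to rd_kafka_topic_new() when creating the topic handle. */
            rd_kafka_topic_conf_destroy(tconf);
            return 0;
    }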

@abhit011

abhit011 commented Dec 4, 2017

#1575
Sorry to hijack the conversation; could you please check whether this is the same issue?

Abhi
