We are seeing a deadlock in the Kafka consumer when the Group Coordinator is unreachable for longer than the request timeout. The heartbeat thread and the AsyncClient each wait for the other to release a lock.
Based on our analysis, the deadlock comes from a race between the heartbeat thread and the AsyncClient, and it shows up when the consumer group coordinator stays unreachable beyond the maximum request timeout.
Three things happening together while the coordinator is unreachable can lead to this situation: a heartbeat request executing in parallel, the broker connection timing out, and the cancellation of any pending in-flight requests.
1) The consumer is polling continuously
2) The heartbeat is failing (unable to talk to the group coordinator) and the coordinator is marked dead
3) Heartbeat session expired, marking coordinator dead
- 00:45:55.498 -0700 to 00:49:56.374 -0700
4) Auto offset commit fails continuously (NodeNotReadyError), while poll(..) is still able to fetch records
- 00:45:10.520 -0700 to 00:50:03.251 -0700
5) The async client connection is closed after the request timeout has elapsed (timed out after 305000 ms. Closing connection)
- 00:50:03.253 -0700
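To make the timeline easier to compare against the 305000 ms request timeout, here is a small sketch that computes the length of each failure window from the timestamps quoted above. The helper name `seconds_between` is made up for illustration, and the script only does arithmetic on the logged times; it does not touch kafka-python.

```python
from datetime import datetime

# Timestamps are copied from the log excerpts above (all in the same -0700 zone,
# all on the same day), so plain time-of-day parsing is enough.
FMT = "%H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Hypothetical helper: elapsed seconds between two HH:MM:SS.mmm stamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Step 3: heartbeat session expired / coordinator marked dead
heartbeat_failure_s = seconds_between("00:45:55.498", "00:49:56.374")

# Step 4: auto offset commit failing with NodeNotReadyError
commit_failure_s = seconds_between("00:45:10.520", "00:50:03.251")

# Gap between the last commit failure (step 4) and the connection close (step 5)
close_gap_ms = seconds_between("00:50:03.251", "00:50:03.253") * 1000

print(f"heartbeat failures: {heartbeat_failure_s:.3f} s")
print(f"commit failures:    {commit_failure_s:.3f} s")
print(f"last commit failure to connection close: {close_gap_ms:.0f} ms")
```

The commit failures span roughly 293 s and stop about 2 ms before the connection is closed, which fits the reading that the close triggered by the request timeout is the last thing the consumer does before it hangs.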
Code Analysis
Main thread:
ConsumerGroup.Poll()
-> Consumer.Poll()
-> Activates heartbeat thread to send heartbeats
-> CommitOffsetAsync
-> AsyncClient.poll()
-> lock()
-> _poll()
-> conn.close(error) [ closes all connections whose requests timed out ]
-> Iterates in-flight requests and associated callbacks
-> Executes callbacks in a loop
-> HandleHeartBeatFailure()
-> lock() (waiting for heartbeat thread to release the lock)
Heartbeat thread:
-> lock()
-> AsyncClient.poll()
-> monitor.lock() (waiting for CommitOffsetAsync to release lock)
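The call chains above amount to a lock-order inversion: one thread takes the client lock and then wants the heartbeat lock, while the other takes them in the opposite order. The following is a minimal sketch of that pattern using plain `threading` primitives, not kafka-python itself. The names `client_lock`, `coordinator_lock`, and `simulate_lock_inversion` are hypothetical stand-ins, and the second acquire uses a timeout so the demo reports the stall instead of hanging forever.

```python
import threading


def simulate_lock_inversion(timeout: float = 0.5) -> dict:
    """Hypothetical demo: two threads take the same two locks in opposite order."""
    client_lock = threading.Lock()       # stand-in for the AsyncClient lock
    coordinator_lock = threading.Lock()  # stand-in for the heartbeat/coordinator lock
    sync = threading.Barrier(2)          # forces the unlucky interleaving
    got_second_lock = {}

    def main_thread_path():
        # Consumer.Poll() -> CommitOffsetAsync -> AsyncClient.poll() -> lock()
        with client_lock:
            sync.wait()  # both threads now hold their first lock
            # conn.close(error) runs callbacks -> HandleHeartBeatFailure() -> lock()
            ok = coordinator_lock.acquire(timeout=timeout)
            got_second_lock["main"] = ok
            if ok:
                coordinator_lock.release()
            sync.wait()  # keep holding the first lock until both attempts finish

    def heartbeat_path():
        # Heartbeat thread -> lock() -> AsyncClient.poll() -> monitor.lock()
        with coordinator_lock:
            sync.wait()
            ok = client_lock.acquire(timeout=timeout)
            got_second_lock["heartbeat"] = ok
            if ok:
                client_lock.release()
            sync.wait()

    threads = [
        threading.Thread(target=main_thread_path),
        threading.Thread(target=heartbeat_path),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second_lock


outcome = simulate_lock_inversion()
print(outcome)
```

Neither thread obtains its second lock in this sketch, so both entries in `outcome` are `False`. Without the timeout, which is how the consumer behaves in the trace above, both threads would block indefinitely.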
vsrini-ns changed the title from "Consumer client running in to a deadlock when Group Coordinator is not reachable" to "Consumer client running in to a deadlock" on Jul 11, 2020
Problem Details
Branch: Kafka-Python 1.4.7