Client doesn't send heartbeat request, then marks coordinator as dead #2149
Can you try 2.0.2? I haven't looked in detail, but there's a chance this was solved by #2064...
We have migrated to confluent-kafka-python, which solved the problem. Since the problem reproduces only in the production environment and Kafka is not under our maintenance, we have no way to retest with 2.0.2. Sorry... We do heavy processing with neural networks and mxnet inside message processing. I thought that maybe mxnet acquires the GIL in a way that prevents the heartbeat thread from running. Is there a chance that GIL-blocking code disrupts kafka-python's internal processes? Also, our problem doesn't look similar to #2064: that one is a complete deadlock, while in our case we see intermittent interruptions in the heartbeat-sending process.
No problem. confluent-kafka-python wraps librdkafka, which I believe does not hold the GIL while processing background tasks like heartbeats.
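For comparison, here is a minimal confluent-kafka-python poll loop, as a sketch: the broker address, topic name, and `heavy_processing` function are placeholders, not taken from this issue. Because librdkafka drives heartbeats from its own native threads, Python-level GIL contention does not delay them:

```python
from confluent_kafka import Consumer

def heavy_processing(payload):
    # placeholder for the mxnet-based inference described above
    pass

consumer = Consumer({
    'bootstrap.servers': 'broker:9092',  # placeholder address
    'group.id': 'FOOBAR',
    'auto.offset.reset': 'earliest',
})
consumer.subscribe(['events'])  # placeholder topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(msg.error())
            continue
        # Heartbeats keep flowing from librdkafka's C threads even while
        # this call holds the GIL for a long time.
        heavy_processing(msg.value())
finally:
    consumer.close()
```

With this model, a long GIL hold in the handler only delays the next `poll()` call; it cannot starve the heartbeat itself.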
Yes, if you have a system that is holding the GIL for longer than 15 seconds then the heartbeat thread would never get scheduled, and that would cause this type of behavior. BUT, holding the GIL for that long seems a bit strange. Threads automatically release the GIL on IO, and CPU-bound threads will also automatically release the GIL after a set number of cpython instructions. I haven't looked closely at mxnet, but maybe mxnet is holding the GIL while pushing CPU load to an external library and so not triggering additional cpython instructions (which would prevent automatic GIL release)?
If mxnet is holding the GIL for an extended period of time, you would not expect to see any logs from any other part of the system during that time. Is that what your logs show?
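One cheap way to test the GIL theory, independent of kafka-python: run a tiny ticker thread next to the workload and watch for gaps in its timestamps. This is a diagnostic sketch, not project code; the 1-second tick and 5-second alarm threshold are arbitrary (chosen to sit above the 3-second default heartbeat interval):

```python
import threading
import time

last_tick = time.monotonic()

def ticker():
    # Every loop iteration needs the GIL, so a long gap between ticks
    # means something held the GIL (or stalled the process) that long --
    # and the kafka-python heartbeat thread would have been starved too.
    global last_tick
    while True:
        now = time.monotonic()
        gap = now - last_tick
        if gap > 5.0:  # illustrative threshold
            print(f"ticker starved for {gap:.1f}s")
        last_tick = now
        time.sleep(1.0)

threading.Thread(target=ticker, daemon=True).start()
# ... run the mxnet-based message processing in the main thread ...
```

If the ticker reports multi-second stalls that line up with inference calls, the heartbeat thread is being starved the same way.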
Also worth noting that the …
Logs (some details, like IPs, are hidden or replaced with foobar):
In the logs above I see:
`Sending request HeartbeatRequest_v1(group='FOOBAR', generation_id=7240, member_id='FOOBAR_3-fa7e3d3c-5be1-489f-be03-ee9f7a016d04')`
This situation repeats even if I keep a single consumer in the group. I should mention that Kafka is accessed over the internet.
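To see whether the heartbeat thread actually goes quiet during processing, it can help to capture timestamped, thread-tagged debug logs. kafka-python names its loggers after module paths, so a sketch like this should cover the heartbeat machinery under `kafka.coordinator.*`:

```python
import logging

# Timestamped, thread-tagged log lines make heartbeat gaps easy to spot.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(threadName)s %(name)s %(levelname)s %(message)s',
)
# kafka-python loggers follow module paths; the heartbeat and
# coordinator code logs under kafka.coordinator.*
logging.getLogger('kafka.coordinator').setLevel(logging.DEBUG)
```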
Consumer settings:
kafka-python version is 2.0.1
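For illustration, these are the kafka-python 2.0.x `KafkaConsumer` parameters most relevant to heartbeat and coordinator timeouts over a high-latency link. The values below are the library defaults, shown as a sketch, not the settings from this report; topic and broker address are placeholders:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'events',                          # placeholder topic
    bootstrap_servers='broker:9092',   # placeholder address
    group_id='FOOBAR',
    session_timeout_ms=10000,     # broker marks the member dead after this
    heartbeat_interval_ms=3000,   # background heartbeat cadence (< session timeout)
    max_poll_interval_ms=300000,  # budget for processing between poll() calls
    request_timeout_ms=305000,    # per-request network timeout
)
```

Raising `session_timeout_ms` buys headroom over a slow link, but if the GIL is held longer than the session timeout, the coordinator will still be marked dead.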