Consumer deadlock in coordinator when heartbeat request times out #1985
Comments
Any resolution?
This problem happens almost once a day in our k8s container environment.
Are you seeing this too?
Experiencing this problem as well with version 2.0.1 of the lib. Any update?
It works with either of the following fixes: delete that lock acquisition, or acquire client._lock before acquiring coordinator._lock, in BaseCoordinator.ensure_active_group() and HeartbeatThread._run_once() in coordinator/base.py.
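For reference, a rough sketch of the second workaround (taking client._lock before coordinator._lock). The attribute path self.coordinator._client._lock is my assumption about how the client lock is reached from inside HeartbeatThread; this illustrates the lock-ordering idea rather than reproducing the exact upstream patch:

```python
# Illustrative sketch only -- not the exact upstream change. The idea is that
# HeartbeatThread._run_once() (and similarly ensure_active_group()) takes the
# client lock before the coordinator lock, so both threads always acquire the
# two locks in the same order and the circular wait cannot form.
def _run_once(self):
    # previously: with self.coordinator._lock:
    with self.coordinator._client._lock, self.coordinator._lock:
        ...  # existing heartbeat logic, unchanged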
We are also seeing this issue using kafka-python version 2.0.1. Our heartbeat thread is stuck at client_async.py:574, while our MainThread is stuck at base.py:1009. |
I think I understand what is happening now. Thanks for all the notes and pings, and apologies that it is still not fixed! Though, as always, PRs and contributions are welcome! Here is what I see:

The heartbeat thread sends a HeartbeatRequest. The request times out / does not receive a response / encounters a connection problem. Meanwhile, the heartbeat thread continues to process its loop, holding the coordinator lock.

The main thread grabs the client lock and begins processing responses / connections. It sees that there was a networking problem on the connection that is waiting for a HeartbeatResponse. The networking issue prompts the main thread to close the connection. This happens while the main thread is still holding the client lock, because networking and socket operations cannot run concurrently -- the lock is required. At this point BrokerConnection.close() is called and the connection attempts to fail all pending in-flight requests. One of these is the HeartbeatRequest. The failure is processed immediately, and the heartbeat_failure callback is executed. But this callback requires acquiring the coordinator lock, which the heartbeat thread is holding. So the main thread waits. Meanwhile, the heartbeat thread attempts to trigger networking via a call to client.poll(), which blocks because it cannot acquire the client lock. So we have a deadlock.

There are a few ways to attack this problem. The first would be to defer future failure processing in the main thread / client until after the client lock is released. Our primary defense against deadlock right now is to enforce a strict lock acquisition order: coordinator lock -> client lock (see #1821). We currently drop the client lock to process future.success callbacks for this reason. Unfortunately, connection errors are harder to isolate. There are some obvious places within client._poll where we call conn.close, and we could handle these directly. But there are other places within the BrokerConnection class that call self.close() at various other points in the connection lifecycle. To really stamp out this issue we would also need a way to collect failed in-flight-request (IFR) futures from all of these other places. Perhaps we need to add an interface to BrokerConnection that holds a deque of failed IFRs? Will need to think on this.

A second way to attack this problem is to give the HeartbeatThread its own client and stop sharing the client code between the two threads. This is closer to what the java client does and would more effectively prevent deadlock because it would eliminate the need for the threads to synchronize on the client lock (they would still need to synchronize on the coordinator lock).
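To make the lock-order inversion concrete, here is a toy reproduction of the scenario described above. It is not kafka-python code: client_lock and coordinator_lock simply stand in for client._lock and coordinator._lock, and each function mimics what the corresponding thread is doing when the deadlock hits.

```python
# Toy reproduction of the lock-order inversion; not kafka-python code.
# The heartbeat thread holds the coordinator lock and waits for the client
# lock; the main thread holds the client lock and waits for the coordinator
# lock. Timeouts are used so the demo terminates instead of hanging forever.
import threading
import time

client_lock = threading.Lock()
coordinator_lock = threading.Lock()

def heartbeat_thread():
    with coordinator_lock:                      # heartbeat loop holds the coordinator lock
        time.sleep(0.5)
        if not client_lock.acquire(timeout=2):  # client.poll() needs the client lock
            print("heartbeat thread: stuck waiting for client lock")
        else:
            client_lock.release()

def main_thread():
    with client_lock:                                # main thread is processing connections
        time.sleep(0.5)
        if not coordinator_lock.acquire(timeout=2):  # heartbeat-failure callback needs the coordinator lock
            print("main thread: stuck waiting for coordinator lock")
        else:
            coordinator_lock.release()

t = threading.Thread(target=heartbeat_thread)
t.start()
main_thread()
t.join()
```

Running this prints both "stuck waiting" messages, which is exactly the circular wait between the two threads described above.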
I would advise against using the first of these fixes (I haven't tested the second one). Removing that line makes consumers in consumer groups fail with a "CommitFailedError". I assume this is because that line is essential in communicating that the consumer is alive.
Adding some notes from my conversation with Dana on this:
@huangcuiyang Any luck getting pull request #2064 merged?
@huangcuiyang While waiting for your fix to be merged and released, did you find a workaround?
@mjattiot: Do you surmise that wrapping the code block in a
What are the alternatives now for this problem?
All my consumers in the group are disconnected, and I found that the consumer process hits a deadlock in the coordinator. I tried to fix the deadlock by upgrading the SDK from 1.4.1 to 1.4.7, but it happened again... help me @dpkp
The process is hung at coordinator/base.py:1006 in _handle_heartbeat_failure(), waiting on self.coordinator._lock.
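For anyone debugging a hung consumer like this, one general-purpose way to confirm where each thread is stuck (plain stdlib Python, nothing kafka-python specific) is to register a faulthandler signal handler in the consumer process:

```python
# General debugging aid: dump every thread's stack on demand so a hung
# consumer process can be inspected without restarting it.
# Not available on Windows.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```

Sending `kill -USR1 <pid>` then prints every thread's stack trace to stderr, which is how you can confirm that the heartbeat thread and MainThread are each blocked waiting on the other's lock.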