Consumer removed from group but keeps consuming and committing successfully #2631
I can further confirm that 2 hours after the consumer went rogue, the commits started to fail. This failure allowed our code to reconnect to Kafka and rejoin the consumer group. Although this is a recovery, 2 hours is not really practical. If someone can suggest a better/faster recovery method, that would be greatly appreciated.
When the heartbeat fails with ERR_NOT_COORDINATOR_FOR_GROUP the consumer will re-query for the coordinator but remain in an active consumer state. The eventual recovery after 2 hours would seem to indicate that the cluster state is somewhat mangled, or that there is an error code that the consumer needs to handle specifically.
Thank you very much for your quick response. Yes, it was a pity I suppressed the other cgrp logs except for HEARTBEAT. I will reproduce with the full cgrp logs and update here.
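For reference, a minimal sketch of how the cgrp debug context can be enabled through the librdkafka C API (the broker list and group id below are placeholders; wrapper clients expose the same `debug` configuration property as a plain string):

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

int main(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();

    /* Placeholder broker list and group id. */
    rd_kafka_conf_set(conf, "bootstrap.servers",
                      "kafka01:9092,kafka02:9092,kafka03:9092",
                      errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "group.id", "my-group", errstr, sizeof(errstr));

    /* "cgrp" logs coordinator queries, (re)joins and HEARTBEATs;
     * "protocol" additionally logs the request/response traffic. */
    if (rd_kafka_conf_set(conf, "debug", "cgrp,protocol",
                          errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
        fprintf(stderr, "%s\n", errstr);
        return 1;
    }

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr));
    if (!rk) {
        fprintf(stderr, "failed to create consumer: %s\n", errstr);
        return 1;
    }
    /* ...subscribe and poll as usual... */
    rd_kafka_destroy(rk);
    return 0;
}
```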
Great news, I enabled cgrp logs, and managed to get 1 consumer working and 1 consumer failing. I think I can see where the problem is. After the group coordinator swapped from 2 to 3, the logs show:
kafka01 is telling us the GC is at 2 (but it is actually at 3). This goes on for 20 seconds, and during this time HEARTBEAT is getting "Not coordinator for group" since 2 is not the GC. For the next 2.5 minutes, it is happily consuming and committing:
Then, it gets another GC swap from kafka01, to go from 3 to 2:
and it keeps happily consuming and committing, while the heartbeat is saying it is talking to the wrong GC. Then it gets interesting. It queries all 3 kafkas for the GC but gets different answers: kafka01 and kafka03 say it is 2, but kafka02 says it is 3:
This goes on forever, and in the meantime the heartbeat is still reporting the incorrect group coordinator. Now for the consumer which worked (did not get stuck): it also queried all 3 kafkas and got the same conflicting replies as above, but it happened to join the correct GC and never had any issues after that. In the case above, it joined the wrong GC in the middle and subsequently got stuck. Seems like a difficult problem to solve. Our workaround is to trap the HEARTBEAT "not coordinator for group" error and, after 45s, disconnect and reconnect to the server (see the sketch below). It works, but it would be really great if the library could handle this transparently.
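In case it helps anyone hitting the same thing, here is a rough sketch of that workaround in plain librdkafka C (our real code lives in a wrapper). It assumes the heartbeat's NOT_COORDINATOR_FOR_GROUP condition is visible to the application, either via the error callback as below or by scraping the logs; broker, group and topic names are placeholders:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <librdkafka/rdkafka.h>

/* Timestamp of the first "not coordinator" error seen; 0 = healthy.
 * A real implementation would also clear this once the group recovers. */
static time_t coord_err_since = 0;

static void error_cb(rd_kafka_t *rk, int err, const char *reason, void *opaque) {
    if (err == RD_KAFKA_RESP_ERR_NOT_COORDINATOR_FOR_GROUP) {
        if (!coord_err_since)
            coord_err_since = time(NULL);   /* remember when trouble started */
    } else {
        fprintf(stderr, "kafka error %d: %s\n", err, reason);
    }
}

static rd_kafka_t *create_consumer(void) {
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();

    /* Placeholder configuration. */
    rd_kafka_conf_set(conf, "bootstrap.servers", "kafka01:9092", errstr, sizeof(errstr));
    rd_kafka_conf_set(conf, "group.id", "my-group", errstr, sizeof(errstr));
    rd_kafka_conf_set_error_cb(conf, error_cb);

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr));
    if (!rk) {
        fprintf(stderr, "failed to create consumer: %s\n", errstr);
        exit(1);
    }
    rd_kafka_poll_set_consumer(rk);

    rd_kafka_topic_partition_list_t *topics = rd_kafka_topic_partition_list_new(1);
    rd_kafka_topic_partition_list_add(topics, "my-topic", RD_KAFKA_PARTITION_UA);
    rd_kafka_subscribe(rk, topics);
    rd_kafka_topic_partition_list_destroy(topics);
    return rk;
}

int main(void) {
    rd_kafka_t *rk = create_consumer();
    for (;;) {
        rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 1000);
        if (msg) {
            /* ...process and commit as usual... */
            rd_kafka_message_destroy(msg);
        }
        /* If the wrong-coordinator condition has persisted for >45s,
         * tear the consumer down and rejoin the group from scratch. */
        if (coord_err_since && time(NULL) - coord_err_since > 45) {
            rd_kafka_consumer_close(rk);
            rd_kafka_destroy(rk);
            coord_err_since = 0;
            rk = create_consumer();
        }
    }
}
```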
I have re-tested this scenario with the Kafka Java client, and I can confirm it handles the case properly. Below are the logs:
You can observe that, like before, both kafka01 and kafka02 responded saying kafka03 is the GC... the client initially accepted this information but needed to confirm it with a heartbeat to kafka03... on an invalid response, it rejected the GC and proceeded to rediscover the correct GC. When it made a call to kafka03, the response said the GC is kafka02, and upon a successful heartbeat the client could confirm the GC is correct. I have not looked at the source of librdkafka, but just wanted to report the findings first, hoping for confirmation of the bug before going any further with the investigation. Based on the logs in the previous comment, I can assume librdkafka is not using the heartbeat response to validate the GC discovery call.
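To illustrate the difference, here is a toy model (not librdkafka or Java client code; `find_coordinator()` and `heartbeat_ok()` are hypothetical stand-ins for the FindCoordinator and Heartbeat requests) of the validation loop the Java client appears to use, with canned replies mimicking the logs above:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the FindCoordinator and Heartbeat requests.
 * The canned replies mimic the scenario above: the first query returns a
 * stale coordinator (kafka03), which rejects the heartbeat, and a re-query
 * then returns the real coordinator (kafka02). */
static int  queries = 0;
static int  find_coordinator(void) { return ++queries == 1 ? 3 : 2; }
static bool heartbeat_ok(int gc)   { return gc == 2; }

/* Coordinator discovery the way the Java client appears to do it: a
 * FindCoordinator answer is only trusted once a Heartbeat to that broker
 * succeeds; otherwise the coordinator is dropped and rediscovered. */
static int discover_coordinator(void) {
    for (;;) {
        int gc = find_coordinator();
        if (heartbeat_ok(gc))
            return gc;                  /* heartbeat confirmed the coordinator */
        printf("kafka%02d rejected heartbeat, rediscovering GC\n", gc);
    }
}

int main(void) {
    printf("confirmed GC: kafka%02d\n", discover_coordinator());
    return 0;
}
```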
Hi @edenhill, what is the status of this issue? We believe that we were affected by it on one of our sandbox environments. We are using the Confluent.Kafka dotnet wrapper at version 1.2.0, which targets librdkafka at the same version. Our brokers are at version 2.1.0. We have Kafka and Zookeeper in HA (each deployed with 3 nodes). There are multiple consumer groups and topics; consumer groups vary in size from 3 to 45 instances. In our case we must guarantee order across a partition and at-least-once delivery.

There was a maintenance action applied to the virtual hosts of the Kafka and Zookeeper instances, so they were taken down one by one within an interval to perform the maintenance task. It turned out that the action lacked proper monitoring of the cluster state, and a graceful shutdown was not guaranteed to be fully awaited. 3 days after the maintenance action we detected that some of the messages were processed in parallel, causing race conditions in our application. After investigation it turned out that after these 3 days there was an active-but-detached instance of a consumer group that was not active from the Kafka cluster perspective. It was not listed by the consumer-group API nor by the CLI scripts. We checked the logs of that instance, and judging by them, this instance believed that it was still active in the group and was still processing its assignment (3 days old). In the logs we found: We checked the broker logs and found that there was an issue reported: Sadly we had no option to turn on librd debug logging without restarting the instance, so I cannot provide more details. Could it correlate with the above?

I'm adding hot reload of the logging severity level within our Kafka abstraction layer. Is there an option to hot-swap the logging options after the consumer was created? If not, is there a significant performance footprint on your side if I enable your trace-level logs but do not process them unless required on my side?
So to clarify the situation: we had a competing-consumers scenario in which two instances with the same consumer-group-id were reading and processing the same partition.
Hi @edenhill, can you give some update on the status of this issue? In our system we strongly rely on the fact that Kafka will not allow multiple consumers that identify themselves with the same consumer group ID to read the same partition. We implement the CQRS pattern, where we have a separate microservice that owns the Write Model and a separate microservice that owns the Read Model. The Read Model is kept up to date by consuming events produced by the Write Model microservice. Our event handlers are idempotent, so we don't mind duplicate events as long as there is no competing-consumer scenario.

Problems start when we have a competing-consumer scenario (within one consumer group). Let's say we have 10 events that indicate state changes of a single entity. The first competing instance processed all 10 events and updated the Read Model to the latest version of the entity. Then the second competing instance started processing the same 10 events, but after processing 5 events it is restarted or in some other way acknowledges that it should not be consuming the same partition and stops consumption. As a result we have the Read Model in an inconsistent state (version 5 instead of 10). Those inconsistencies are very difficult to spot, especially when we have tens of millions of documents in our Read Model.

@mhowlett we are using the C# client library, which is a wrapper around librdkafka, so I don't know if you want to track this issue only in this repository or to create a separate issue also in the C# client repo.
Any update? @edenhill
Thank you for the detailed analysis, @keith-chew! I believe the issue is that librdkafka solely relies on the group coordinator to enforce the group session timeout, so all that happens when the HEARTBEAT fails with ERR_NOT_COORDINATOR_FOR_GROUP is that librdkafka performs a coordinator query (which may point to another broker) and continues the Heartbeats. This is a bug in librdkafka that we will fix for the upcoming v1.4.0 release.
Amazing! Thank you very much for the update @edenhill...!
If no successful Heartbeat has been sent in session.timeout.ms the consumer will trigger a local rebalance (rebalance callback with error code set to REVOKE_PARTITIONS). The consumer will rejoin the group when the rebalance has been handled.
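From the application's point of view this local rebalance arrives through the ordinary rebalance callback; a minimal C sketch of handling it (the synchronous commit before unassigning is optional):

```c
#include <librdkafka/rdkafka.h>

/* Minimal rebalance callback: when the fixed client decides the group is lost
 * (no successful Heartbeat within session.timeout.ms), it invokes this with
 * RD_KAFKA_RESP_ERR__REVOKE_PARTITIONS before rejoining the group. */
static void rebalance_cb(rd_kafka_t *rk, rd_kafka_resp_err_t err,
                         rd_kafka_topic_partition_list_t *partitions,
                         void *opaque) {
    switch (err) {
    case RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS:
        rd_kafka_assign(rk, partitions);   /* start fetching the new assignment */
        break;
    case RD_KAFKA_RESP_ERR__REVOKE_PARTITIONS:
        /* Optionally commit processed offsets synchronously before giving
         * the partitions up, then clear the assignment. */
        rd_kafka_commit(rk, NULL, 0 /* sync */);
        rd_kafka_assign(rk, NULL);
        break;
    default:
        rd_kafka_assign(rk, NULL);
        break;
    }
}

/* Registered on the configuration object before rd_kafka_new():
 *   rd_kafka_conf_set_rebalance_cb(conf, rebalance_cb);  */
```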
@edenhill I am getting the same issue with librdkafka 1.5.3. Rebalancing is happening and the partition got assigned to another consumer, but this particular consumer is not able to join back again.
Description
We have a case where the consumer was removed from the consumer group, but it kept on consuming and committing offsets successfully.
How to reproduce
Checklist
- librdkafka version: 1.1.0
- Apache Kafka version: 2.1.0
- librdkafka client configuration: standard configuration
- Operating system: rhel7
- Logs (with debug=.. as necessary) from librdkafka: see below

From client logs (it has been removed on the server side but still stays with the server, no errors):
From server logs:
From kafka groups:
Finally confirmed client is still consuming without a consumer group:
Normally when this case happens, the commit would fail and the client would reconnect to Kafka. But in this scenario there are no errors and the client continues to consume. The critical issue is that when the other consumer starts up and joins the proper group, this rogue consumer will be processing duplicate messages! If we have 2 consumers, that is 50% more duplicate messages, with 50% extra resources (CPU usage) used, which can be massive in a high-throughput environment.