
Merge pull request #147 from Azure/hmlam-patch-1
Update CONFIGURATION.md
hmlam authored Mar 26, 2021
2 parents 0fd45be + 8f31305 commit b474fab
Showing 2 changed files with 2 additions and 0 deletions.
1 change: 1 addition & 0 deletions CONFIGURATION.md
@@ -71,3 +71,4 @@ Symptoms | Problem | Solution
----|---|-----
Offset commit failures due to rebalancing | Your consumer is waiting too long in between calls to poll() and the service is kicking the consumer out of the group. | You have several options: <ul><li>increase session timeout</li><li>decrease message batch size to speed up processing</li><li>improve processing parallelization to avoid blocking consumer.poll()</li></ul> Applying some combination of the three is likely wisest.
Network exceptions at high produce throughput | Are you using Java client + default max.request.size? Your requests may be too large. | See Java configs above.
Seeing frequent rebalancing because consumers frequently leave the group | Check your client-side logs; you should find a line saying "Member [some member-id] sending LeaveGroup request to coordinator [xyz] due to consumer poll timeout has expired". This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. | There are several settings you can tweak: <ul><li>increase max.poll.interval.ms (but then rebalances may take longer)</li><li>speed up processing by reducing the maximum size of batches returned in poll() with max.poll.records (which may impact performance due to less batching)</li><li>improve processing parallelization to avoid blocking consumer.poll() for too long</li></ul> Applying some combination of the three is likely necessary to get the best balance for your scenario (see the sketch after this table).
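
The first two options translate directly into consumer configuration. Here is a minimal sketch assuming the Java client; the namespace, group id, and the specific values (600000 ms, 100 records) are placeholders to illustrate the trade-offs, not recommendations, and the Event Hubs SASL settings are omitted.

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebalanceTuningSketch {
    public static KafkaConsumer<String, String> buildConsumer() {
        Properties props = new Properties();
        // Placeholder namespace and group id; Event Hubs SASL settings omitted for brevity.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "NAMESPACE.servicebus.windows.net:9093");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Option 1: give the poll loop more headroom before the consumer is evicted from the group
        // (trade-off: rebalances may take longer to complete).
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000"); // Kafka default: 300000

        // Option 2: hand back smaller batches so each poll() iteration finishes sooner
        // (trade-off: less batching per call).
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");        // Kafka default: 500

        return new KafkaConsumer<>(props);
    }
}
```

The third option, parallelizing the processing itself, is sketched after the README excerpt that follows.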
1 change: 1 addition & 0 deletions README.md
@@ -67,6 +67,7 @@ Dedicated clusters do not have throttling mechanisms - you are free to consume a
There is no exception or error when this happens, but the Kafka logs will show that the consumers are stuck trying to re-join the group and assign partitions. There are a few possible causes:

* Make sure that your `request.timeout.ms` is at least the recommended value of 60000 and your `session.timeout.ms` is at least the recommended value of 30000. Having these too low could cause consumer timeouts which then cause rebalances (which then cause more timeouts which then cause more rebalancing...)
* Check your Kafka client-side logs and see if your consumers are issuing a LeaveGroup request. The log message usually gives you details as to why the consumers are leaving, but the most common cause is that the poll interval timed out - i.e. your processing logic spent so much time processing messages that the next poll() did not run within the configured max.poll.interval.ms (see the sketch after this list).
* If your configuration matches those recommended values, and you're still seeing constant rebalancing, feel free to open up an issue (make sure to include your entire configuration in the issue so we can help debug)!
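
A minimal sketch of that pattern, assuming the Java client: the recommended timeout floors are set explicitly, and heavy per-record work is handed to a worker pool so the next poll() happens well within max.poll.interval.ms. The namespace, topic, pool size, and process() body are hypothetical, and the Event Hubs SASL settings are omitted.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PollLoopSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder namespace, group id, and topic; Event Hubs SASL settings omitted for brevity.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "NAMESPACE.servicebus.windows.net:9093");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Recommended floors from above, set explicitly.
        props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000");
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");

        ExecutorService workers = Executors.newFixedThreadPool(4); // pool size is illustrative
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records) {
                    // Hand heavy work to the pool so the poll loop keeps calling poll() promptly.
                    workers.submit(() -> process(record));
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // Placeholder for application-specific processing.
        System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
    }
}
```

Note that handing records off to another thread changes delivery semantics: with enable.auto.commit left at its default, offsets can be committed for records that are still in flight, so commit manually after processing completes if you need at-least-once behavior.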

### Compression / Message Format Version issue
