Fix consumer race panic on close #529

Merged: 1 commit into master from fix-consumer-race on Sep 1, 2015
Conversation

eapache (Contributor) commented Aug 31, 2015

I discovered a "send on closed channel" panic in the consumer while testing #527,
which I was finally able to track down. If a partition takes a long time to
drain to the user, then the responseFeeder reclaims its ownership token from
the broker so that the broker doesn't block its other partitions. However, if
the user closes the PartitionConsumer (closing the dying channel) then the
brokerConsumer will unconditionally return the ownership token to the dispatcher
even if the responseFeeder is holding it. This results in two ownership tokens
for the same partition (one in the feeder, one in the dispatcher), which leads to
all sorts of subtle brokenness. It manifested in at least two different "send on
closed channel" backtraces depending on the exact timing, and possibly more.

To fix, move the check on `child.dying` to the top of the `subscriptionConsumer`
loop, where we are guaranteed to have the ownership token. Combine that check
with the 'new subscriptions' check into an `updateSubscriptions` helper method.
The diff is huge because this lets us drop an indentation level in
`handleResponses`; I suggest reviewing with `w=1` to ignore whitespace.

@Shopify/kafka
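To make the shape of the fix concrete, here is a minimal sketch of the restructured loop. Only the names `subscriptionConsumer`, `updateSubscriptions`, and `child.dying` come from the description above; the struct fields and the `trigger` channel are simplified, hypothetical stand-ins for sarama's internals, not the actual diff:

```go
package sketch

// Hypothetical, pared-down stand-ins for sarama's internal types.
type partitionConsumer struct {
	dying   chan struct{} // closed when the user calls Close()
	trigger chan struct{} // stand-in for handing the token back to the dispatcher
}

type brokerConsumer struct {
	subscriptions    map[*partitionConsumer]struct{}
	newSubscriptions chan []*partitionConsumer
}

// updateSubscriptions combines the old 'new subscriptions' check with the
// child.dying check. It runs at the top of the loop, where this goroutine is
// guaranteed to hold the ownership token for every subscribed partition, so
// returning a token to the dispatcher here cannot race the responseFeeder.
func (bc *brokerConsumer) updateSubscriptions(newSubscriptions []*partitionConsumer) {
	for _, child := range newSubscriptions {
		bc.subscriptions[child] = struct{}{}
	}
	for child := range bc.subscriptions {
		select {
		case <-child.dying:
			close(child.trigger) // return the token, exactly once
			delete(bc.subscriptions, child)
		default:
		}
	}
}

func (bc *brokerConsumer) subscriptionConsumer() {
	for newSubscriptions := range bc.newSubscriptions {
		bc.updateSubscriptions(newSubscriptions)
		// ... fetch from the broker and feed responses to each child,
		// knowing no subscription can be torn down mid-iteration ...
	}
}
```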

eapache (Contributor, Author) commented Aug 31, 2015

For reference, the interesting snippet of one of the backtraces:

```
panic: send on closed channel

goroutine 35 [running]:
github.com/Shopify/sarama.(*partitionConsumer).responseFeeder(0xc208066100)
    /Users/eapache/src/github.com/Shopify/sarama/consumer.go:422 +0x3fc
github.com/Shopify/sarama.*partitionConsumer.(github.com/Shopify/sarama.responseFeeder)·fm()
    /Users/eapache/src/github.com/Shopify/sarama/consumer.go:153 +0x27
github.com/Shopify/sarama.withRecover(0xc20809b310)
    /Users/eapache/src/github.com/Shopify/sarama/utils.go:42 +0x3a
created by github.com/Shopify/sarama.(*consumer).ConsumePartition
    /Users/eapache/src/github.com/Shopify/sarama/consumer.go:153 +0x533
```

In this case, the incorrect return of the token to the dispatcher caused the last remaining reference to the broker to be dropped. The broker then shuts down, and when the feeder tries to re-acquire it, the channel has been closed out from underneath.
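The ordering is easier to see stripped of sarama entirely. This is a hypothetical, self-contained repro of just the sequence described above; the channel, the string, and the sleeps are illustrative stand-ins, not sarama code:

```go
package main

import "time"

func main() {
	// Stand-in for the brokerConsumer's subscription channel.
	input := make(chan string)

	// "brokerConsumer": the token is wrongly returned to the dispatcher,
	// the last reference to this broker is dropped, and it shuts down,
	// closing its input channel.
	go func() {
		time.Sleep(10 * time.Millisecond)
		close(input)
	}()

	// "responseFeeder": it still holds its copy of the ownership token;
	// once it finishes draining to the user it tries to re-acquire the
	// broker, but the channel has been closed out from underneath it.
	time.Sleep(20 * time.Millisecond)
	input <- "ownership token" // panic: send on closed channel
}
```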

wvanbergen (Contributor)

Took me a while to grok what was causing the issue, but I get it now. Nice find 👍

eapache (Contributor, Author) commented Sep 1, 2015

Ya, it took me a long time to figure out as well. Normally once I get a backtrace it's pretty obvious, but I stared at this one for a long time and had to add all sorts of extra logging before I figured out how that channel ended up closed.

The design of the console consumer is interesting in that it starts consuming each partition serially, and doesn't start reading until that has finished, so all but the last handful of partitions time out and disassociate during startup. Good for testing this sort of thing :)

eapache added a commit referencing this pull request on Sep 1, 2015: Fix consumer race panic on close
eapache merged commit 9c6285c into master on Sep 1, 2015
eapache deleted the fix-consumer-race branch on Sep 1, 2015 at 11:42