Fix consumer race panic on close #529

Merged: 1 commit into master from fix-consumer-race on Sep 1, 2015
Conversation

eapache (Contributor) commented Aug 31, 2015

I discovered a "send on closed channel" panic in the consumer while testing #527,
which I was finally able to track down. If a partition takes a long time to
drain to the user, then the responseFeeder reclaims its ownership token from
the broker so that the broker doesn't block its other partitions. However, if
the user closes the PartitionConsumer (closing the dying channel) then the
brokerConsumer will unconditionally return the ownership token to the dispatcher
even if the responseFeeder is holding it. This results in two ownership tokens
for the same partition (one in the feeder, one in the dispatcher), which leads to
all sorts of subtle brokenness. It manifested in at least two different "send on
closed channel" backtraces depending on the exact timing, and possibly more.

To fix, move the check on `child.dying` to the top of the `subscriptionConsumer`
loop, where we are guaranteed to have the ownership token. Combine that check
with the 'new subscriptions' check into an `updateSubscriptions` helper method.
The diff is huge because this lets us drop an indentation level in
`handleResponses`; I suggest reviewing with `w=1` to ignore whitespace.

@Shopify/kafka
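To make the shape of the fix concrete, here is a minimal sketch of the restructured loop. Only the names `subscriptionConsumer`, `updateSubscriptions`, and `child.dying` come from the description above; the struct fields and the `trigger` channel are simplified, hypothetical stand-ins for sarama's internals, not the actual diff:

```go
package sketch

// Hypothetical, pared-down stand-ins for sarama's internal types.
type partitionConsumer struct {
	dying   chan struct{} // closed when the user calls Close()
	trigger chan struct{} // stand-in for handing the token back to the dispatcher
}

type brokerConsumer struct {
	subscriptions    map[*partitionConsumer]struct{}
	newSubscriptions chan []*partitionConsumer
}

// updateSubscriptions combines the old 'new subscriptions' check with the
// child.dying check. It runs at the top of the loop, where this goroutine is
// guaranteed to hold the ownership token for every subscribed partition, so
// returning a token to the dispatcher here cannot race the responseFeeder.
func (bc *brokerConsumer) updateSubscriptions(newSubscriptions []*partitionConsumer) {
	for _, child := range newSubscriptions {
		bc.subscriptions[child] = struct{}{}
	}
	for child := range bc.subscriptions {
		select {
		case <-child.dying:
			close(child.trigger) // return the token, exactly once
			delete(bc.subscriptions, child)
		default:
		}
	}
}

func (bc *brokerConsumer) subscriptionConsumer() {
	for newSubscriptions := range bc.newSubscriptions {
		bc.updateSubscriptions(newSubscriptions)
		// ... fetch from the broker and feed responses to each child,
		// knowing no subscription can be torn down mid-iteration ...
	}
}
```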

eapache (Contributor, Author) commented Aug 31, 2015

For reference, the interesting snippet of one of the backtraces:

```
panic: send on closed channel

goroutine 35 [running]:
github.com/Shopify/sarama.(*partitionConsumer).responseFeeder(0xc208066100)
    /Users/eapache/src/github.com/Shopify/sarama/consumer.go:422 +0x3fc
github.com/Shopify/sarama.*partitionConsumer.(github.com/Shopify/sarama.responseFeeder)·fm()
    /Users/eapache/src/github.com/Shopify/sarama/consumer.go:153 +0x27
github.com/Shopify/sarama.withRecover(0xc20809b310)
    /Users/eapache/src/github.com/Shopify/sarama/utils.go:42 +0x3a
created by github.com/Shopify/sarama.(*consumer).ConsumePartition
    /Users/eapache/src/github.com/Shopify/sarama/consumer.go:153 +0x533
```

In this case, the incorrect return of the token to the dispatcher caused the last remaining reference to the broker to be dropped. The broker then shuts down, and when the feeder tries to re-acquire it, the channel has been closed out from underneath.
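The ordering is easier to see stripped of sarama entirely. This is a hypothetical, self-contained repro of just the sequence described above; the channel, the string, and the sleeps are illustrative stand-ins, not sarama code:

```go
package main

import "time"

func main() {
	// Stand-in for the brokerConsumer's subscription channel.
	input := make(chan string)

	// "brokerConsumer": the token is wrongly returned to the dispatcher,
	// the last reference to this broker is dropped, and it shuts down,
	// closing its input channel.
	go func() {
		time.Sleep(10 * time.Millisecond)
		close(input)
	}()

	// "responseFeeder": it still holds its copy of the ownership token;
	// once it finishes draining to the user it tries to re-acquire the
	// broker, but the channel has been closed out from underneath it.
	time.Sleep(20 * time.Millisecond)
	input <- "ownership token" // panic: send on closed channel
}
```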

wvanbergen (Contributor)

Took me a while to grok what was causing the issue, but I get it now. Nice find 👍

eapache (Contributor, Author) commented Sep 1, 2015

Ya, it took me a long time to figure out as well. Normally once I get a backtrace it's pretty obvious, but I stared at this one for a long time and had to add all sorts of extra logging before I figured out how that channel ended up closed.

The design of the console consumer is interesting in that it starts consuming each partition serially, and doesn't start reading until that has finished, so all but the last handful of partitions time out and disassociate during startup. Good for testing this sort of thing :)

eapache added a commit referencing this pull request on Sep 1, 2015: Fix consumer race panic on close
eapache merged commit 9c6285c into master on Sep 1, 2015
eapache deleted the fix-consumer-race branch on Sep 1, 2015 at 11:42