Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix sarama consumer deadlock #2587

Merged
merged 3 commits into from
Nov 2, 2020

Conversation

albertteoh
Copy link
Contributor

@albertteoh albertteoh commented Oct 25, 2020

Signed-off-by: albertteoh albert.teoh@logz.io

Which problem is this PR solving?

Short description of the changes

Prevent deadlock between the handleErrors() and Close() functions trying to acquire the partitionLock and access the partition WaitGroup by:

  • Removing per-partition WaitGroups
  • Removing the need for acquiring the partitionLock within Close()

Additional improvements:

  • More correct ordering of close calls: Ensure underlying sarama consumer is closed first to prevent repartitioning, causing potentially new partition consumers to be created that attempt to write to already closed channels.
  • Instead, Close() will rely on Consumer-scoped WaitGroups to determine graceful closure of message and error handlers. These WaitGroups will be managed through goroutines monitoring messagesDone and errorsDone channels, which will be produced to by the respective handler functions.
  • Remove the redundant loop to close partition consumers which is already performed by closing the aforementioned underlying sarama consumer.
  • Additional diagnostic logging, mostly at debug, to help pickup race-related errors.
  • Sarama tests use the New Consumer constructor to DRY up duplicated construction logic.

Testing

  • Local unit tests have not failed any Sarama tests.
  • Confirm this failure no longer occurs on Travis branch builds (in fork) on Sarama tests.
  • Local integration tests with Kafka, Cassandra, collector, agent, query and mock trace generator running:
    • Start ingester binary ./cmd/ingester/ingester-darwin-amd64 --log-level debug:
      • Confirm successful startup.
      • Confirm span data consumed and showing correctly in Jaeger UI.
      • Confirm metrics look fine: rate(jaeger_kafka_spans_written_total[5m]) and rate(jaeger_ingester_sarama_consumer_messages_total[5m]) more or less line up.
    • Manually send SIGINT with Ctrl+C to trigger shutdown:
      • Confirm shutdown completes without hanging.
      • Confirm no Kafka consumer related error messages or panics present.
      • Test under two conditions: when there is 0 lag, and when the consumer is lagging behind; terminating while busy consuming messages to catch potential race conditions.
    • Force a race condition between Consumer start and shutdown to test potential race conditions:
      • Command: ./cmd/ingester/ingester-darwin-amd64 --log-level debug & pid1="$\!"; sleep 0.1; kill $pid1
      • Confirm shutdown attempted before consumer starts.
      • Confirm graceful shutdown completes successfully without error.
      • Output:
{"level":"debug","ts":1603594054.544082,"caller":"consumer/deadlock_detector.go:149","msg":"Global deadlock detector disabled"}
{"level":"info","ts":1603594054.5441122,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1603594054.544123,"caller":"flags/service.go:154","msg":"Shutting down"}
{"level":"info","ts":1603594054.544128,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"unavailable"}
{"level":"info","ts":1603594054.544121,"caller":"consumer/consumer.go:85","msg":"Starting main loop"}
{"level":"debug","ts":1603594054.544126,"caller":"consumer/consumer.go:107","msg":"Waiting for done signals"}
{"level":"info","ts":1603594054.5441332,"caller":"consumer/consumer.go:118","msg":"Closing parent consumer"}
{"level":"debug","ts":1603594054.544141,"caller":"consumer/consumer.go:107","msg":"Waiting for done signals"}
{"level":"info","ts":1603594054.55716,"caller":"consumer/consumer.go:196","msg":"Starting error handler","partition":0}
{"level":"info","ts":1603594054.557175,"caller":"consumer/consumer.go:138","msg":"Starting message handler","partition":0}
{"level":"debug","ts":1603594054.561001,"caller":"consumer/consumer.go:121","msg":"Closing deadlock detector"}
{"level":"info","ts":1603594054.7042592,"caller":"consumer/consumer.go:206","msg":"Finished handling errors","partition":0}
{"level":"debug","ts":1603594054.704276,"caller":"consumer/consumer.go:124","msg":"Waiting for messages and errors to be handled"}
{"level":"debug","ts":1603594054.704284,"caller":"consumer/consumer.go:109","msg":"Received done signal","msg":"HandleErrors done"}
{"level":"debug","ts":1603594054.704291,"caller":"consumer/deadlock_detector.go:100","msg":"Partition deadlock detector disabled"}
{"level":"info","ts":1603594054.704301,"caller":"consumer/consumer.go:163","msg":"Message channel closed. ","partition":0}
{"level":"info","ts":1603594054.704475,"caller":"consumer/consumer.go:187","msg":"Closing partition consumer","partition":0}
{"level":"info","ts":1603594054.704489,"caller":"consumer/consumer.go:190","msg":"Closed partition consumer","partition":0}
{"level":"debug","ts":1603594054.704499,"caller":"consumer/consumer.go:109","msg":"Received done signal","msg":"HandleMessages done"}
{"level":"debug","ts":1603594054.704504,"caller":"consumer/consumer.go:128","msg":"Closing message and error done channels"}
{"level":"info","ts":1603594055.207753,"caller":"flags/service.go:162","msg":"Shutdown complete"}

@albertteoh albertteoh requested a review from a team as a code owner October 25, 2020 05:54
@albertteoh albertteoh requested a review from vprithvi October 25, 2020 05:54
@codecov
Copy link

codecov bot commented Oct 25, 2020

Codecov Report

Merging #2587 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2587   +/-   ##
=======================================
  Coverage   95.06%   95.07%           
=======================================
  Files         209      209           
  Lines        9370     9364    -6     
=======================================
- Hits         8908     8903    -5     
+ Misses        387      385    -2     
- Partials       75       76    +1     
Impacted Files Coverage Δ
cmd/ingester/app/consumer/consumer.go 100.00% <100.00%> (ø)
cmd/query/app/static_handler.go 83.03% <0.00%> (-1.79%) ⬇️
plugin/storage/integration/integration.go 77.90% <0.00%> (+0.55%) ⬆️
...lugin/sampling/strategystore/adaptive/processor.go 100.00% <0.00%> (+0.92%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81761cd...b612f2c. Read the comment docs.

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned that this makes the logic even more complex by introducing additional signaling. My suggestion was to instead simplify it by reducing # of goroutines and handing more events in a single select{}, i.e. messages, errors, deadlock events. The errors handling is especially trivial (just a counter bump). Then perhaps with this refactoring the WaitGroup won't be necessary at all.

@albertteoh
Copy link
Contributor Author

albertteoh commented Oct 25, 2020

I am concerned that this makes the logic even more complex by introducing additional signaling. My suggestion was to instead simplify it by reducing # of goroutines and handing more events in a single select{}, i.e. messages, errors, deadlock events. The errors handling is especially trivial (just a counter bump). Then perhaps with this refactoring the WaitGroup won't be necessary at all.

Yeah, good call, agree it's more complex than necessary.

I think we can do away completely with additional goroutines and channels/signalling that this PR introduced, though I think a waitgroup (at consumer scope) is still necessary as we still want some mechanism to block Close() until messages and errors have been flushed/handled. So in this case, as you pointed out, the errors handling would bump the waitgroup counter.

If we use a single select, IIUC, would it not require introducing new channels to listen on messages, error events? Or maybe I've misunderstood something.

albertteoh added 2 commits October 26, 2020 09:07
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
@albertteoh albertteoh force-pushed the fix-flaky-sarama-chan branch from 592e1d0 to 409c72a Compare October 25, 2020 22:08
@@ -78,17 +79,11 @@ func (c *Consumer) Start() {
c.logger.Info("Starting main loop")
for pc := range c.internalConsumer.Partitions() {
c.partitionMapLock.Lock()
if p, ok := c.partitionIDToState[pc.Partition()]; ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this is safe to remove? From the comment is seems like a case where a partition gets re-assigned to the same service instance, so there's a potentially another partition consumer for this partition ID already running (or more likely shutting down).

@vprithvi do you have any more color to add to this condition?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We noticed the case that you described a couple of times, but weren't able to reproduce it in tests.
Since then, many things have changed with both Kafka and Sarama. I think we can remove this now in the interest of simplifying and revisit if see problems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did it manifest when it happened? Committed offsets going backwards?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this happens when new messages arrive on the partition (because the Messages() channel isn't closed yet) after the partition consumer was closed (via the c.closePartition(p.partitionConsumer) call), causing a new partition consumer to be created.

This PR proposes to solve this by calling c.internalConsumer.Close() first within Close() which closes the Messages() channel, preventing new messages from arriving, and then closes the Partitions() channel, preventing new PartitionConsumers from being created.

func (c *Consumer) handleMessages(pc sc.PartitionConsumer) {
c.logger.Info("Starting message handler", zap.Int32("partition", pc.Partition()))
c.partitionMapLock.Lock()
c.partitionsHeld++
c.partitionsHeldGauge.Update(c.partitionsHeld)
wg := &c.partitionIDToState[pc.Partition()].wg
c.partitionMapLock.Unlock()
defer func() {
c.closePartition(pc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this closePartition still necessary provided internalConsumer.Close() closes all partitions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I think you're right, @vprithvi. 👍

However, closePartition() also increments a metrics counter, which is used in tests to detect whether if the partition is closed or not; neither the Consumer nor PartitionConsumer interface appear to offer an "OnClosed" callback or channel to signal successful closure of a partition.

We could instead replace c.closePartition(pc) with c.partitionMetrics(pc.Partition()).closeCounter.Inc(1) to bump up the metric and remove the closePartition() function, if that sounds reasonable to you?

yurishkuro
yurishkuro previously approved these changes Nov 2, 2020
@mergify mergify bot dismissed yurishkuro’s stale review November 2, 2020 03:22

Pull request has been modified.

@yurishkuro yurishkuro merged commit 2a0d8cc into jaegertracing:master Nov 2, 2020
@yurishkuro
Copy link
Member

Thanks!

@albertteoh albertteoh deleted the fix-flaky-sarama-chan branch November 2, 2020 04:07
@albertteoh
Copy link
Contributor Author

Thanks for the review, @yurishkuro & @vprithvi!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test: TestSaramaConsumerWrapper_start_Messages
3 participants