
panic: sync: negative WaitGroup counter #449

Closed
cep21 opened this issue May 14, 2015 · 38 comments

@cep21 commented May 14, 2015

Seeing this occasionally in production:

Revision: 23d5233

panic: sync: negative WaitGroup counter

goroutine 444 [running]:
sync.(*WaitGroup).Add(0xc208e6c510, 0xffffffffffffffff)
        /usr/src/go/src/sync/waitgroup.go:67 +0x96
sync.(*WaitGroup).Done(0xc208e6c510)
        /usr/src/go/src/sync/waitgroup.go:85 +0x31
github.com/Shopify/sarama.(*asyncProducer).returnSuccesses(0xc208e6c4d0, 0xc29ffe9600, 0x3f, 0x40)
        /opt/goproj/Godeps/_workspace/src/github.com/Shopify/sarama/async_producer.go:753 +0xd3
github.com/Shopify/sarama.(*asyncProducer).flusher(0xc208e6c4d0, 0xc20814c540, 0xc2080b9800)
        /opt/goproj/Godeps/_workspace/src/github.com/Shopify/sarama/async_producer.go:560 +0xafe
github.com/Shopify/sarama.func·006()
        /opt/goproj/Godeps/_workspace/src/github.com/Shopify/sarama/async_producer.go:430 +0x43
github.com/Shopify/sarama.withRecover(0xc20802b220)
        /opt/goproj/Godeps/_workspace/src/github.com/Shopify/sarama/utils.go:42 +0x3a
created by github.com/Shopify/sarama.(*asyncProducer).messageAggregator
        /opt/goproj/Godeps/_workspace/src/github.com/Shopify/sarama/async_producer.go:430 +0x205
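
For context on what the runtime is reporting: sync.WaitGroup panics whenever its counter would drop below zero, i.e. whenever Done() is called more times than Add(). A minimal standalone illustration of just the standard-library behaviour (not sarama code):

    package main

    import "sync"

    func main() {
        var wg sync.WaitGroup
        wg.Add(1)
        wg.Done()
        wg.Done() // panics: "sync: negative WaitGroup counter"
    }
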
@wvanbergen (Contributor)

Is this while running normally, or when closing the producer?

@eapache (Contributor) commented May 14, 2015

What configuration are you running with? What do you see in the sarama logs when this occurs? I took a quick look and didn't see any obvious cases where the counter could get out of sync.

@cep21 (Author) commented May 14, 2015

This is between 3 and 8 minutes of running in production normally. We have two tiers of production traffic and only one was seeing this behavior (the one that gets the most traffic).

There is no information in stdout or stderr. There is nothing in the logs other than messages of the form "client/metadata fetching metadata for all topics from broker kafka17.blah.blah:9092", "Connected to broker at kafka18--bbac.blah:9092 (registered as #10)" and "producer/flusher/12 starting up"

We have reverted to version 3534171 and things are working great now

@eapache (Contributor) commented May 14, 2015

  • What values were you setting in the sarama.Config?
  • There is really nothing else in the logs immediately prior to the panic? No lines starting with producer/leader/ or producer/flusher/?
  • What version of Go are you using?

@cep21 (Author) commented May 14, 2015

Config:

    // Not pointers so we can deep copy the config
    conf := f.defaultSaramaConfig

    conf.ClientID = f.clientID

    conf.Net.DialTimeout = time.Second * 30

    conf.ChannelBufferSize = 256

    conf.Producer.RequiredAcks = sarama.WaitForLocal
    conf.Producer.Return.Errors = true
    conf.Producer.Return.Successes = true
    conf.Producer.Partitioner = partitioner
    conf.Producer.Flush.Frequency = time.Millisecond * 250
    conf.Producer.Flush.Messages = 3000
    conf.Producer.Flush.Bytes = 83886080

I see earlier lines of the form producer/flusher/12 starting up, but none matching leader.

< go version
go version go1.4.2 darwin/amd64
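
For reference, a minimal sketch of how a config like the one above is typically wired into an async producer; the broker address and client ID are placeholders (and the partitioner is omitted), not values from this thread:

    package main

    import (
        "log"
        "time"

        "github.com/Shopify/sarama"
    )

    func main() {
        conf := sarama.NewConfig()
        conf.ClientID = "my-client" // placeholder
        conf.Net.DialTimeout = 30 * time.Second
        conf.ChannelBufferSize = 256
        conf.Producer.RequiredAcks = sarama.WaitForLocal
        conf.Producer.Return.Errors = true
        conf.Producer.Return.Successes = true
        conf.Producer.Flush.Frequency = 250 * time.Millisecond
        conf.Producer.Flush.Messages = 3000
        conf.Producer.Flush.Bytes = 83886080

        // Broker list is a placeholder.
        producer, err := sarama.NewAsyncProducer([]string{"kafka1:9092"}, conf)
        if err != nil {
            log.Fatalln(err)
        }
        defer producer.AsyncClose()
    }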

@eapache (Contributor) commented May 14, 2015

I have gone through the producer code and double-checked every case where we touch the wait group in question; they all appear sane.

It would not terribly surprise me if there were some subtle bug in the retry logic (that is a fairly complex state machine), but anything touching that code path would generate logs which you are not seeing. The happy path in this case is much simpler and easily verified.

I suppose this could be a bug with Go's waitgroup code, but that seems unlikely; it is widely used, and you are using a normal stable version of Go.

How much traffic is seen at the tier where this occurs? We are running 68beb72 (v1.4.0) built with Go 1.4.2 on several hundred nodes, some of which are handling fairly substantial message throughput. If this was a common issue for you I'm very surprised we haven't run into it yet.

I'm at a bit of a loss; could you provide the traces of any other sarama goroutines from the panic output? Maybe they contain a clue...

@wvanbergen (Contributor)

In your sarama logs, do you see the producer retrying messages? A log snippet of a run that ends up with a crash may be useful.

@cep21 (Author) commented May 14, 2015

For what it's worth, the new version was working on some of our kafka tiers, just not one of them (where it got the panic).

I can email you our full stack trace from the panic, but don't want to put it on a public ticket.

@eapache (Contributor) commented May 14, 2015

Please email me then. My email is on my github profile.

@eapache (Contributor) commented May 14, 2015

OK, I'm gonna take a step back and walk through my logic: the counter going negative must mean one of the following three cases:

  • a bug in the counter itself or a memory corruption or something like that
  • we are decrementing it when we shouldn't (e.g. multiple times for a single message)
  • we are not incrementing it when we should (e.g. missing a message that is input)

The first case is not one we can realistically debug, and it is so unlikely I'm just going to ignore it.

For the second case, there are a number of cases I have checked, but it's not trivially verifiable. I'll come back to this one.

For the third case, the logic is pretty simple. The only case we might theoretically be missing a message is if it is submitted to the input channel with the retries counter already non-zero. How are you generating your ProducerMessages, just with a literal or new call? Do you reuse ones that come out as successes or errors, or are all input messages allocated fresh?
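
To make the second and third cases concrete, here is a hypothetical sketch of the invariant being discussed (this is not sarama's actual code): count a message exactly once when it first enters, and call Done() exactly once when it leaves as a success or an error.

    package main

    import "sync"

    // tracker is a stand-in for the producer's in-flight accounting.
    type tracker struct {
        inFlight sync.WaitGroup
    }

    // accept counts a message only on its first submission; case three above
    // is a message that arrives looking like a retry and is never counted.
    func (t *tracker) accept(firstAttempt bool) {
        if firstAttempt {
            t.inFlight.Add(1)
        }
    }

    // finish must run exactly once per accepted message; a second call for
    // the same message is case two above and panics the process.
    func (t *tracker) finish() {
        t.inFlight.Done()
    }

    func main() {
        var t tracker
        t.accept(true)
        t.finish()
        t.inFlight.Wait() // counter is back to zero
    }
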

@cep21 (Author) commented May 14, 2015

Hi, I emailed you

@eapache (Contributor) commented May 14, 2015

Email received, thanks. I'm updating my comment (which I accidentally submitted only half-typed) with a few other thoughts.

@cep21 (Author) commented May 14, 2015

I never reuse a ProducerMessage.

Callers of my API put messages on a channel and return.

One goroutine drains that channel as fast as it can and puts messages onto Kafka via k.producer.Input() <- messageToSend. Another goroutine drains k.producer.Errors() and k.producer.Successes() to process results.
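
A sketch of that pattern; the package name, function name, and incoming channel here are illustrative, not code from this thread:

    package kafkawriter

    import (
        "log"

        "github.com/Shopify/sarama"
    )

    // pump forwards caller-supplied messages to the producer and drains the
    // result channels so the producer never blocks.
    func pump(producer sarama.AsyncProducer, incoming <-chan *sarama.ProducerMessage) {
        // One goroutine drains the API channel as fast as it can.
        go func() {
            for messageToSend := range incoming {
                producer.Input() <- messageToSend
            }
        }()

        // Another goroutine processes results; both channels must be read
        // because Return.Errors and Return.Successes are enabled.
        go func() {
            for {
                select {
                case perr, ok := <-producer.Errors():
                    if !ok {
                        return
                    }
                    log.Println("produce failed:", perr.Err)
                case <-producer.Successes():
                    // record success
                }
            }
        }()
    }
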

@eapache (Contributor) commented May 14, 2015

OK, a few notes from diving into the logs:

@eapache (Contributor) commented May 14, 2015

I assume you haven't seen anything come out of the Errors channel immediately before the panic?

@eapache (Contributor) commented May 14, 2015

I'm afraid I'm out of coherent ideas on this one... maybe I'll have a brainwave overnight.

@cep21 (Author) commented May 14, 2015

Nothing in the logs from the error channel. We do Kafka rebalances at times. Would a misbehaving server cause this issue?

@eapache (Contributor) commented May 14, 2015

Would a misbehaving server cause this issue?

I wouldn't think so, all the error cases for malformed responses etc. are handled.

I'm honestly baffled by this issue; I've not been able to reproduce it, and I can't even think of additional logging or checks that might help me track it down. If I think of anything I'll let you know.

@cep21 (Author) commented May 18, 2015

I've been thinking about this code some more and wonder if the WaitGroup logic should be reverted. I've only used WaitGroup when the use case was dead simple: an Add followed by a goroutine that defers Done(). It's easy to follow and prove.

The WaitGroup use inside async_producer is not so simple to follow. Given that any errors in this logic, either now or 6 months later when the code has been refactored a few times, will result in unrecoverable panics that kill the entire application for the user, and the wait group code isn't strictly necessary, what are your thoughts on not using WaitGroup at all?

On another note, most WaitGroup usage relies on defer Done(), which makes sure Done is called even during a panic. Is it possible something panicked somewhere and caused either an Add() not to happen or a Done() to happen twice?
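
For comparison, the "dead simple" pattern being referred to, where the Add/Done pairing is obvious and survives panics in the worker:

    package main

    import "sync"

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 10; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done() // runs even if the work below panics
                // ... do work ...
            }()
        }
        wg.Wait()
    }
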

@eapache (Contributor) commented May 18, 2015

The WaitGroup logic was a fairly elegant fix for #419 and as per the comments in #420 it is actually simpler to reason about than the code that was present before (which is not to say it is simple, just simpler).

I'd be happy to find an even simpler solution which is not so fragile, but I don't know of one off the top of my head. When you say "the wait group code isn't strictly necessary" you seem to have one in mind - what is it?

Is it possible something panicked somewhere and caused either an Add() not to happen or a Done() to happen twice?

No; unless you set sarama.PanicHandler we don't rescue any panics. If some earlier panic had happened, your program would have crashed at that point and the WaitGroup being out-of-sync wouldn't have even come up.
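
For readers unfamiliar with that hook: sarama.PanicHandler is a package-level variable; if set, panics in sarama's internal goroutines are passed to it instead of crashing the process. A minimal sketch of setting it (the log message is illustrative):

    package main

    import (
        "log"

        "github.com/Shopify/sarama"
    )

    func main() {
        sarama.PanicHandler = func(v interface{}) {
            log.Println("recovered sarama panic:", v)
        }
        // ... construct clients and producers as usual ...
    }
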

@eapache (Contributor) commented Sep 5, 2015

@cep21 I'm curious if you've continued to see this, or if you've simply avoided any release after 1.3? We still have not seen anything like it and I still have no clue how it could be happening.

@cep21 (Author) commented Sep 7, 2015

We haven't upgraded b/c of this issue. I'm going to fork sarama and replace the waitgroup with a debug log that doesn't panic. I'll reply back if I can get it to log in production again.

@cep21 (Author) commented Sep 21, 2015

The fork we're deploying this week is at https://github.com/cep21/sarama (https://github.com/cep21/sarama/commit/da1ad6b748a2e3fbcda969e1e10ea1c9b2b16272) I will update if I can see it logging again.

@wvanbergen (Contributor)

Thanks for investigating this 👍

@cep21 (Author) commented Sep 23, 2015

This is still an issue.

Using https://github.com/cep21/sarama/commit/da1ad6b748a2e3fbcda969e1e10ea1c9b2b16272 within minutes of a deploy I see Had to recover from sarama state machine error: sync: negative WaitGroup counter

Do you want me to email another stack trace?

@cep21 (Author) commented Sep 23, 2015

The problem isn't reproducible on small amounts of traffic, only in production. I wonder if there is some kind of race or gotcha when it suddenly sees a large amount of traffic. Something else that's special about our deployment: we are using a larger-than-normal number of partitions (128).

@eapache (Contributor) commented Sep 24, 2015

Hmmm... what happens if you build and run the sample program at https://play.golang.org/p/t54O0eW2DI with your production settings (# brokers, # partitions, go compiler, etc)? It simulates the producer's use of the WaitGroup in a way that is trivially correct, so if it panics for you as well then we will know there is something wrong deeper in your toolchain. It should just run forever with no output.
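
The linked playground program is not reproduced in this thread; a rough stand-in for that kind of test, under the assumptions stated (many goroutines doing balanced Add/Done on one shared WaitGroup, forever, with no output unless something panics), might look like:

    package main

    import "sync"

    func main() {
        var wg sync.WaitGroup
        const workers = 128 // e.g. one goroutine per partition

        for i := 0; i < workers; i++ {
            go func() {
                for {
                    wg.Add(1)
                    wg.Done()
                }
            }()
        }

        select {} // run forever; a toolchain/runtime bug would eventually panic
    }
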

@eapache (Contributor) commented Sep 24, 2015

I don't think another stack trace would be too useful, but I'd be interested in seeing the complete sarama logs for a panicing run (right from Initializing new client at the beginning up to the Had to recover from sarama state machine error...).

@cep21 (Author) commented Sep 24, 2015

Building the playground program with our production settings on our production machines shows no output. They are AWS instances. It feels unlikely that only this one use of WaitGroup is broken due to a toolchain issue. I'll look at getting logs.

@cep21 (Author) commented Sep 24, 2015

The logs are nothing interesting (this paste may appear a bit out of order because our logs have one-second resolution, so they sometimes don't sort correctly).

Connected to broker at kafka22--bbaa.int.ourcorp.com:9092 (unregistered)
 client/metadata fetching metadata for all topics from broker kafka22--bbaa.int.ourcorp.com:9092
 Initializing new client
client/brokers registered new broker #10 at kafka26--bbac.int.ourcorp.com:9092
client/brokers registered new broker #6 at kafka25--bbab.int.ourcorp.com:9092
client/brokers registered new broker #20 at kafka33--bbab.int.ourcorp.com:9092
client/brokers registered new broker #14 at kafka22--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #9 at kafka17--bbab.int.ourcorp.com:9092
client/brokers registered new broker #21 at kafka34--bbac.int.ourcorp.com:9092
client/brokers registered new broker #12 at kafka20--bbab.int.ourcorp.com:9092
client/brokers registered new broker #17 at kafka30--bbab.int.ourcorp.com:9092
client/brokers registered new broker #13 at kafka24--bbac.int.ourcorp.com:9092
client/brokers registered new broker #18 at kafka31--bbac.int.ourcorp.com:9092
client/brokers registered new broker #7 at kafka36--bbac.int.ourcorp.com:9092
client/brokers registered new broker #19 at kafka32--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #8 at kafka16--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #16 at kafka29--bbaa.int.ourcorp.com:9092
Successfully initialized new client
client/brokers registered new broker #15 at kafka37--bbaa.int.ourcorp.com:9092
Initializing new client
client/brokers registered new broker #14 at kafka22--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #9 at kafka17--bbab.int.ourcorp.com:9092
client/brokers registered new broker #18 at kafka31--bbac.int.ourcorp.com:9092
client/metadata fetching metadata for all topics from broker kafka24--bbac.int.ourcorp.com:9092
 client/brokers registered new broker #20 at kafka33--bbab.int.ourcorp.com:9092
client/brokers registered new broker #13 at kafka24--bbac.int.ourcorp.com:9092
client/brokers registered new broker #17 at kafka30--bbab.int.ourcorp.com:9092
Connected to broker at kafka24--bbac.int.ourcorp.com:9092 (unregistered)
 client/brokers registered new broker #6 at kafka25--bbab.int.ourcorp.com:9092
client/brokers registered new broker #12 at kafka20--bbab.int.ourcorp.com:9092
client/brokers registered new broker #10 at kafka26--bbac.int.ourcorp.com:9092
client/brokers registered new broker #21 at kafka34--bbac.int.ourcorp.com:9092
client/brokers registered new broker #7 at kafka36--bbac.int.ourcorp.com:9092
Successfully initialized new client
client/brokers registered new broker #16 at kafka29--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #8 at kafka16--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #19 at kafka32--bbaa.int.ourcorp.com:9092
Initializing new client
client/brokers registered new broker #15 at kafka37--bbaa.int.ourcorp.com:9092
client/metadata fetching metadata for all topics from broker kafka24--bbac.int.ourcorp.com:9092
 Connected to broker at kafka24--bbac.int.ourcorp.com:9092 (unregistered)
 client/brokers registered new broker #10 at kafka26--bbac.int.ourcorp.com:9092
client/brokers registered new broker #14 at kafka22--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #6 at kafka25--bbab.int.ourcorp.com:9092
client/brokers registered new broker #20 at kafka33--bbab.int.ourcorp.com:9092
client/brokers registered new broker #21 at kafka34--bbac.int.ourcorp.com:9092
client/brokers registered new broker #9 at kafka17--bbab.int.ourcorp.com:9092
client/brokers registered new broker #13 at kafka24--bbac.int.ourcorp.com:9092
client/brokers registered new broker #17 at kafka30--bbab.int.ourcorp.com:9092
client/brokers registered new broker #7 at kafka36--bbac.int.ourcorp.com:9092
client/brokers registered new broker #12 at kafka20--bbab.int.ourcorp.com:9092
client/brokers registered new broker #16 at kafka29--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #18 at kafka31--bbac.int.ourcorp.com:9092
client/brokers registered new broker #15 at kafka37--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #19 at kafka32--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #8 at kafka16--bbaa.int.ourcorp.com:9092
Successfully initialized new client
producer/flusher/17 starting up
 producer/flusher/21 starting up
 producer/flusher/18 starting up
 producer/flusher/16 starting up
 producer/flusher/20 starting up
 producer/flusher/19 starting up
 Connected to broker at kafka29--bbaa.int.ourcorp.com:9092 (registered as #16)
 Connected to broker at kafka31--bbac.int.ourcorp.com:9092 (registered as #18)
 Connected to broker at kafka34--bbac.int.ourcorp.com:9092 (registered as #21)
 Connected to broker at kafka30--bbab.int.ourcorp.com:9092 (registered as #17)
 Connected to broker at kafka33--bbab.int.ourcorp.com:9092 (registered as #20)
 Connected to broker at kafka32--bbaa.int.ourcorp.com:9092 (registered as #19)
 sync: negative WaitGroup counter

@cep21 (Author) commented Sep 24, 2015

There is about a two minute delay between Connected to broker at kafka32--bbaa.int.ourcorp.com:9092 (registered as #19) and sync: negative WaitGroup counter

@eapache (Contributor) commented Sep 24, 2015

Do you see any messages out the Errors() channel during one of these runs?

edit: turns out I asked this back in May and you said no, so I assume it hasn't changed.

@cep21 (Author) commented Sep 24, 2015

We have a few attempted Kafka messages that are unencodable. They return an error from Encode() ([]byte, error) on purpose, and we see those errors out the other end and log them.
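
For concreteness, a hypothetical sketch of that kind of value: sarama.Encoder requires Encode() ([]byte, error) and Length() int, and an implementation can deliberately refuse to encode. The type name, field names, and error text here are made up:

    package payloads

    import (
        "errors"

        "github.com/Shopify/sarama"
    )

    // guardedPayload is a hypothetical Encoder that deliberately fails for
    // certain values, as described above.
    type guardedPayload struct {
        data []byte
        ok   bool
    }

    var _ sarama.Encoder = guardedPayload{}

    func (g guardedPayload) Encode() ([]byte, error) {
        if !g.ok {
            return nil, errors.New("payload rejected on purpose")
        }
        return g.data, nil
    }

    func (g guardedPayload) Length() int {
        return len(g.data)
    }
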

@eapache (Contributor) commented Sep 24, 2015

We have a few attempted Kafka messages that are unencodable. They return an error from Encode() ([]byte, error) on purpose, and we see those errors out the other end and log them.

That's what I needed to know. The bug is blindingly obvious now. I'll have a fix shortly.

@cep21 (Author) commented Sep 24, 2015

I see it too, I think. parseResponse() could see a message whose error had already been sent on the error channel when an error happened earlier?

@eapache (Contributor) commented Sep 24, 2015

Ya, buildRequest doesn't remove messages which fail encoding from the batch.
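
In other words, a hypothetical reduction of the failure mode (none of these names are sarama's real internals): a message that fails encoding gets returned as an error at request-build time but stays in the batch, so the response handling returns it again, and the second Done() drives the counter negative.

    package main

    import (
        "fmt"
        "sync"
    )

    type message struct{ encodable bool }

    func main() {
        var inFlight sync.WaitGroup
        batch := []*message{{encodable: true}, {encodable: false}}

        // Each message entering the producer is counted once.
        inFlight.Add(len(batch))

        finish := func(m *message, outcome string) {
            fmt.Println("returning message:", outcome)
            inFlight.Done()
        }

        // buildRequest-style step: the unencodable message is returned as an
        // error but (the bug) not removed from the batch.
        for _, m := range batch {
            if !m.encodable {
                finish(m, "encode error")
            }
        }

        // returnSuccesses-style step: the whole batch is returned again, so
        // the bad message gets a second Done(), which panics with
        // "sync: negative WaitGroup counter".
        for _, m := range batch {
            finish(m, "success")
        }
    }
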

eapache added a commit that referenced this issue Sep 24, 2015
Pending a proper refactor and tests etc.
@cep21 (Author) commented Sep 24, 2015

I think it would be a bug for returnError() or returnSuccesses() to operate on a msg twice. Do you have ideas on how to prevent this kind of thing in the future? Is there a pattern or way to structure the code such that returnError() or returnSuccess() end the lifetime of a message? Ideally the message itself would enter a state machine that terminates in one of those two nodes, but that may be difficult with all the batch processing.
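
One way to picture that suggestion, as a hypothetical sketch only (not proposed sarama code): give each message an explicit terminal flag so that ending its lifetime twice is detectable as a message-level error rather than a process-killing panic.

    package producersketch

    import "errors"

    // trackedMessage is a hypothetical wrapper whose lifetime must end in
    // exactly one terminal state: success or error.
    type trackedMessage struct {
        terminal bool
    }

    // finish marks the terminal state; a second call is reported instead of
    // corrupting shared accounting.
    func (m *trackedMessage) finish() error {
        if m.terminal {
            return errors.New("message already returned as success or error")
        }
        m.terminal = true
        return nil
    }
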

@eapache (Contributor) commented Sep 24, 2015

For the most part the message already flows through a state machine that terminates on one of those nodes. It's only in the final step, where we wrap messages up into requests, that this is no longer true (since, as you point out, that requires batching), and that's where the bug was :/

I'm going to have to sit down and think about the architecture; I think the final "correct" solution may involve more refactoring than I'd like.
