Deadlock producing during shutdown #831
some thoughts: this chan recv seems to be where we're mostly blocked (Line 487 in b77dd13),
and it should be closed by this goroutine above (Line 471 in b77dd13),
which is normally blocked on this condvar wait (Line 475 in b77dd13),
which is woken up right before the chan recv above (Line 486 in b77dd13).
but the condvar wait is blocked on re-locking the mutex, per the stack trace above. so we are blocked on shutdown until whoever has that lock gives it up. who has it?
We no longer believe that the kafka cluster's unavailability was related, as that was taken offline several days into this deadlock.
I think I see the problem. The goroutine leaves the mutex locked (for reasons). If it runs and returns and closes I'll confirm if this is the problem, fix it, and add a test case for this.
diff --git a/pkg/kgo/producer.go b/pkg/kgo/producer.go
index d9cca99..1ea2a29 100644
--- a/pkg/kgo/producer.go
+++ b/pkg/kgo/producer.go
@@ -480,12 +480,24 @@ func (cl *Client) produce(
}()
drainBuffered := func(err error) {
- p.mu.Lock()
- quit = true
+ // The expected case here is that a context was
+ // canceled while we were waiting for space, so we are
+ // exiting and need to kill the goro above.
+ //
+ // However, it is possible that the goro above has
+ // already exited AND the context was canceled, and
+ // `select` chose the context-canceled case.
+ //
+ // So, to avoid a deadlock, we need to wakeup the
+ // goro above in another goroutine.
+ go func() {
+ p.mu.Lock()
+ quit = true
+ p.mu.Unlock()
+ p.c.Broadcast()
+ }()
+ <-wait // we wait for the goroutine to exit, then unlock again (since the goroutine leaves the mutex locked)
p.mu.Unlock()
- p.c.Broadcast() // wake the goroutine above
- <-wait
- p.mu.Unlock() // we wait for the goroutine to exit, then unlock again (since the goroutine leaves the mutex locked)
p.promiseRecordBeforeBuf(promisedRec{ctx, promise, r}, err)
}

is what I'm thinking
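The proposed fix can be reduced to a standalone sketch with stdlib primitives only (hypothetical names; this is not franz-go's actual code, just the shape of the patch): the worker goroutine deliberately exits while holding the mutex, and the drain path sets `quit` and broadcasts from a *separate* goroutine, so the drain path never holds the lock that the worker must re-acquire inside `Wait`.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var (
		mu   sync.Mutex
		c    = sync.NewCond(&mu)
		quit bool
		wait = make(chan struct{})
	)

	// Worker: stands in for the goroutine at producer.go:471. It blocks on
	// c.Wait() and, on exit, intentionally leaves mu locked (cond.Wait
	// re-acquires the locker before returning).
	go func() {
		mu.Lock()
		for !quit {
			c.Wait()
		}
		close(wait) // exit with mu still held, as in the real code
	}()

	// Drain path: stands in for drainBuffered after the fix. The wakeup runs
	// in its own goroutine, so even if the worker has already exited (holding
	// mu), this path cannot deadlock the caller.
	go func() {
		mu.Lock()
		quit = true
		mu.Unlock()
		c.Broadcast()
	}()

	<-wait      // wait for the worker to exit...
	mu.Unlock() // ...then release the lock it left held
	fmt.Println("shutdown complete, no deadlock")
}
```

Whichever goroutine runs first, the program terminates: if the wakeup goroutine wins the lock, the worker observes `quit` before ever calling `Wait`; otherwise the broadcast wakes the waiting worker.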
thanks for the quick response @twmb! i'm trying to understand your proposed fix right now.
tldr summary is:
We can wake up concurrently
how does the spawned goroutine exit if
Standard case,
I see, thanks. I think that makes sense.
@twmb do you have an estimate on how long it will take to fix this issue? I really appreciate your work on it -- just want to get an estimate so i can make decisions on how to move forward on our end. I'm working on a repro unit test in this repository currently as well, and i'll let you know if that bears fruit.
@twmb i have a reliable repro unit test here: https://github.com/asg0451/franz-go/blob/7fe24ae1ac0bf8150a18c09459509d014a9f05e7/pkg/kgo/produce_request_test.go#L624C1-L724C2 - it usually fails after a matter of seconds. significantly, this never fails when each worker thread has its own client. Produce/ProduceSync is supposed to be thread-safe, right? EDIT: i applied the patch you posted above and the test seems to not be failing with it.
PR #832 also has a reliable reproducer -- I had this in a branch, sorry you put in the effort as well (the test in 832 is a bit smaller, hope it was fun to write 😅). I'm going to work on this repo the rest of the day, trying to fix some other misc more minor problems, and tag either today or tomorrow. If I can get through the feature requests as well, I'll roll this into a 1.18 release; if not, a patch for 1.17.
no problem, it was kind of fun to write ;) that's great to hear, thanks. i'll be watching this issue eagerly 👀 |
Merging my PR, but await the tag. |
ETA today. I'll reopen and then close once I push.
Releasing this evening (merging PRs at the moment), closing now to ensure I'm addressing everything for the next release. |
hey @twmb, i see a changelog entry for 1.18.0 -- thanks for getting this done! |
Problem:
* Record A exceeds max, is on path to block
* Record B finishes concurrently
* Record A's context cancels
* Record A's goroutine waiting to be unblocked returns, leaves accounting mutex in locked state
* Record A's select statement chooses context-canceled case, trying to grab the accounting mutex lock

See twmb#831 for more details. Closes twmb#831.
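The failing sequence above can be reduced to a tiny standalone Go program (hypothetical names; not the library's actual code). The waiter goroutine exits while holding the mutex by design, so the plain `Lock` that the old `drainBuffered` then performed on the context-canceled path could never succeed; `TryLock` stands in for that `Lock` so the sketch can report the deadlock instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex
	c := sync.NewCond(&mu)
	quit := true // "Record B finishes concurrently": nothing left to wait for
	wait := make(chan struct{})

	// Stand-in for record A's goroutine: since quit is already set, it skips
	// the condvar wait and returns with mu still held (as the real code
	// intends -- its caller is supposed to unlock after <-wait).
	go func() {
		mu.Lock()
		for !quit {
			c.Wait()
		}
		close(wait)
	}()

	<-wait
	// "Record A's select chooses the context-canceled case": the old drain
	// path now did a bare Lock. The mutex is still held by the exited
	// goroutine, so that Lock would block forever.
	if !mu.TryLock() {
		fmt.Println("deadlock: mutex still held by the exited goroutine")
	}
}
```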
* Fix typo in kgo.Client.ResumeFetchTopics() docs
* kadm: add `NewOffsetFromRecord` helper function
* Fix typo in Record.ProducerID doc comment
* Don't set nil config when seeding topics in kfake cluster
  Setting the configs to `nil` causes a panic later when trying to alter the topic configs, as the code only checks for the map entry not being present, not for it being nil.
* Add Opts method for sr.Client
* Merge pull request twmb#826 from colega/don-t-set-nil-config-when-seeding-topics-in-kfake-cluster
* Merge pull request twmb#821 from seizethedave/davidgrant/producer-doc
* Merge pull request twmb#812 from mihaitodor/fix-doc-typo
* kgo: fix potential deadlock when reaching max buffered (records|bytes)
  Problem:
  * Record A exceeds max, is on path to block
  * Record B finishes concurrently
  * Record A's context cancels
  * Record A's goroutine waiting to be unblocked returns, leaving the accounting mutex locked
  * Record A's select statement chooses the context-canceled case and tries to grab the accounting mutex lock
  See twmb#831 for more details. Closes twmb#831.
* all: unlint what is now cropping up (the gosec ones are deliberate; the govet ones are now randomly showing up, and are also deliberate)
* Merge pull request twmb#832 from twmb/831 (kgo: fix potential deadlock when reaching max buffered (records|bytes))
* kgo: misc doc update
* kgo: ignore OOOSN where possible (see embedded comment; this precedes handling KIP-890). Closes twmb#805.
* kip-890 definitions: a bunch of version bumps to indicate TransactionAbortable is supported as an error return
* kip-848 more definitions, added in Kafka 3.8: ListGroups.TypesFilter and the ConsumerGroupDescribe request
* kip-994 proto: only ListTransactions was modified in 3.8
* sr: add StatusCode to ResponseError, and a message if the body is empty. Closes twmb#819.
* generate / kmsg: update GroupMetadata{Key,Value} (not much changed here). Closes twmb#804.
* kgo: do not add all topics to internal tps map when regex consuming
  The internal tps map is meant to hold the topicPartitions we are candidates to consume; assignPartitions filters it down to only partitions actually being consumed. Storing all topics in the map was not harmful, but it was not the intent. When regex consuming, the metadata function previously always put all topics into the map. The regex evaluation logic -- previously duplicated in the direct and group consumers -- is now one function used for filtering within metadata, so we no longer store information for topics we are not consuming. Indirectly, this fixes a bug where `GetConsumeTopics` always returned ALL topics when regex consuming, because it just returned what was in the `tps` field. This adds a test for the fixed behavior, as well as tests that NOT regex consuming always returns all topics the user is interested in. Closes twmb#810.
* Merge pull request twmb#833 from twmb/proto-3.8.0 (Proto 3.8.0)
* kgo: support Kafka 3.8's kip-890 modifications
  Still not all of KIP-890, despite what was originally coded: Kafka 3.8 only added support for TransactionAbortable; producers still need to send AddPartitionsToTxn.
* kversion: note kip-848 additions for kafka 3.8
* kversion: note kip-994 added in 3.8, finalize 3.8
* kversion: ignore API keys 74,75 when guessing versions (these are in Kraft only, and are two requests from two separate KIPs that aren't fully supported yet; not sure why only these two were stabilized)
* README: note 3.8 KIPs
* kgo: bump kmsg pinned dep
* Merge pull request twmb#840 from twmb/kafka-3.8.0 (Kafka 3.8.0)
* Merge pull request twmb#760 from twmb/753 (kgo: add AllowRebalance and CloseAllowingRebalance to GroupTransactSession)
* Merge pull request twmb#789 from sbuliarca/errgroupsession-export-err (kgo: export the wrapped error from ErrGroupSession)
* Merge pull request twmb#794 from twmb/790 (kgo: add TopicID to the FetchTopic type)
* Merge pull request twmb#814 from noamcohen97/new-offset-helper (kadm: add `NewOffsetFromRecord` helper function)
* Merge pull request twmb#829 from andrewstucki/sr-client-opts (Add Opts method for sr.Client)
* Merge pull request twmb#834 from twmb/805 (kgo: ignore OOOSN where possible)
* Merge pull request twmb#835 from twmb/819 (sr: add StatusCode to ResponseError, and message if the body is empty)
* Merge pull request twmb#838 from twmb/810 (kgo: do not add all topics to internal tps map when regex consuming)
* CHANGELOG: note incoming release
* Merge pull request twmb#841 from twmb/1.18-changelog (CHANGELOG: note incoming release)
* pkg/sr: require go 1.22 (no real reason, and no real reason not to; this also allows one commit after the top-level franz tag)
* Merge pull request twmb#842 from twmb/sr-1.22 (pkg/sr: require go 1.22)
* pkg/kadm: bump go deps
* Merge pull request twmb#843 from twmb/kadm (pkg/kadm: bump go deps)

Signed-off-by: Mihai Todor <todormihai@gmail.com>
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Co-authored-by: Mihai Todor <todormihai@gmail.com>
Co-authored-by: Noam Cohen <noam@noam.me>
Co-authored-by: David Grant <seizethedave@gmail.com>
Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com>
Co-authored-by: Andrew Stucki <andrew.stucki@redpanda.com>
Co-authored-by: Travis Bischel <travis.bischel+github@gmail.com>
We have observed what looks like a deadlock in franz-go.
We are using the ProduceSync method, and when we cancel the passed-in context, the call blocks forever.

Client configuration:

kgo version: v1.17.1

Of note is that the client seems to be busy looping attempting to connect to the kafka broker, which i guess went away:
Some relevant call stacks:
I'll keep looking at this, but your assistance would be greatly appreciated.