Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: prevent DescribeLogDirs hang in admin client #2269

Merged
merged 2 commits into from
Jul 22, 2022

Conversation

zerowidth
Copy link
Contributor

This fixes a goroutine/waitgroup bug in the admin client. If an unknown broker ID is passed to DescribeLogDirs, it will hang forever.

It's a bug with waitgroups, where a waitgroup can be incremented without an associated decrement leading to an infinite wait. I think it's simple enough that it can be fixed with just visual inspection, but I included a test case just to be sure.

It's tricky to get a codebase into this scenario, but has occurred in our systems a few times. In summary, two goroutines using a shared sarama.Client:

  • One goroutine does various things to get a list of (then valid) broker IDs
  • Another goroutine runs a PartitionConsumer
  • A broker goes offline
  • The consumer goroutine, as part of its PartitionConsumer error recovery, eventually calls RefreshMetadata, updating the internal brokers list in the Client to remove the broker that's offline.
  • The first goroutine calls DescribeLogDirs with the original broker list, which still includes the dead broker
  • DescribeLogDirs
    • increments a waitgroup
    • calls findBroker, which doesn't find the dead broker in client.Brokers()
    • logs the error then continues the loop without firing off a goroutine with a wg.Done() to decrement
    • the final wg.Wait() hangs.

This was noted in the original PR introducing this admin API call (#1646 (comment)), but the bug occurs before any network IO is involved.

If DescribeLogDirs is called with an unknown broker ID, it will hang.
If a broker ID wasn't found, the loop would `continue` after incrementing the
waitgroup. Since there is no corresponding `wg.Done()` in that case, the final
`wg.Wait()` would hang forever.
@ghost ghost added the cla-needed label Jul 15, 2022
@dnwe dnwe added the fix label Jul 20, 2022
@dnwe
Copy link
Collaborator

dnwe commented Jul 20, 2022

@zerowidth 👋🏻 thanks for your contribution and the detailed description. Your changes look good — are you in a position to sign the Shopify CLA?

@zerowidth
Copy link
Contributor Author

are you in a position to sign the Shopify CLA?

I’ve signed the CLA but wasn’t able to rerun the probot check. Once that’s done we should be good to go. Thanks!

@ghost ghost removed the cla-needed label Jul 22, 2022
Copy link
Collaborator

@dnwe dnwe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for your contribution!

@dnwe dnwe merged commit e1fbb94 into IBM:main Jul 22, 2022
@dnwe dnwe changed the title Fix DescribeLogDirs hang in admin client fix: prevent DescribeLogDirs hang in admin client Jul 22, 2022
@zerowidth zerowidth deleted the fix-admin-describe-log-dir-hang branch November 22, 2022 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants