KAFKA-3949: Fix race condition between group rebalance and metadata update #1241

Closed
jeffwidman opened this issue Oct 6, 2017 · 4 comments

Collaborator

jeffwidman commented Oct 6, 2017

Details in https://issues.apache.org/jira/browse/KAFKA-3949

Note that the fix refactored a fair bit of code for managing subscription state: apache/kafka#1762

And then KIP-70 (tracked in #1242) further modified this code.

Related: #1237 / #1240 / #1242

Collaborator Author

jeffwidman commented Oct 24, 2017

@dpkp, I saw you self-assigned this... are you actively working on it?

I'm once again seeing the symptoms of #1237 from time to time in the test suite for our internal wrapper around kafka-python at my day job. I assume this is the root cause, although I haven't had time to verify yet.

At my day job, it's a priority to get this fixed: a particular service that uses pattern subscription runs multiple consumer groups across many hosts, so noticing when they're stalled due to zombie subscriptions is very difficult.

I've spent some time reading through the Java fix (which appears more straightforward than I initially thought), but haven't started porting it yet. Mostly I've been trying to figure out how to write a test that forces this race condition (#1251).

@dpkp I am certain you will be much faster than I will at porting the Java fix, so if you're interested in this, by all means please tackle it. But if you don't think you'll be able to get to it for several weeks, let me know as I'll be able to spend more time on it next week.

Owner

dpkp commented Oct 24, 2017

Yes, I have a PR ready but it depends on #1266 so I have not pushed it yet.

dpkp added this to the 1.4 Release milestone Jan 12, 2018

Collaborator Author

jeffwidman commented Jan 29, 2018

@dpkp now that #1266 is merged, do you mind pushing up your PR for this?

I've got another team that is hitting consumers silently failing to consume every once in a while, and I suspect it's due to this...

Owner

dpkp commented Jan 29, 2018

Yes -- unfortunately it has some complex merge conflicts that need working out. Hoping to get these resolved + tested locally this week.
