
Infinite loop when consumer doesn't have leaders for all partitions #1204


Closed
rmca opened this issue Sep 8, 2017 · 5 comments

Comments

@rmca

rmca commented Sep 8, 2017

If a partition doesn't have a leader, a consumer will block and loop indefinitely on creation. Ideally this would throw an error so the service can handle that case and do something sensible.

The problem is that Fetcher._retrieve_offsets is called by Fetcher._reset_offset without a timeout:

offsets = self._retrieve_offsets({partition: timestamp})
Since the consumer cannot find a leader for all partitions, the call to
future = self._send_offset_requests(timestamps)
fails and keeps failing on every retry. The default timeout is infinite, which causes an infinite loop.
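
To illustrate the pattern, here is a simplified, standalone sketch of the retry loop described above (not the actual Fetcher source; `lookup_offsets` stands in for sending the offset request and polling its future):

```python
import time

def retrieve_offsets(lookup_offsets, timestamps, timeout_ms=float("inf")):
    """Simplified stand-in for the offset-retrieval retry loop described above."""
    start = time.time()
    remaining_ms = timeout_ms
    while remaining_ms > 0:
        offsets = lookup_offsets(timestamps)  # returns None while a partition has no leader
        if offsets is not None:
            return offsets
        time.sleep(0.1)  # retry backoff before asking again
        remaining_ms = timeout_ms - (time.time() - start) * 1000.0
    raise TimeoutError("Failed to get offsets within %s ms" % timeout_ms)

# With the default timeout_ms of infinity, remaining_ms never runs out, so a
# leaderless partition keeps this loop -- and the consumer -- spinning forever:
#   retrieve_offsets(lambda ts: None, {("topic", 0): -1})  # never returns
```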

@jeffwidman
Contributor

jeffwidman commented Sep 11, 2017

Thanks for the report.

There are no easy solutions here, as the general assumption is that a partition without a leader indicates a cluster in the middle of a failover, which is typically fairly transient, so retries make sense.

I would not throw an error immediately, because that would require every service to know enough about Kafka internals to handle it, which sounds error-prone and tedious. I also would prefer to keep the default timeout at infinite: if a partition has a few hours of downtime, I don't want to also have to worry about tracking down all the consumers and restarting them once that partition is fixed.

That said, I do see how in specific situations it'd be nice to be able to override the infinite timeout.

I'm curious how the Java consumer handles this case?

PS: I edited your question slightly to update the formatting w/o changing the message contents.

@rmca
Author

rmca commented Sep 19, 2017

Thanks for the response! Sorry for the delay in responding. This got away from me slightly.

So, my main issue with the current behaviour is that it's not possible to get any information out of a consumer when it's blocked like this. As you said, a cluster in the middle of a failover could take several hours to recover, and in the meantime the consumer is stuck in an infinite loop, so it can't output metrics or logs explaining why it isn't processing anything.
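
As an aside, one way to at least make the stall visible is to do the blocking startup in a worker thread and log while it's stuck. A rough sketch follows (the names `start_consumer_with_watchdog` and `warn_after_s` are made up for the example, and this doesn't unblock anything, it only surfaces the stall):

```python
import logging
import threading

from kafka import KafkaConsumer

log = logging.getLogger(__name__)

def start_consumer_with_watchdog(topic, bootstrap_servers, warn_after_s=30):
    """Create the consumer in a worker thread so the service can keep logging
    (or emitting metrics) while startup is blocked on a leaderless partition."""
    result = {}

    def _create():
        # The call reported above to block indefinitely.
        result["consumer"] = KafkaConsumer(topic, bootstrap_servers=bootstrap_servers)

    worker = threading.Thread(target=_create, daemon=True)
    worker.start()

    waited = 0
    while True:
        worker.join(timeout=warn_after_s)
        if not worker.is_alive():
            break
        waited += warn_after_s
        log.warning("Consumer startup still blocked after ~%ss; "
                    "a partition may have no leader", waited)

    return result.get("consumer")  # None if creation failed in the worker thread
```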

Changing the current behaviour seems like a bad idea, but maybe a decent compromise would be an optional timeout for callers that need one?

I'll check and see what the Java library behaviour is, but it may take me a few days to try that out with my current commitments.

BTW, I updated the links in my initial message to refer to specific commits.


@dpkp
Owner

dpkp commented Oct 22, 2017

I believe this is a dup of #686

@dpkp
Owner

dpkp commented Oct 25, 2017

closing as duplicate

@dpkp dpkp closed this as completed Oct 25, 2017