
Channel consumer example contradicts the documentation #161

Closed
vladaionescu opened this issue Mar 24, 2018 · 6 comments

@vladaionescu (Contributor)

Description

In the example, we can see that in case of an error, consumption is stopped (suggesting that an error is fatal): https://github.com/confluentinc/confluent-kafka-go/blob/master/examples/consumer_channel_example/consumer_channel_example.go#L83
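
For context, the pattern in question looks roughly like this (a simplified sketch, not a verbatim copy of the linked file; the broker address, group id, and topic are placeholders):

```go
package main

import (
	"fmt"
	"os"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":        "localhost:9092",
		"group.id":                 "example-group",
		"go.events.channel.enable": true, // deliver events on c.Events()
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	if err := c.SubscribeTopics([]string{"example-topic"}, nil); err != nil {
		panic(err)
	}

	run := true
	for run {
		switch e := (<-c.Events()).(type) {
		case *kafka.Message:
			fmt.Printf("Message on %s: %s\n", e.TopicPartition, string(e.Value))
		case kafka.Error:
			// This is the behaviour in question: the loop exits on any
			// error, i.e. every error is treated as fatal.
			fmt.Fprintf(os.Stderr, "Error: %v\n", e)
			run = false
		}
	}
}
```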

In the documentation we have the following explanation (emphasis mine):

Generic events for both Consumer and Producer

  • KafkaError - client (error codes are prefixed with _) or broker error. These errors are normally just informational since the client will try its best to automatically recover (eventually).

I have a few questions:

  1. Which one is correct?
  2. If some errors are informational while others are fatal, how should the user application distinguish between the two?
  3. In case of a fatal error, should the application just crash, or is it reasonable to attempt to close the consumer and create a new one to recover?

How to reproduce

N/A

Checklist

Please provide the following information:

  • confluent-kafka-go and librdkafka version (LibraryVersion()):
```toml
[[constraint]]
  name = "github.com/confluentinc/confluent-kafka-go"
  version = "=0.11.0"
```

librdkafka 0.11.1-r1 (https://pkgs.alpinelinux.org/package/v3.7/community/x86_64/librdkafka)

  • Broker version

0.11.0

@edenhill (Contributor)

These are well-warranted questions.

Some answers:

  1. The documentation is more correct: since the client will automatically try to recover from errors (i.e., there are no errors it permanently gives up on), the default behaviour of an app should be not to terminate on errors. The idea is that, for a properly configured client on a properly managed cluster, there should be no permanent errors; there will be temporary errors, but they will eventually get sorted out, either automatically or by human intervention.
  2. We're lagging behind here: we really need to provide a list of errors and how the application should handle them. Additionally, we should make the error reporting richer by providing a severity, to let an application automate some of these decisions. This work is on our backlog.
  3. This depends on the application's requirements: if a producer receives a "fatal" error, should the application wait for in-queue/in-flight messages to be transmitted or fail, e.g., by calling flush()? If a consumer receives the same, should it attempt to commit final offsets? If the answers are yes, then do proper termination and cleanup (see the sketch after this list). If not, maybe don't.
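
To make point 3 concrete, here is a minimal sketch of such a termination path, assuming the application has already decided to shut down; the helper name and the 15-second flush timeout are arbitrary choices, not part of the library:

```go
package app

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// shutdown performs the "proper termination and cleanup" from point 3:
// flush the producer, commit final consumer offsets, then close both.
func shutdown(p *kafka.Producer, c *kafka.Consumer) {
	// Wait up to 15s for in-queue/in-flight messages to be delivered or fail.
	if remaining := p.Flush(15 * 1000); remaining > 0 {
		fmt.Printf("%d message(s) still undelivered at shutdown\n", remaining)
	}
	p.Close()

	// Attempt to commit final offsets before closing the consumer.
	if _, err := c.Commit(); err != nil {
		fmt.Printf("final offset commit failed: %v\n", err)
	}
	if err := c.Close(); err != nil {
		fmt.Printf("consumer close failed: %v\n", err)
	}
}
```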

@vladaionescu (Contributor, Author)

Makes sense - thanks for the detailed explanations.

@JeanMertz commented Jun 7, 2018

Is there any news to report on this front? I'm really interested in knowing whether you have a list somewhere that tells us which errors are safe to ignore and which should be acted on by closing the consumer/producer and restarting the application.

In our processors, we now ignore these two errors (based on searching the issues in this repository, such as this one: #48 (comment)):

```go
kafka.ErrTransport
kafka.ErrAllBrokersDown
```

and (properly) close and restart the application on any other error, but I'm sure there are many more we should be adding to our "safe errors" list.
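
A sketch of that filtering approach, for illustration; the set of "safe" codes and the helper names here are this application's choices, not an official list from the library:

```go
package processor

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// isSafeError reports whether err carries one of the two codes we have
// chosen to treat as transient and ignore.
func isSafeError(err kafka.Error) bool {
	switch err.Code() {
	case kafka.ErrTransport, kafka.ErrAllBrokersDown:
		return true
	default:
		return false
	}
}

// handleError logs "safe" errors and reports whether the event loop
// should keep running; any other error triggers a clean shutdown.
func handleError(err kafka.Error) (keepRunning bool) {
	if isSafeError(err) {
		log.Printf("ignoring transient Kafka error: %v", err)
		return true
	}
	log.Printf("fatal Kafka error, shutting down: %v", err)
	return false
}
```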

@edenhill (Contributor)

@JeanMertz As a general rule, all errors are to be considered informational and temporary, given that your client and cluster are correctly configured. librdkafka will retry pretty much all operations.

But we will document the full set of error codes for each API so that users can make an educated choice about which errors they might consider permanent. This is on our backlog and scheduled for the fall.

@JeanMertz

@edenhill, has this documentation happened yet?

I'm asking because we've since implemented a feature in our code to let librdkafka resolve "transient" errors on its own:

blendle/go-streamprocessor#96

But we're seeing some strange behavior. In particular, we're getting an error such as this:

```
Received transient error from Kafka. Ignoring.
<kafka-url>:9092/bootstrap: Disconnected (after 2999941ms in state UP)
```

And then the processor just stops for an hour or more before it starts working again (this last part is unclear; the processor might actually have been restarted before it started working again).

This suggests this specific error is not transient, and that we should act on it ourselves by stopping and restarting the processor.

Do you have any more insights into this?

@edenhill (Contributor)

Unless there is a configuration error, all errors should be considered temporary; i.e., a network failure will be fixed, and a cluster restart will soon make the brokers available again.
There should never be a need to restart the client to fix any of these issues.
However, there are always bugs, and you might have run into one of those.
In this case it could be confluentinc/librdkafka#2108.
