
Channel consumer example contradicts the documentation #161

Closed
vladaionescu opened this issue Mar 24, 2018 · 6 comments

@vladaionescu (Contributor)

Description

In the example, we can see that in case of an error, consumption is stopped (suggesting that an error is fatal): https://github.com/confluentinc/confluent-kafka-go/blob/master/examples/consumer_channel_example/consumer_channel_example.go#L83
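
For context, the pattern in question looks roughly like this (a simplified sketch, not a verbatim copy of the linked file; the broker address, group id, and topic are placeholders):

```go
package main

import (
	"fmt"
	"os"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":        "localhost:9092",
		"group.id":                 "example-group",
		"go.events.channel.enable": true, // deliver events on c.Events()
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	if err := c.SubscribeTopics([]string{"example-topic"}, nil); err != nil {
		panic(err)
	}

	run := true
	for run {
		switch e := (<-c.Events()).(type) {
		case *kafka.Message:
			fmt.Printf("Message on %s: %s\n", e.TopicPartition, string(e.Value))
		case kafka.Error:
			// This is the behaviour in question: the loop exits on any
			// error, i.e. every error is treated as fatal.
			fmt.Fprintf(os.Stderr, "Error: %v\n", e)
			run = false
		}
	}
}
```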

In the documentation we have the following explanation (emphasis mine):

Generic events for both Consumer and Producer

  • KafkaError - client (error codes are prefixed with _) or broker error. These errors are normally just informational since the client will try its best to automatically recover (eventually).

I have a few questions:

  1. Which one is correct?
  2. If some errors are informational while others are fatal, how should the user application distinguish between the two?
  3. In case of a fatal error, should the application just crash, or is it reasonable to attempt to close the consumer and create a new one to recover?

How to reproduce

N/A

Checklist

Please provide the following information:

  • confluent-kafka-go and librdkafka version (LibraryVersion()):
```toml
[[constraint]]
  name = "github.com/confluentinc/confluent-kafka-go"
  version = "=0.11.0"
```

librdkafka 0.11.1-r1 (https://pkgs.alpinelinux.org/package/v3.7/community/x86_64/librdkafka)

  • Broker version

0.11.0

@edenhill (Contributor)

These are well-warranted questions.

Some answers:

  1. The documentation is more correct: since the client will automatically try to recover from errors (i.e., there are no errors it permanently gives up on), the default behaviour of an app should be not to terminate on errors. The idea is that, for a properly configured client on a properly managed cluster, there should be no permanent errors; there will be temporary errors, but they will eventually get sorted out, either automatically or by human intervention.
  2. We're lagging behind here: we really need to provide a list of errors and how the application should handle them. Additionally, we should make the error reporting richer by providing a severity, to let an application automate some of these decisions. This work is on our backlog.
  3. This depends on the application's requirements: if a producer receives a "fatal" error, should the application wait for in-queue/in-flight messages to be transmitted or fail, e.g., by calling flush()? If a consumer receives the same, should it attempt to commit final offsets? If the answers are yes, then do proper termination and cleanup (see the sketch after this list). If not, maybe don't.
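
To make point 3 concrete, here is a minimal sketch of such a termination path, assuming the application has already decided to shut down; the helper name and the 15-second flush timeout are arbitrary choices, not part of the library:

```go
package app

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// shutdown performs the "proper termination and cleanup" from point 3:
// flush the producer, commit final consumer offsets, then close both.
func shutdown(p *kafka.Producer, c *kafka.Consumer) {
	// Wait up to 15s for in-queue/in-flight messages to be delivered or fail.
	if remaining := p.Flush(15 * 1000); remaining > 0 {
		fmt.Printf("%d message(s) still undelivered at shutdown\n", remaining)
	}
	p.Close()

	// Attempt to commit final offsets before closing the consumer.
	if _, err := c.Commit(); err != nil {
		fmt.Printf("final offset commit failed: %v\n", err)
	}
	if err := c.Close(); err != nil {
		fmt.Printf("consumer close failed: %v\n", err)
	}
}
```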

@vladaionescu (Contributor, Author)

Makes sense - thanks for the detailed explanations.

@JeanMertz commented Jun 7, 2018

Is there any news to report on this front? I'm really interested in knowing whether you have a list somewhere that tells us which errors are safe to ignore and which should be acted on by closing the consumer/producer and restarting the application.

In our processors, we now ignore these two errors (based on searching the issues in this repository, such as this one: #48 (comment)):

```go
kafka.ErrTransport
kafka.ErrAllBrokersDown
```

and (properly) close and restart the application on any other error, but I'm sure there are many more we should be adding to our "safe errors" list.
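
A sketch of that filtering approach, for illustration; the set of "safe" codes and the helper names here are this application's choices, not an official list from the library:

```go
package processor

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// isSafeError reports whether err carries one of the two codes we have
// chosen to treat as transient and ignore.
func isSafeError(err kafka.Error) bool {
	switch err.Code() {
	case kafka.ErrTransport, kafka.ErrAllBrokersDown:
		return true
	default:
		return false
	}
}

// handleError logs "safe" errors and reports whether the event loop
// should keep running; any other error triggers a clean shutdown.
func handleError(err kafka.Error) (keepRunning bool) {
	if isSafeError(err) {
		log.Printf("ignoring transient Kafka error: %v", err)
		return true
	}
	log.Printf("fatal Kafka error, shutting down: %v", err)
	return false
}
```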

@edenhill (Contributor)

@JeanMertz As a general rule, all errors are to be considered informational and temporary, given that your client and cluster are correctly configured. librdkafka will retry pretty much all operations.

But we will document the full set of error codes for each API so that users can make an educated choice about which errors they might consider permanent. This is on our backlog and scheduled for the fall.

@JeanMertz

@edenhill, has this documentation happened yet?

I'm asking because we've since implemented a feature in our code to let librdkafka resolve "transient" errors on its own:

blendle/go-streamprocessor#96

But we're seeing some strange behavior. In particular, we're getting an error such as this:

```
Received transient error from Kafka. Ignoring.
<kafka-url>:9092/bootstrap: Disconnected (after 2999941ms in state UP)
```

And then the processor just stops for an hour or more before it starts working again (this last part is unclear; the processor might actually have been restarted before it started working again).

This suggests this specific error is not transient, and that we should act on it ourselves by stopping and restarting the processor.

Do you have any more insights into this?

@edenhill (Contributor)

Unless there is a configuration error, all errors should be considered temporary; i.e., a network failure will be fixed, and a cluster restart will soon make the brokers available again.
There should never be a need to restart the client to fix any of these issues.
However, there are always bugs, and you might have run into one of those.
In this case it could be confluentinc/librdkafka#2108.
