-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Producer message reliability
This page outlines common producer failure status and how they are handled.
The following cases are handled identically in librdkafka:
- entire cluster is down
- no leader for partition
- no active connection to the partition leader
In all these cases the message(s) remain in the partition's queue waiting for an active connection to the partition leader. The client will perform metadata refreshes at regular intervals to check if the leader has changed, and it will also try to reconnect (forever) to any brokers that are down.
When the producer sends a ProduceRequest to the broker it will put the message(s) on the wait-response queue (waitresp). The ProduceRequest's protocol timeout is set to the timeout of the first message in the ProduceRequest batch, i.e., the oldest message.
(The message timeout is based on message.timeout.ms
, the timeout scanner runs roughly once per second (sub-second timeouts are thus meaningless)).
If no response is received from the broker within the request timeout the request fails and the message(s) in the request are failed, the RD_KAFKA_RESP_ERR__TIMED_OUT
error will be propagated through the delivery report. No retries will be made at this point since the message.timeout.ms
has been reached.
On the other hand, if the connection is closed before a response is received and before the timeout hits, then metadata is refreshed (to find out if there is a new partition leader that will take over from the down broker) and the messages are put back on the partition queue for a future retransmission.
Do note though that this retransmission does not increment the retry count; retries are incremented for temporary server-side errors, not connection losses (which might just be a sign of network instability).
For temporary errors returned by the broker, such asERR_REQUEST_TIMED_OUT
, the request is retried in its entirety and the retry counter is incremented by one.