
Topic+partition specific produce errors from the broker not properly handled #40

Closed
winbatch opened this issue Dec 27, 2013 · 19 comments

@winbatch

This may be a Kafka bug rather than a librdkafka bug, but if I send messages larger than the broker's configured maximum message size, I get no indication of failure: I get successful delivery callbacks. My consumers simply receive no messages, and the only way I knew was that the JMX console showed the failed-produce counter increasing. Any chance the API could provide an error? Even if the broker doesn't report one, I'm not sure if the API knows the max and can complain locally?

@ghost ghost assigned edenhill Dec 27, 2013
edenhill added a commit that referenced this issue Dec 27, 2013
Produce errors were previously only handled if the entire request failed.
@edenhill
Contributor

Thank you for reporting this; it should now be fixed in master.
If you have configured a delivery report callback (dr_cb), it will be called with err set to RD_KAFKA_RESP_ERR_MSG_SIZE_TOO_LARGE.

Could you please update your librdkafka and verify this fix?

@winbatch
Author

It's working in a standalone program. However, it's not working when I run it through my 'real' program (the one that statically links with librdkafka.a). Not sure why yet. In that case, it just hangs, and strace complains about a futex timing out?

@edenhill
Contributor

Could you try running it in gdb when it hangs and provide the output of:

thread apply all bt

You can mask out your program's traces

@winbatch
Author

All threads are in poll() except one, which is in pthread_cond_timedwait().


@edenhill
Contributor

Okay, the "futex timing" complaint, what prints that?

@winbatch
Author

strace (sorry for the typo before, I'm writing via mobile phone)


@winbatch
Author

Actually, I'm able to reproduce it now with the standalone program as well. If the max message size on the broker is 4000000 and if I create a message that's 4000000, it fails (as designed). If I try 4000001, it hangs. And actually, even if the broker max is 1000000, looks like 4000001 will also make it hang. So maybe it just can't handle > 4000000 at all...

I'm also not sure about the math: in my standalone program I'm not using a key, and yet it complains that 3999999 is too large.

@edenhill
Contributor

The message size restriction on the broker might include the message header, which adds a couple more bytes.

Can you reproduce this with rdkafka_performance?:

rdkafka_performance -P -t <topic> -s <size> -c 1 -b <brokeraddr>

@winbatch
Author

I can't reproduce the exact issue with rdkafka_performance, but I am definitely able to reproduce strange behavior, which I'll detail below. I think one difference may be that I'm using 'COPY' and you are using 'FREE' on the produce. But here's the weird behavior with the performance program, which may highlight to you an underlying problem that affects both. The max msg size on the broker in question is 1000000. You will see that numbers larger than 4000000 (so 4000001 and 5000000) both show successful when they should not... There is something about > 4000000 that behaves funny.

./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 999999 | tail -1
% 1 messages and 999999 bytes produced in 56ms: 17 msgs/s and 17.70 Mb/s, 1 messages failed, no compression
./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 1000000 | tail -1
% 1 messages and 1000000 bytes produced in 60ms: 16 msgs/s and 16.56 Mb/s, 1 messages failed, no compression
./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 4000000 | tail -1
% 1 messages and 4000000 bytes produced in 91ms: 10 msgs/s and 43.50 Mb/s, 1 messages failed, no compression
./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 4000001 | tail -1
% 1 messages and 4000001 bytes produced in 0ms: 1517 msgs/s and 6069.80 Mb/s, 0 messages failed, no compression
./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 800000 | tail -1
% 1 messages and 800000 bytes produced in 51ms: 19 msgs/s and 15.57 Mb/s, 0 messages failed, no compression
./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 5000000 | tail -1
% 1 messages and 5000000 bytes produced in 0ms: 5235 msgs/s and 26178 Mb/s, 0 messages failed, no compression

@edenhill
Contributor

This is a problem with rdkafka_performance not counting produce() errors as failed messages:

% Sending 1 messages of size 9000000 bytes
produce error: Message too long
% 1 messages and 9000000 bytes produced in 0ms: 47619 msgs/s and 428571.44 Mb/s, 0 messages failed, no compression

The produce() call fails (returns -1) when the message size is larger than the LOCALLY configured message.max.bytes value (which defaults to 4000000).

This does not indicate an error on the librdkafka side of things, though.

@winbatch
Author

ok, but the message delivery callback doesn't get called.... I would have expected it to be called with an error code.

@edenhill
Contributor

It will fail directly for messages that stand no chance of delivery; see here:

https://github.com/edenhill/librdkafka/blob/master/rdkafka.h#L649

* Returns 0 on success or -1 on error, in which case errno is set accordingly:
*   ENOBUFS  - maximum number of outstanding messages has been reached:
*              "queue.buffering.max.messages"
*   EMSGSIZE - message is larger than configured max size:
*              "message.max.bytes".

@winbatch
Author

OK for now, though it feels a bit inconsistent from a user perspective (some errors arrive via the callback and some right after calling).

Should I be calling rd_kafka_err2str on the errno? It doesn't seem to be able to translate it. In this case the errno is set to 90.

@edenhill
Contributor

It might seem inconsistent to have two different error reporting facilities, but it allows the application to take action immediately:
e.g., for ENOBUFS the application can propagate backpressure or spool the message to disk.

errno holds a standard system error code; use strerror().

One could argue that produce() should return the rd_kafka_resp_err_t codes, that would be more consistent, but it would break existing applications at this point.

@winbatch
Author

As long as I can reliably detect errors (which I'll do using both methods), I'm good. Thanks as always for the fast and useful responses.

I guess you'll open a ticket for the minor tweak to the performance tester for error counting?

@winbatch
Author

Saw that this issue was closed, but wanted to check whether you plan to enhance the performance binary to properly indicate errors that may have occurred.

'This is a problem with rdkafka_performance not counting produce() errors as failed messages:

% Sending 1 messages of size 9000000 bytes
produce error: Message too long
% 1 messages and 9000000 bytes produced in 0ms: 47619 msgs/s and 428571.44 Mb/s, 0 messages failed, no compression'

@edenhill
Contributor

This was fixed in:
commit d25d782
Date: Wed Jan 1 23:26:07 2014 +0100

rdkafka_performance: Count immediate produce() failures as failed messages too
$ ./rdkafka_performance -P -t onepart -p 1 -c 1 -s 1241241111
% Using random seed 1391083001
% Sending 1 messages of size 1241241111 bytes
produce error: Message too long
% 1 messages and 1241241111 bytes produced in 0ms: 27027 msgs/s and 33547056.00 Mb/s, 1 messages failed, no compression
All messages produced, now waiting for 1 deliveries
% 0 messages in outq
% 1 backpressures for 1 produce calls: 100.000% backpressure rate
% 1 messages and 1241241111 bytes produced in 0ms: 9708 msgs/s and 12050884.00 Mb/s, **1 messages failed**, no compression

@winbatch
Author

OK, cool. Serves me right for not having tested it again before asking ;)

@edenhill
Contributor

Well, I had to double check as well ;)
