
Topic+partition specific produce errors from the broker not properly handled #40

Closed
winbatch opened this issue Dec 27, 2013 · 19 comments

@winbatch

This may be a Kafka bug rather than a librdkafka bug, but if I send messages larger than the broker's configured maximum message size, I get no indication of failure: I get successful delivery callbacks. My consumers simply receive no messages, and the only way I knew was that the JMX console showed the failed-produce counter increasing. Any chance the API could provide an error? Even if the broker doesn't report one, I'm not sure if the API knows the max and can complain locally?

@ghost ghost assigned edenhill Dec 27, 2013
edenhill added a commit that referenced this issue Dec 27, 2013
Produce errors were previously only handled if the entire request failed.
@edenhill
Contributor

Thank you for reporting this; it should now be fixed in master.
If you have configured a delivery report callback (dr_cb), it will be called with err set to RD_KAFKA_RESP_ERR_MSG_SIZE_TOO_LARGE.

Could you please update your librdkafka and verify this fix?

@winbatch
Author

It's working in a standalone program. However, it's not working when I run it through my 'real' program (the one that statically links with librdkafka.a). Not sure why yet. In that case, it just hangs, and strace complains about a futex timing out?

@edenhill
Contributor

Could you try running it in gdb when it hangs and provide the output of:

thread apply all bt

You can mask out your program's traces

@winbatch
Author

All threads are in poll() except one, which is in pthread_cond_timedwait().


@edenhill
Contributor

Okay, the "futex timing" complaint, what prints that?

@winbatch
Author

strace (sorry for the typo before, I'm writing via mobile phone)


@winbatch
Author

Actually, I'm able to reproduce it now with the standalone program as well. If the max message size on the broker is 4000000 and if I create a message that's 4000000, it fails (as designed). If I try 4000001, it hangs. And actually, even if the broker max is 1000000, looks like 4000001 will also make it hang. So maybe it just can't handle > 4000000 at all...

I'm also not sure about the math: in my standalone program I'm not using a key, and yet it complains that 3999999 is too large.

@edenhill
Contributor

The message size restriction on the broker might include the message header, which adds a couple more bytes.

Can you reproduce this with rdkafka_performance?:

rdkafka_performance -P -t <topic> -s <size> -c 1 -b <brokeraddr>

@winbatch
Author

I can't reproduce the exact issue with rdkafka_performance, but I am definitely able to reproduce strange behavior, which I'll detail below. I think one difference may be that I'm using 'COPY' and you are using 'FREE' on the produce. But here's the weird behavior with the performance program, which may highlight to you an underlying problem that affects both. The max msg size on the broker in question is 1000000. You will see that numbers larger than 4000000 (so 4000001 and 5000000) both show successful when they should not... There is something about > 4000000 that behaves funny.

./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 999999 | tail -1
% 1 messages and 999999 bytes produced in 56ms: 17 msgs/s and 17.70 Mb/s, 1 messages failed, no compression
./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 1000000 | tail -1
% 1 messages and 1000000 bytes produced in 60ms: 16 msgs/s and 16.56 Mb/s, 1 messages failed, no compression
./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 4000000 | tail -1
% 1 messages and 4000000 bytes produced in 91ms: 10 msgs/s and 43.50 Mb/s, 1 messages failed, no compression
./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 4000001 | tail -1
% 1 messages and 4000001 bytes produced in 0ms: 1517 msgs/s and 6069.80 Mb/s, 0 messages failed, no compression
./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 800000 | tail -1
% 1 messages and 800000 bytes produced in 51ms: 19 msgs/s and 15.57 Mb/s, 0 messages failed, no compression
./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 5000000 | tail -1
% 1 messages and 5000000 bytes produced in 0ms: 5235 msgs/s and 26178 Mb/s, 0 messages failed, no compression

@edenhill
Contributor

This is a problem with rdkafka_performance not counting produce() errors as failed messages:

% Sending 1 messages of size 9000000 bytes
produce error: Message too long
% 1 messages and 9000000 bytes produced in 0ms: 47619 msgs/s and 428571.44 Mb/s, 0 messages failed, no compression

The produce() call fails (returns -1) when the message size is larger than the LOCALLY configured message.max.bytes value (which defaults to 4000000).

This does not indicate an error on the librdkafka side of things, though.

@winbatch
Author

ok, but the message delivery callback doesn't get called.... I would have expected it to be called with an error code.

@edenhill
Contributor

It will fail directly for messages that stand no chance of delivery; see here:

https://github.com/edenhill/librdkafka/blob/master/rdkafka.h#L649

* Returns 0 on success or -1 on error, in which case errno is set accordingly:
*   ENOBUFS  - maximum number of outstanding messages has been reached:
*              "queue.buffering.max.messages"
*   EMSGSIZE - message is larger than configured max size:
*              "message.max.bytes".

@winbatch
Author

OK for now, though it feels a bit inconsistent from a user perspective (some errors arrive via the callback and some right after calling).

Should I be calling rd_kafka_err2str on the errno? It doesn't seem to be able to translate it. In this case the errno is set to 90.

@edenhill
Contributor

It might seem inconsistent to have two different error reporting facilities, but it allows the application to take action immediately:
e.g., for ENOBUFS the application can propagate backpressure or spool the message to disk.

errno holds a standard system error code; use strerror().

One could argue that produce() should return the rd_kafka_resp_err_t codes, that would be more consistent, but it would break existing applications at this point.

@winbatch
Author

As long as I can reliably detect errors (which I'll do using both methods), I'm good. Thanks as always for the fast and useful responses.

I guess you'll open a ticket for the minor tweak to the performance tester for error counting?

@winbatch
Author

Saw that this issue was closed, but wanted to check whether you plan to enhance the performance binary to properly indicate errors that may have occurred.

'This is a problem with rdkafka_performance not counting produce() errors as failed messages:

% Sending 1 messages of size 9000000 bytes
produce error: Message too long
% 1 messages and 9000000 bytes produced in 0ms: 47619 msgs/s and 428571.44 Mb/s, 0 messages failed, no compression'

@edenhill
Contributor

This was fixed in:
commit d25d782
Date: Wed Jan 1 23:26:07 2014 +0100

rdkafka_performance: Count immediate produce() failures as failed messages too
$ ./rdkafka_performance -P -t onepart -p 1 -c 1 -s 1241241111
% Using random seed 1391083001
% Sending 1 messages of size 1241241111 bytes
produce error: Message too long
% 1 messages and 1241241111 bytes produced in 0ms: 27027 msgs/s and 33547056.00 Mb/s, 1 messages failed, no compression
All messages produced, now waiting for 1 deliveries
% 0 messages in outq
% 1 backpressures for 1 produce calls: 100.000% backpressure rate
% 1 messages and 1241241111 bytes produced in 0ms: 9708 msgs/s and 12050884.00 Mb/s, **1 messages failed**, no compression

@winbatch
Author

OK, cool. Serves me right for not having tested it again before asking ;)

@edenhill
Contributor

Well, I had to double check as well ;)
