Topic+partition specific produce errors from the broker not properly handled #40
Produce errors were previously only handled if the entire request failed.
Thank you for reporting this; it should now be fixed in master. Could you please update your librdkafka and verify this fix?
It's working in a standalone program. However, it's not working when I run it through my 'real' program (the one that statically links with librdkafka.a). Not sure why yet. In that case, it just hangs and an starve complains about a futex timing out?
Could you try running it in gdb when it hangs and provide the output of:
You can mask out your program's traces.
All say poll() except one that says pthread_cond_timedwait.
On Monday, December 30, 2013, Magnus Edenhill wrote:
Okay, the "futex timing" complaint, what prints that?
strace (sorry for the typo before, I'm writing via mobile phone)
Actually, I'm able to reproduce it now with the standalone program as well. If the max message size on the broker is 4000000 and I create a message that's 4000000, it fails (as designed). If I try 4000001, it hangs. And actually, even if the broker max is 1000000, it looks like 4000001 will also make it hang. So maybe it just can't handle > 4000000 at all... I'm also not sure about the math, because in my standalone program I'm not using a key, and yet it will complain about 3999999 being too large.
The message size restriction on the broker might include the header size, which adds a few more bytes. Can you reproduce this with rdkafka_performance?:
I can't reproduce the exact issue with rdkafka_performance, but I am definitely able to reproduce strange behavior, which I'll detail below. I think one difference may be that I'm using 'COPY' and you are using 'FREE' on the produce. But here's the weird behavior with the performance program, which may highlight to you an underlying problem that affects both.
The max msg size on the broker in question is 1000000. You will see that numbers larger than 4000000 (so 4000001 and 5000000) both show successful when they should not... There is something about > 4000000 that behaves funny.
./rdkafka_performance -b kafkadevcluster1-1.aim.services.masked.com:5757,kafkadevcluster1-2.aim757,kafkadevcluster1-3.aim.services.masked.com:5757 -t LaraReplicator_kafkacluster3 -P -c 1 -s 999999 | tail -1
This is a problem with rdkafka_performance not counting produce() errors as failed messages:
% Sending 1 messages of size 9000000 bytes
The produce() call fails (returns -1) when the message size is larger than the LOCALLY configured message.max.bytes value (which defaults to 4000000). This does not indicate an error on the librdkafka side of things, though.
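To make the local size check concrete, here is a minimal sketch of the behavior described in the comment above. It is a stand-in, not librdkafka code: `mock_local_size_check` and `DEFAULT_MESSAGE_MAX_BYTES` are hypothetical names, and the use of EMSGSIZE is an assumption based on the errno value 90 reported later in this thread (90 is EMSGSIZE on Linux).

```c
#include <errno.h>
#include <stddef.h>

/* Default quoted in this thread for the locally configured message.max.bytes. */
#define DEFAULT_MESSAGE_MAX_BYTES 4000000

/* Hypothetical stand-in for the local check; NOT the librdkafka implementation.
 * Returns 0 if a payload of `len` bytes would be accepted for queuing,
 * or -1 with errno set to EMSGSIZE if it exceeds the local limit. */
static int mock_local_size_check(size_t len, size_t message_max_bytes) {
    if (len > message_max_bytes) {
        errno = EMSGSIZE; /* assumed error code, see lead-in */
        return -1;
    }
    return 0;
}
```

Under these assumptions a 4000000-byte payload passes the local check and is left for the broker to accept or reject, while a 4000001-byte payload is refused immediately and never reaches the delivery callback.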
OK, but the message delivery callback doesn't get called... I would have expected it to be called with an error code.
It will fail directly for messages that stand no chance of delivery, see here: https://github.com/edenhill/librdkafka/blob/master/rdkafka.h#L649
OK for now, though it feels a bit inconsistent from a user perspective (some cases via callback and some right after calling). Should I be calling rd_kafka_err2str on the errno? It doesn't seem to be able to translate it. In this case the errno is set to 90.
It might seem inconsistent to have two different error reporting facilities, but it allows the application to take action immediately. errno holds the standard system error codes, so use strerror() to translate it. One could argue that produce() should return the rd_kafka_resp_err_t codes, which would be more consistent, but it would break existing applications at this point.
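As a rough illustration of the strerror() advice, the sketch below formats an immediate produce() failure from a saved errno value. The helper name `describe_produce_failure` is hypothetical and is not part of librdkafka; it only uses standard C library calls.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not a librdkafka function.
 * Writes a readable description of an immediate produce() failure into
 * `buf`, using the errno value the caller saved right after produce()
 * returned -1. Returns `buf` for convenience. */
static const char *describe_produce_failure(int saved_errno, char *buf, size_t size) {
    /* strerror() handles system error codes such as EMSGSIZE;
     * rd_kafka_err2str() expects rd_kafka_resp_err_t values instead. */
    snprintf(buf, size, "produce() failed: %s (errno %d)",
             strerror(saved_errno), saved_errno);
    return buf;
}
```

An application would copy errno into a local variable immediately after the failing call, before any other library call can overwrite it, and pass that copy to the helper. Errors reported through the delivery callback would still go through rd_kafka_err2str().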
As long as I can reliably detect errors (which I'll do using both methods), I'm good. Thanks as always for the fast and useful responses. I guess you'll open a ticket for the minor tweak to the performance tester for error counting?
Saw that this issue was closed, but wanted to check whether you plan to enhance the performance binary to properly indicate errors that may have occurred. 'This is a problem with rdkafka_performance not counting produce() errors as failed messages: % Sending 1 messages of size 9000000 bytes'
This was fixed in:
OK, cool. Serves me right for not having tested it again before asking ;)
Well, I had to double check as well ;)
This may be a Kafka bug rather than a librdkafka bug, but if I send messages larger than the value, I get no indication of failure. I get successful delivery callbacks. My consumers simply receive no messages and the only way I knew was that the JMX console showed failedproduce counter increases. Any chance the API could provide an error? Even if the broker doesn't, not sure if the API knows the max and can locally complain?