transactional producer hangs on commit (internal flush) leading to "No PID available (idempotence state WaitPID)" #3041
librdkafka logs
Hi @edenhill, did you get a chance to look at this issue? I've seen multiple problems caused by it, even with only idempotence enabled (without transactions). There's nothing in the broker logs that suggests anything useful.
I could capture some more details about the issue, this time while attempting to abort the transaction. The application hangs while doing a "CommitAndStartNewTransaction()" operation. The actual problem is in flushing the messages. Below is the stack trace:
Problem: based on the above stack trace, it looks like the producer is waiting for acks for enqueued messages, but it never gets signaled on the rkq_cond. Can you think of any reason why it would fail to get signaled? I checked the broker logs but couldn't see any errors on any of the brokers. Now that the main thread is stuck as above, I see the following issues:
At this time the internal PID has become invalid.
But because the idempotent producer FSM has already invalidated the PID, an attempt to abort the transaction results in the following error:
So now I understand why I'm seeing "No PID available", but there are still a few questions I would like to understand and get your help on:
Looking forward to hearing back from you.
Hi @edenhill , Code that I changed on
Logs corresponding to this random occurrence:
The message and transaction timeouts are set to 3 minutes above. One peculiarity of my application is that the librdkafka archive library is built using gcc 4.8.5, while my application is built using clang 5.0.1. Based on this info, could you suggest pointers as to what might be going wrong here?
hey @edenhill I further narrowed the problem to this block of code
In my case I couldn't understand why the rktp registration would be scheduled only when
To answer your questions in #3041 (comment)
If you pass an infinite timeout it will be used in two different ways: the infinite timeout is used for the internal call to flush() (since the flush happens first), while the internal API timeout is adjusted to your transaction.timeout.ms.

It will block on the flush() until all the messages have been delivered or failed, or the timeout hits, whichever comes first. In the case of a timeout, a retriable error is returned and it is okay to call commit_transaction() again.

I managed to reproduce this by setting the partition leaders to an unavailable broker (making the messages wait in queue) and then making the transaction coordinator unavailable just prior to calling commit_transaction(). This resulted in a myriad of timeouts that eventually bubbled up to the application.

One of the issues I found during this test was that messages that failed in the transmission queue, rather than on the request level, did not really flag the transaction as failed, and a commit_transaction() would have succeeded if the app did not take action in the delivery report (which it shouldn't need to do).

The second issue found was that an EndTxnRequest that failed (for broker or local reasons) could get stuck in endless retries. The fix is to only retry if the state (aborting|committing) allows it.
I think it should be fine with passing -1 to commit_transaction().
This is not an issue I've been able to reproduce.
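To make the retry advice above concrete, here is a minimal sketch of an application-side commit loop: retry on a retriable outcome, abort on an abortable one, stop on a fatal one. Everything in it is hypothetical and not part of librdkafka: the `txn_outcome` and `commit_result` enums, `commit_with_retry()`, and the scripted stand-in producer are illustration only. In a real application the outcome would presumably be derived from the `rd_kafka_error_t` returned by `rd_kafka_commit_transaction()`, using `rd_kafka_error_is_retriable()`, `rd_kafka_error_txn_requires_abort()` and `rd_kafka_error_is_fatal()`. Treating a failed abort as fatal is a simplification made for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of one commit/abort attempt. With librdkafka it
 * would be derived from the returned rd_kafka_error_t (assumption, see above). */
typedef enum { TXN_OK, TXN_RETRIABLE, TXN_ABORTABLE, TXN_FATAL } txn_outcome;

/* Final result reported to the caller of commit_with_retry(). */
typedef enum { COMMIT_DONE, COMMIT_ABORTED, COMMIT_GAVE_UP, COMMIT_FATAL } commit_result;

/* One commit or abort attempt; ctx is opaque producer state. */
typedef txn_outcome (*txn_op)(void *ctx);

/* Try to commit up to max_attempts times.
 * Retriable outcome (for example a timeout): call commit again.
 * Abortable outcome: abort the transaction instead of retrying the commit.
 * Fatal outcome: stop immediately. */
commit_result commit_with_retry(txn_op commit, txn_op abort_txn,
                                void *ctx, int max_attempts) {
    assert(commit != NULL && abort_txn != NULL && max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        switch (commit(ctx)) {
        case TXN_OK:
            return COMMIT_DONE;
        case TXN_RETRIABLE:
            continue; /* next iteration of the for loop */
        case TXN_ABORTABLE:
            /* Simplification: a failed abort is reported as fatal. */
            return abort_txn(ctx) == TXN_OK ? COMMIT_ABORTED : COMMIT_FATAL;
        case TXN_FATAL:
        default:
            return COMMIT_FATAL;
        }
    }
    return COMMIT_GAVE_UP; /* still retriable after max_attempts tries */
}

/* Scripted stand-in for a real producer, for demonstration only:
 * replays a fixed sequence of commit outcomes and counts aborts. */
struct script {
    txn_outcome commits[8];
    size_t n;    /* number of scripted outcomes */
    size_t pos;  /* next outcome to return */
    int aborts;  /* how many times abort was called */
};

static txn_outcome scripted_commit(void *ctx) {
    struct script *s = ctx;
    return s->pos < s->n ? s->commits[s->pos++] : TXN_FATAL;
}

static txn_outcome scripted_abort(void *ctx) {
    struct script *s = ctx;
    s->aborts++;
    return TXN_OK;
}
```

Bounding the number of attempts, rather than looping forever, keeps the application from hanging indefinitely if the retriable condition never clears; the bound itself is an arbitrary choice for the caller.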
…tes (#3041) Previously the transaction could hang on commit_transaction() if an abortable error was hit and the EndTxnRequest was to be retried.
Description
Producer attempted to update a transaction while another concurrent operation on the same transaction was ongoing
Aborting with uncaught exception: abort transaction failed error=No PID available
How to reproduce
It's quite random, I couldn't reproduce this issue on demand
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
v0.10.5
2.1.3
linger.ms=50
batch.num.messages=10000
compression.codec=snappy
batch.num.messages=100000
message.timeout.ms=600000
transaction.timeout.ms=600000
retry.backoff.ms=500
enable.idempotence=true
queue.buffering.max.messages=10000000
queue.buffering.max.kbytes=1048576
debug=eos,broker,msg,topic
Centos 7.6
debug=eos,broker,msg,topic
Nothing exceptional in the broker logs.