@k-raina
Contributor

@k-raina k-raina commented Apr 21, 2025

Problem

  • Currently, when a transactional producer encounters a retriable error
    (like COORDINATOR_LOAD_IN_PROGRESS) and exhausts all retries, it
    returns the retriable error to the application layer.

  • Application-level retries can then cause duplicate records. As a fix, we
    are transitioning all retriable errors to abortable errors in the
    transactional producer path.

  • Additionally, added handling of InvalidTxnStateException as part of
    https://issues.apache.org/jira/browse/KAFKA-19177

Solution

  • Modified the TransactionManager to automatically transition retriable
    errors to abortable errors after all retries are exhausted. This ensures
    that applications can abort the transaction when they encounter
    TransactionAbortableException.

  • A RefreshRetriableException such as CoordinatorNotAvailableException
    is refreshed internally
    [code]
    until retries are exhausted; it is then treated like any other retriable
    error and translated to TransactionAbortableException.

  • Similarly for InvalidTxnStateException
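
For illustration only, the translation described in this list can be sketched as below. The nested exception classes are hypothetical stand-ins for the Kafka types named above, and `translateAfterRetriesExhausted` is an assumed helper name, not the actual `TransactionManager` method changed in this PR.

```java
public class RetriableToAbortableSketch {

    // Hypothetical stand-ins for the Kafka exception types named above;
    // the real classes live in org.apache.kafka.common.errors.
    static class RetriableException extends RuntimeException {
        RetriableException(String message) { super(message); }
    }

    static class InvalidTxnStateException extends RuntimeException {
        InvalidTxnStateException(String message) { super(message); }
    }

    static class TransactionAbortableException extends RuntimeException {
        TransactionAbortableException(String message, Throwable cause) { super(message, cause); }
    }

    // Assumed helper (not the actual TransactionManager method): meant to be
    // called only after the sender has used up its retries for the request.
    static RuntimeException translateAfterRetriesExhausted(RuntimeException error) {
        if (error == null)
            throw new IllegalArgumentException("Cannot translate a null exception");

        if (error instanceof RetriableException || error instanceof InvalidTxnStateException) {
            // Keep the original error as the cause so it is still visible to callers.
            return new TransactionAbortableException(
                "Retries exhausted; the transaction must be aborted", error);
        }
        return error; // other errors pass through unchanged
    }

    public static void main(String[] args) {
        RuntimeException translated =
            translateAfterRetriesExhausted(new RetriableException("coordinator load in progress"));
        System.out.println(translated.getClass().getSimpleName());
    }
}
```

Keeping the original exception as the cause is a choice made for this sketch; whether the real change preserves it is not stated in this description.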

Testing

Added test testSenderShouldTransitionToAbortableAfterRetriesExhausted
to verify, in the sender thread, that:

  • Retriable errors are properly converted to abortable state after
    retries
  • Transaction state transitions correctly and subsequent operations fail
    appropriately with TransactionAbortableException
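
The two properties checked above can be modelled with a toy state machine. This is a rough sketch under stated assumptions: `AbortableStateSketch`, its `State` enum, and its method names are all hypothetical and do not mirror the real `TransactionManager` or the actual test.

```java
public class AbortableStateSketch {

    // Hypothetical, simplified states; the real TransactionManager has more.
    enum State { IN_TRANSACTION, ABORTABLE_ERROR, READY }

    // Stand-in for org.apache.kafka.common.errors.TransactionAbortableException.
    static class TransactionAbortableException extends RuntimeException {
        TransactionAbortableException(String message, Throwable cause) { super(message, cause); }
    }

    private State state = State.IN_TRANSACTION;
    private RuntimeException lastError;

    // In this sketch, invoked once the sender has exhausted its retries.
    void onRetriesExhausted(RuntimeException cause) {
        lastError = new TransactionAbortableException(
            "Retries exhausted; abort the transaction", cause);
        state = State.ABORTABLE_ERROR;
    }

    // Subsequent operations fail with the stored abortable error.
    void send(String record) {
        if (state == State.ABORTABLE_ERROR)
            throw lastError;
        if (state != State.IN_TRANSACTION)
            throw new IllegalStateException("No transaction in progress");
        // A real producer would enqueue the record here.
    }

    // Aborting clears the error so that a new transaction can be started.
    void abortTransaction() {
        lastError = null;
        state = State.READY;
    }

    State currentState() { return state; }

    public static void main(String[] args) {
        AbortableStateSketch txn = new AbortableStateSketch();
        txn.onRetriesExhausted(new RuntimeException("coordinator load in progress"));
        try {
            txn.send("record");
        } catch (TransactionAbortableException e) {
            txn.abortTransaction(); // the application aborts instead of retrying the send
        }
        System.out.println(txn.currentState());
    }
}
```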

Reviewers: Justine Olshan jolshan@confluent.io

@github-actions github-actions bot added triage PRs from the community producer clients small Small PRs labels Apr 21, 2025
@k-raina k-raina changed the title Update Transactional producer to translate retriable into abortable exxceptions KAFKA-19176: Update Transactional producer to translate retriable into abortable exceptions Apr 21, 2025
if (error == null)
throw new IllegalArgumentException("Cannot transition to " + target + " with a null exception");

if (error instanceof RetriableException) {
Member

Can we leave a comment that RetriableExceptions from the Sender thread should be translated to abortable?

@github-actions github-actions bot removed the triage PRs from the community label Apr 25, 2025
Member

@jolshan jolshan left a comment

Thanks!

@k-raina k-raina requested a review from jolshan May 20, 2025 17:48
@k-raina
Contributor Author

k-raina commented May 20, 2025

@jolshan Thanks for the review.
Added https://issues.apache.org/jira/browse/KAFKA-19177 to this PR in commit 1c73fd3

// RetriableExceptions from the Sender thread are converted to Abortable errors
// because they indicate that the transaction cannot be completed after all retry attempts.
// This conversion ensures the application layer treats these errors as abortable,
// preventing duplicate message delivery.
Member

maybe not a change we need now, but it isn't totally clear from the method name that this should only be called from the sender thread. Maybe we should refactor this in the future.

Contributor Author

@k-raina k-raina Jun 3, 2025

Agree.. Got me confused too.
Thanks for review

Member

@jolshan jolshan left a comment

Thanks!

@jolshan jolshan merged commit 8c71ab0 into apache:trunk Jun 3, 2025
27 checks passed
k-raina added a commit to k-raina/kafka that referenced this pull request Jun 17, 2025
…o abortable exceptions (apache#19522)

### Problem
- Currently, when a transactional producer encounters a retriable error
(like `COORDINATOR_LOAD_IN_PROGRESS`) and exhausts all retries, it
returns the retriable error to the application layer.
- Application-level retries can then cause duplicate records. As a fix, we
are transitioning all retriable errors to abortable errors in the
transactional producer path.

- Additionally, added handling of InvalidTxnStateException as part of
https://issues.apache.org/jira/browse/KAFKA-19177

### Solution
- Modified the TransactionManager to automatically transition retriable
errors to abortable errors after all retries are exhausted. This ensures
that applications can abort the transaction when they encounter
`TransactionAbortableException`.

- A `RefreshRetriableException` such as `CoordinatorNotAvailableException`
is refreshed internally
[[code](https://github.com/k-raina/kafka/blob/6c26595ce3d1608ae98ad4958b2ff8776a025fc3/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1702-L1705)]
until retries are exhausted; it is then treated like any other retriable
error and translated to `TransactionAbortableException`.

- Similarly for InvalidTxnStateException

### Testing
Added test `testSenderShouldTransitionToAbortableAfterRetriesExhausted`
to verify, in the sender thread, that:
- Retriable errors are properly converted to abortable state after
retries
- Transaction state transitions correctly and subsequent operations fail
appropriately with TransactionAbortableException

Reviewers: Justine Olshan <jolshan@confluent.io>