KAFKA-16563: retry pollEvent in KRaftMigrationDriver for retriable errors #15732

Merged
showuon merged 4 commits into apache:trunk from showuon:KAFKA-16563
Apr 29, 2024

@showuon (Member) commented Apr 16, 2024

While running a ZK-to-KRaft migration, we hit an issue where the migration hangs and the ZkMigrationState cannot move to the MIGRATION state. The cause is that the pollEvent does not retry on the retriable MigrationClientException (ZK client retriable errors) when it should. As a result, the poll event stops polling, which causes the KRaftMigrationDriver to hang. This PR fixes that and adds a test.

We could consider letting only the leader node create the znode to avoid this conflict, but that would be a separate improvement.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@showuon (Member Author) commented Apr 16, 2024

@cmccabe @mumrah, requesting your review. Thanks.

Comment on lines 394 to 399
// Use no-op handler by default because the handleException will be overridden if needed
private Consumer<Throwable> retryHandler = NO_OP_HANDLER;

public void retryHandler(Consumer<Throwable> retryHandler) {
    this.retryHandler = retryHandler;
}

Member

Did you consider simply defining an empty public void retryHandler(Throwable thrown) that PollEvent can override?

Member

+1 to this style!
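
For readers following along, the style suggested here can be sketched roughly as below. This is only an illustration under assumptions: BaseEvent, PollLikeEvent, OtherEvent, and the log list are hypothetical names, not the actual classes in KRaftMigrationDriver. The base event declares an empty retryHandler(Throwable) hook, and only the event that wants to retry overrides it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names for illustration only; not the real KRaftMigrationDriver classes.
final class RetryHookSketch {
    // Records what happened so the behavior can be inspected.
    static final List<String> log = new ArrayList<>();

    abstract static class BaseEvent {
        abstract void run() throws Exception;

        // Empty by default: most events do nothing extra on a failure.
        void retryHandler(Throwable thrown) { }

        // Runs the event and routes any failure to the hook.
        final void execute() {
            try {
                run();
            } catch (Exception e) {
                log.add("failed: " + e.getMessage());
                retryHandler(e);
            }
        }
    }

    // Only this event opts in to retrying by overriding the hook.
    static final class PollLikeEvent extends BaseEvent {
        @Override
        void run() throws Exception {
            throw new IllegalStateException("zk unavailable");
        }

        @Override
        void retryHandler(Throwable thrown) {
            log.add("rescheduled after: " + thrown.getMessage());
        }
    }

    // This event keeps the default no-op hook, so a failure is only logged.
    static final class OtherEvent extends BaseEvent {
        @Override
        void run() throws Exception {
            throw new IllegalStateException("other failure");
        }
    }
}
```

Compared with a settable Consumer<Throwable> field, the override keeps the retry behavior next to the event that needs it and avoids mutable handler state.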

Member

Also, should we call wakeup (run the next poll ASAP) rather than scheduleDeferred if the exception is retryable?

Member Author

> Did you consider simply defining an empty public void retryHandler(Throwable thrown) that PollEvent can override?

Nice suggestion! Updated!

> Also, should we call wakeup (run the next poll ASAP) rather than scheduleDeferred if the exception is retryable?

Thanks for the suggestion. I don't think that's appropriate: if the retriable error takes some time to resolve (e.g., a ZK connection issue), the pollEvent would be invoked, and keep retrying, many times in a short period.
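
The concern above can be made concrete with a small simulation. This is a sketch on a simulated clock; the outage length, per-attempt cost, and retry delay are illustrative assumptions and not Kafka defaults, and attemptsDuringOutage is a made-up helper.

```java
// A simulated clock; all numbers below are illustrative assumptions, not Kafka defaults.
final class RetryPacingSketch {
    /**
     * Counts how many attempts are made until the simulated outage ends.
     *
     * @param outageMs     how long the dependency stays unavailable
     * @param attemptMs    how long a single failed attempt takes (must be > 0)
     * @param retryDelayMs pause before the next attempt (0 models an immediate wakeup)
     */
    static long attemptsDuringOutage(long outageMs, long attemptMs, long retryDelayMs) {
        long now = 0;
        long attempts = 0;
        while (now < outageMs) {
            attempts++;                      // this attempt fails: the outage is still ongoing
            now += attemptMs + retryDelayMs; // time spent failing plus the configured pause
        }
        return attempts;
    }

    public static void main(String[] args) {
        System.out.println("immediate: " + attemptsDuringOutage(10_000, 1, 0));
        System.out.println("deferred:  " + attemptsDuringOutage(10_000, 1, 1_000));
    }
}
```

With these assumed numbers, a 10-second outage costs 10,000 attempts when retrying immediately and 10 attempts when each retry is deferred by one second.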

    KRaftMigrationDriver.this.faultHandler.handleFault("Encountered ZooKeeper authentication in " + this, e);
} else if (e instanceof MigrationClientException) {
    log.info(String.format("Encountered ZooKeeper error during event %s. Will retry.", this), e.getCause());
    retryHandler();

Member

I think a retry already exists for every state except UNINITIALIZED, because UNINITIALIZED is not handled by a separate event. For the other states, PollEvent puts a do-something event plus one deferred PollEvent on the queue, so the deferred PollEvent is effectively the "retry".

My question is: why don't we handle UNINITIALIZED with a separate event? If we move recoverMigrationStateFromZK into its own event, we don't need to add the extra retryHandler.

Member

Also, the solution in this PR has a side effect: we will enqueue two PollEvents if a MigrationClientException happens in any other migrationState.

@showuon (Member Author), Apr 17, 2024

> My question is: why don't we handle UNINITIALIZED with a separate event? If we move recoverMigrationStateFromZK into its own event, we don't need to add the extra retryHandler.

That's a good question, @chia7712! Let me think about it.

> Also, the solution in this PR has a side effect: we will enqueue two PollEvents if a MigrationClientException happens in any other migrationState.

No. As you said above, the MigrationClientException retryHandler won't be triggered in the other migrationStates, because those are handled by other event handlers that are unrelated to pollEvent. And because the default retryHandler is a no-op, there is no extra retry for the other migrationStates. As long as pollEvent keeps polling, they will be retried later.

Member

> No. As you said above, the MigrationClientException retryHandler won't be triggered in the other migrationStates, because those are handled by other event handlers that are unrelated to pollEvent. And because the default retryHandler is a no-op, there is no extra retry for the other migrationStates. As long as pollEvent keeps polling, they will be retried later.

You are right :)

Member Author

@chia7712, I took your suggestion and added RecoverMigrationStateFromZKEvent so that we don't need to worry about retries anymore. I checked whether this change causes any side effects, and here is what I found:

  1. recoverMigrationStateFromZK is expected to run before the driver starts the state machine.
  2. recoverMigrationStateFromZK does the following:
    a. creates a ZNode for the migration and the initial migration state
    b. installs this class as a metadata publisher
    c. transitions to the INACTIVE state
  3. If recoverMigrationStateFromZK keeps failing, the log keeps reporting errors and the operation keeps being retried. Once it succeeds, the metadata publisher is installed, and onControllerChange and onMetadataUpdate are triggered to start the process. So turning recoverMigrationStateFromZK into an event does not affect anything, because all we need to do in this state is wait for operations (a), (b), and (c) to complete.

So, I'm +1 on this suggestion. Thank you.
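
The idea of running the recovery as its own event, with the recurring poll acting as the retry loop, can be modeled roughly as follows. This is a single-threaded sketch under assumptions: the class, the two-value State enum, the trace list, and the failure counter are hypothetical and greatly simplified compared with the real driver, and the deferred poll is modeled as a plain append.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, single-threaded model of the idea; not the real KRaftMigrationDriver.
final class RecoverEventSketch {
    enum State { UNINITIALIZED, INACTIVE }

    interface Event { void run() throws Exception; }

    final Deque<Event> queue = new ArrayDeque<>();
    final List<String> trace = new ArrayList<>();
    State state = State.UNINITIALIZED;
    int failuresLeft; // how many times the simulated ZK call fails before succeeding

    RecoverEventSketch(int failuresLeft) { this.failuresLeft = failuresLeft; }

    // Recovery runs as its own event; a failure simply propagates to the queue loop.
    final class RecoverEvent implements Event {
        @Override public void run() throws Exception {
            if (state != State.UNINITIALIZED) return; // a stale duplicate is a no-op
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new Exception("simulated retriable ZK error");
            }
            state = State.INACTIVE;
            trace.add("recovered");
        }
    }

    // While still UNINITIALIZED, each poll enqueues a recovery attempt and the next poll.
    final class PollEvent implements Event {
        @Override public void run() {
            trace.add("poll");
            if (state == State.UNINITIALIZED) {
                queue.addLast(new RecoverEvent());
                queue.addLast(new PollEvent()); // stands in for the deferred next poll
            }
            // In this sketch polling stops once INACTIVE so that the loop terminates.
        }
    }

    // Drains the queue; exceptions are recorded instead of stopping the loop.
    State drain() {
        queue.addLast(new PollEvent());
        while (!queue.isEmpty()) {
            Event event = queue.removeFirst();
            try {
                event.run();
            } catch (Exception ex) {
                trace.add("error");
            }
        }
        return state;
    }
}
```

With two simulated failures, the trace is poll, error, poll, error, poll, recovered, poll: no dedicated retry handler is needed, because every later poll re-enqueues the recovery until it succeeds.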

@chia7712 (Member) left a comment

@showuon, thanks for sharing your findings. I left some comments below.

However, I'm still not sure why #12998 did not use an event to recover the migration state from ZK. @mumrah @cmccabe, could you take a look and share your views? Thanks!

  switch (migrationState) {
      case UNINITIALIZED:
-         recoverMigrationStateFromZK();
+         eventQueue.append(new RecoverMigrationStateFromZKEvent());

Member

Should we use prepend to make sure this event is executed ASAP?

Member Author

No need. As I said in this comment, in the UNINITIALIZED state the only event we will receive is the pollEvent. We'll only receive the additional onControllerChange (KRaftLeaderEvent) and onMetadataUpdate (MetadataChangeEvent) after RecoverMigrationStateFromZKEvent completes. So we don't have to worry about the order at this point.
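
The ordering argument can be illustrated with a plain java.util.ArrayDeque standing in for the driver's event queue (an assumption; the real queue implementation is not shown in this thread). The order helper and the event names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// A plain ArrayDeque stands in for the driver's event queue here (an assumption).
final class AppendVsPrependSketch {
    // Returns the execution order after adding "recover" to the already pending events.
    static List<String> order(boolean prepend, String... pending) {
        Deque<String> queue = new ArrayDeque<>();
        for (String p : pending) {
            queue.addLast(p);
        }
        if (prepend) {
            queue.addFirst("recover"); // jumps ahead of whatever is pending
        } else {
            queue.addLast("recover");  // waits behind whatever is pending
        }
        return new ArrayList<>(queue);
    }
}
```

When nothing else is pending, which is the situation described for the UNINITIALIZED state, both calls yield the same order; they differ only once another event is already queued.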

Member

Are we allowing a race between the RecoverMigrationStateFromZKEvent and the next PollEvent scheduled after the switch? Maybe this could be more straightforward if we only schedule the next poll once RecoverMigrationStateFromZKEvent finishes, either normally or exceptionally? WDYT?

Member

On second thought, I don't think my question makes sense. The following PollEvent can only run after RecoverMigrationStateFromZKEvent finishes.

Comment on lines +947 to +952
// Wait until the driver has recovered MigrationState From ZK. This is to simulate the driver needs to be installed as the metadata publisher
// so that it can receive onControllerChange (KRaftLeaderEvent) and onMetadataUpdate (MetadataChangeEvent) events.
private void startAndWaitForRecoveringMigrationStateFromZK(KRaftMigrationDriver driver) throws InterruptedException {
    driver.start();
    TestUtils.waitForCondition(() -> driver.migrationState().get(1, TimeUnit.MINUTES).equals(MigrationDriverState.INACTIVE),
        "Waiting for KRaftMigrationDriver to enter INACTIVE state");

Member Author

This is necessary now because, in the test suite, we might invoke onControllerChange and append a KRaftLeaderEvent before the RecoverMigrationStateFromZKEvent is appended. This won't happen in practice, because the driver only registers itself as a metadata publisher, and so only starts receiving KRaftLeaderEvent and MetadataChangeEvent, after RecoverMigrationStateFromZKEvent has completed.
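
The wait-then-proceed pattern used by this helper can be sketched with a small polling loop. This is a hypothetical stand-in (WaitSketch.waitFor is made up for illustration), not Kafka's TestUtils.waitForCondition, and the 10 ms poll interval is an arbitrary assumption.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; not Kafka's TestUtils.waitForCondition.
final class WaitSketch {
    // Polls the condition until it holds or the timeout elapses.
    static void waitFor(BooleanSupplier condition, long timeoutMs, String message) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError("Timed out: " + message);
            }
            try {
                Thread.sleep(10); // arbitrary poll interval (assumption)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted: " + message);
            }
        }
    }
}
```

A test would call something like waitFor(() -> stateIsInactive(), 60_000, "driver enters INACTIVE") before triggering further events, where stateIsInactive is again a placeholder.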

Comment on lines 939 to +940
  TestUtils.waitForCondition(() -> driver.migrationState().get(1, TimeUnit.MINUTES).equals(MigrationDriverState.DUAL_WRITE),
-     "Waiting for KRaftMigrationDriver to enter ZK_MIGRATION state");
+     "Waiting for KRaftMigrationDriver to enter DUAL_WRITE state");

Member Author

Side fix.

@showuon (Member Author) commented Apr 22, 2024

@akhileshchg @mumrah @cmccabe, could you take a look when you are available? Thanks.

@showuon (Member Author) commented Apr 25, 2024

@akhileshchg @mumrah @cmccabe, we need your comments on this. Thanks.

@akhileshchg (Contributor) left a comment

LGTM

@Override
public void run() throws Exception {
    if (checkDriverState(MigrationDriverState.UNINITIALIZED, this)) {
        applyMigrationOperation("Recovering migration state from ZK", zkMigrationClient::getOrCreateMigrationRecoveryState);

Contributor

For my understanding, is this the line where the uncaught exception was thrown? Can we handle the exception more gracefully and log an error?

Member Author

Yes, this is where the uncaught exception was thrown. The exception is handled by the parent's MigrationEvent#handleException, where we log the error and, for fatal errors, call faultHandler.handleFault.

public void handleException(Throwable e) {
    if (e instanceof MigrationClientAuthException) {
        KRaftMigrationDriver.this.faultHandler.handleFault("Encountered ZooKeeper authentication in " + this, e);
    } else if (e instanceof MigrationClientException) {
        log.info(String.format("Encountered ZooKeeper error during event %s. Will retry.", this), e.getCause());
    } else if (e instanceof RejectedExecutionException) {
        log.debug("Not processing {} because the event queue is closed.", this);
    } else {
        KRaftMigrationDriver.this.faultHandler.handleFault("Unhandled error in " + this, e);
    }
}

Thanks.

@showuon (Member Author) commented Apr 29, 2024

@akhileshchg, thanks for the review and the approval. Just curious:

> However, I'm still not sure why #12998 did not use an event to recover the migration state from ZK.

Was there any special reason for it?

@showuon (Member Author) commented Apr 29, 2024

@soarez @chia7712, since the original author @akhileshchg has reviewed and approved, do you have any other comments?

@chia7712 (Member) left a comment

LGTM

@soarez (Member) left a comment

LGTM thanks @showuon

@showuon showuon merged commit ec151c8 into apache:trunk Apr 29, 2024

@showuon (Member Author) commented Apr 29, 2024

Thanks all for the review!

showuon added a commit that referenced this pull request Apr 29, 2024
…rors (#15732)

When running ZK migrating to KRaft process, we encountered an issue that the migrating is hanging and the ZkMigrationState cannot move to MIGRATION state. And it is because the pollEvent didn't retry with the retriable MigrationClientException (ZK client retriable errors) while it should. This PR fixes it and add test. And because of this, the poll event will not poll anymore, which causes the KRaftMigrationDriver hanging.

Reviewers: Luke Chen <showuon@gmail.com>, Igor Soarez<soarez@apple.com>, Akhilesh C <akhileshchg@users.noreply.github.com>
gongxuanzhang pushed a commit to gongxuanzhang/kafka that referenced this pull request Jun 12, 2024
…rors (apache#15732)