KAFKA-16563: retry pollEvent in KRaftMigrationDriver for retriable errors#15732
KAFKA-16563: retry pollEvent in KRaftMigrationDriver for retriable errors#15732showuon merged 4 commits intoapache:trunkfrom
Conversation
| // Use no-op handler by default because the handleException will be overridden if needed | ||
| private Consumer<Throwable> retryHandler = NO_OP_HANDLER; | ||
|
|
||
| public void retryHandler(Consumer<Throwable> retryHandler) { | ||
| this.retryHandler = retryHandler; | ||
| } |
There was a problem hiding this comment.
Did you consider simply defining an empty public void retryHandler(Throwable thrown) that PollEvent can override?
There was a problem hiding this comment.
Also, should we call wakeup (run next poll ASAP) rather that scheduleDeferred if the exception is retryable?
There was a problem hiding this comment.
Did you consider simply defining an empty public void retryHandler(Throwable thrown) that PollEvent can override?
Nice suggestion! Updated!
Also, should we call wakeup (run next poll ASAP) rather that scheduleDeferred if the exception is retryable?
Thanks for the suggestion. I think that's not appropriate because if the retriable error needs some time to be fixed (ex: the ZK connection issue), the pollEvent will be invoked a lot of times (and keep retrying) in a short period of time.
| KRaftMigrationDriver.this.faultHandler.handleFault("Encountered ZooKeeper authentication in " + this, e); | ||
| } else if (e instanceof MigrationClientException) { | ||
| log.info(String.format("Encountered ZooKeeper error during event %s. Will retry.", this), e.getCause()); | ||
| retryHandler(); |
There was a problem hiding this comment.
I feel the retry is existent except for UNINITIALIZED since UNINITIALIZED is not running by another event. For other event type, PollEvent will put (do-something event + one deferred PollEvent) to the queue. It means the deferred PollEvent is the "retry".
My question is "why we did not handle UNINITIALIZED by another event"? If we move recoverMigrationStateFromZK to another event, we don't need to add extra retryHandler.
There was a problem hiding this comment.
Also, the solution offered by this PR has a side effect that we will put 2 PollEvent if the exception MigrationClientException happens in other migrationState
There was a problem hiding this comment.
My question is "why we did not handle UNINITIALIZED by another event"? If we move recoverMigrationStateFromZK to another event, we don't need to add extra retryHandler.
That's a good quesiton, @chia7712 ! Let me think about it.
Also, the solution offered by this PR has a side effect that we will put 2 PollEvent if the exception MigrationClientException happens in other migrationState
No, as you said above, the MigrationClientException retryHandler won't be triggered in other migrationState because they will be handled in other event handler, which is not related to pollEvent. And because the default retryHandler is no-op, there will be no retry for other migrationStates. As long as pollEvent is keep polling, they can be retried later.
There was a problem hiding this comment.
No, as you said above, the MigrationClientException retryHandler won't be triggered in other migrationState because they will be handled in other event handler, which is not related to pollEvent. And because the default retryHandler is no-op, there will be no retry for other migrationStates. As long as pollEvent is keep polling, they can be retried later.
you are right :)
There was a problem hiding this comment.
@chia7712 , I take your suggestion to add RecoverMigrationStateFromZKEvent so that we don't need to worry about retry anymore. I was checking if this change will cause any side effect, and here is my finding:
recoverMigrationStateFromZKis expected to run before the driver starts the state machine.- In the
recoverMigrationStateFromZK, we'll do these things:
a. create a ZNode for migration and initial migration state
b. install this class as a metadata publisher
c. transition to INACTIVE state - If this
recoverMigrationStateFromZKis keep failing, the log will keep outputting errors and keep retrying. Once it succeeds, the metadata publisher will be installed and theonControllerChangeandonMetadataUpdatewill be triggered to start the process. That means, if we changerecoverMigrationStateFromZKinto an event, it won't affect anything because what we need to do at this state is just waiting for the (a)(b)(c) operation completes.
So, I'm +1 with this suggestion. Thank you.
metadata/src/main/java/org/apache/kafka/metadata/migration/KRaftMigrationDriver.java
Outdated
Show resolved
Hide resolved
| switch (migrationState) { | ||
| case UNINITIALIZED: | ||
| recoverMigrationStateFromZK(); | ||
| eventQueue.append(new RecoverMigrationStateFromZKEvent()); |
There was a problem hiding this comment.
Should we use prepend to make sure this event is executed ASAP
There was a problem hiding this comment.
No need. Like I said in this comment, in the UNINITIALIZED state, the only event we will receive is the pollEvent. We'll receive additionalonControllerChange (KRaftLeaderEvent) and onMetadataUpdate (MetadataChangeEvent) after completing RecoverMigrationStateFromZKEvent. So, we don't have to worry about the order at this moment.
There was a problem hiding this comment.
Are we allowing a race between the RecoverMigrationStateFromZKEvent and the next PollEvent scheduled after the switch? Maybe this could be more straightforward if we only schedule the next poll once RecoverMigrationStateFromZKEvent finishes, either normally or exceptionally? WDYT?
There was a problem hiding this comment.
On second thought, I don't think my question makes sense. The following PollEvent can only after RecoverMigrationStateFromZKEvent finishes.
| // Wait until the driver has recovered MigrationState From ZK. This is to simulate the driver needs to be installed as the metadata publisher | ||
| // so that it can receive onControllerChange (KRaftLeaderEvent) and onMetadataUpdate (MetadataChangeEvent) events. | ||
| private void startAndWaitForRecoveringMigrationStateFromZK(KRaftMigrationDriver driver) throws InterruptedException { | ||
| driver.start(); | ||
| TestUtils.waitForCondition(() -> driver.migrationState().get(1, TimeUnit.MINUTES).equals(MigrationDriverState.INACTIVE), | ||
| "Waiting for KRaftMigrationDriver to enter INACTIVE state"); |
There was a problem hiding this comment.
This is necessary now because in the test suite, we might invoke onControllerChange to append KRaftLeaderEvent before the RecoverMigrationStateFromZKEvent is appended. This won't happen in practice because the driver needs to wait until RecoverMigrationStateFromZKEvent completed to register metadata publisher to receive KRaftLeaderEvent and MetadataChangeEvent.
| TestUtils.waitForCondition(() -> driver.migrationState().get(1, TimeUnit.MINUTES).equals(MigrationDriverState.DUAL_WRITE), | ||
| "Waiting for KRaftMigrationDriver to enter ZK_MIGRATION state"); | ||
| "Waiting for KRaftMigrationDriver to enter DUAL_WRITE state"); |
|
@akhileshchg @mumrah @cmccabe , could you take a look when available. Thanks. |
|
@akhileshchg @mumrah @cmccabe , we need your comment on this. Thanks. |
| @Override | ||
| public void run() throws Exception { | ||
| if (checkDriverState(MigrationDriverState.UNINITIALIZED, this)) { | ||
| applyMigrationOperation("Recovering migration state from ZK", zkMigrationClient::getOrCreateMigrationRecoveryState); |
There was a problem hiding this comment.
For my understanding, was this the line where uncaught exception is thrown? Can we handle the exception more gracefully and log and error?
There was a problem hiding this comment.
Yes, this is where the uncaught exception thrown. The exception will be handled by its parent
MigrationEvent#handleException, and we'll log error there, and even call the faultHandler.handleFault to handle fatal errors.
Thanks.
|
@akhileshchg , thanks for the review and the approval. But just curious:
Do we have any special reason for it? |
|
@soarez @chia7712 , since the original author @akhileshchg had reviewed and approved, do you have any other comments? |
|
Thanks all for the review! |
…rors (#15732) When running ZK migrating to KRaft process, we encountered an issue that the migrating is hanging and the ZkMigrationState cannot move to MIGRATION state. And it is because the pollEvent didn't retry with the retriable MigrationClientException (ZK client retriable errors) while it should. This PR fixes it and add test. And because of this, the poll event will not poll anymore, which causes the KRaftMigrationDriver hanging. Reviewers: Luke Chen <showuon@gmail.com>, Igor Soarez<soarez@apple.com>, Akhilesh C <akhileshchg@users.noreply.github.com>
…rors (apache#15732) When running ZK migrating to KRaft process, we encountered an issue that the migrating is hanging and the ZkMigrationState cannot move to MIGRATION state. And it is because the pollEvent didn't retry with the retriable MigrationClientException (ZK client retriable errors) while it should. This PR fixes it and add test. And because of this, the poll event will not poll anymore, which causes the KRaftMigrationDriver hanging. Reviewers: Luke Chen <showuon@gmail.com>, Igor Soarez<soarez@apple.com>, Akhilesh C <akhileshchg@users.noreply.github.com>
When running ZK migrating to KRaft process, we encountered an issue that the migrating is hanging and the
ZkMigrationStatecannot move toMIGRATIONstate. And it is because the pollEvent didn't retry with the retriableMigrationClientException(ZK client retriable errors) while it should. This PR fixes it and add test. And because of this, the poll event will not poll anymore, which causes the KRaftMigrationDriver hanging.We could consider to let the leader node do the znode creation only to avoid this conflict issue. But that will be another improvement.
Committer Checklist (excluded from commit message)