KAFKA-13114: Revert state and reregister raft listener #11116

Merged
hachikuji merged 6 commits into apache:trunk from jsancio:kafka-13114-cond-revert-renounce
Aug 1, 2021

Conversation

Member

@jsancio commented Jul 23, 2021

RaftClient's scheduleAppend may split the list of records into multiple
batches. This means that it is possible for the active controller to
see a committed offset for which it doesn't have an in-memory snapshot.

If the active controller needs to renounce and it is missing an
in-memory snapshot, then revert the state and reregister the Raft
listener. This will cause the controller to replay the entire metadata
partition.
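
To make the recovery rule concrete, here is a minimal, self-contained model of the decision the controller makes when it renounces. None of these type or method names come from Kafka; they are invented purely for illustration, and the real patch implements the "replay from the start" branch by reregistering the controller's Raft listener.

import java.util.TreeSet;

// Toy model of the renounce decision. In the real controller, "snapshots"
// are in-memory snapshots taken at certain committed offsets.
final class RenounceModel {
    private final TreeSet<Long> snapshotOffsets = new TreeSet<>();
    private boolean replayFromStart = false;

    // Record that an in-memory snapshot exists at this committed offset.
    void snapshotTaken(long offset) {
        snapshotOffsets.add(offset);
    }

    // Called when leadership is lost with the given last committed offset.
    void renounce(long lastCommittedOffset) {
        if (snapshotOffsets.contains(lastCommittedOffset)) {
            // Normal case: drop snapshots newer than the committed offset and
            // keep serving from the reverted in-memory state.
            snapshotOffsets.tailSet(lastCommittedOffset, false).clear();
        } else {
            // scheduleAppend split the records into several batches, so no
            // snapshot exists at this offset: discard all in-memory state and
            // replay the entire metadata partition from the beginning.
            snapshotOffsets.clear();
            replayFromStart = true;
        }
    }

    boolean mustReplayFromStart() {
        return replayFromStart;
    }
}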

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@jsancio force-pushed the kafka-13114-cond-revert-renounce branch from a23e1a3 to 6b70e48 on July 23, 2021 22:55
@jsancio force-pushed the kafka-13114-cond-revert-renounce branch from 6b70e48 to dda2b3d on July 24, 2021 17:27
@jsancio marked this pull request as ready for review on July 24, 2021 17:28
Contributor

@hachikuji left a comment

The change seems reasonable, but should we add test cases?

Member Author

@jsancio left a comment

> The change seems reasonable, but should we add test cases?

It wasn't clear to me how to do this with the current code but let me see if I can figure something out.

Contributor

@hachikuji commented

I'm trying to think of some approach for validating this logic. It is difficult because it is handling unexpected exceptions. One thought I had is implementing a poison message of some kind which could expire after some TTL. When the controller sees the poison message, it would check if it is still active and raise an exception accordingly. Something like that could be used in an integration test, which might be simpler than trying to induce a failure by mucking with internal state.

Another idea is to corrupt the log on one of the nodes, but I'm not sure this would hit the right path. In fact, this is probably a gap at the moment. If the batch reader fails during iteration, we should probably resign and perhaps even fail. I'll file a separate JIRA for this.

In any case, I think we should try to come up with some way to exercise this path. Otherwise it's hard to say if it even works (though it looks reasonable enough).
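
Purely as a sketch of the poison-record idea above (none of this exists in Kafka, and every name here is hypothetical): a record carrying a TTL could throw while it is still "live" and applied by an active controller, then become inert once the TTL expires so the test can observe recovery.

import java.time.Instant;

// Hypothetical poison record for an integration test; not part of Kafka.
final class PoisonRecord {
    private final Instant expiresAt;

    PoisonRecord(Instant expiresAt) {
        this.expiresAt = expiresAt;
    }

    // Invoked from the (hypothetical) replay path of the controller.
    void replay(boolean isActiveController, Instant now) {
        if (isActiveController && now.isBefore(expiresAt)) {
            // Simulates an unexpected failure while applying a committed record;
            // the active controller should renounce and recover. After the TTL
            // the record is ignored, so replay can eventually make progress.
            throw new IllegalStateException("Poison record applied by the active controller");
        }
    }
}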

Member Author

@jsancio left a comment

@hachikuji I added a test for this case. I had to update LocalLogManager to better match Raft's leader election pattern. It is still not perfect but it is good enough for this test.

    throw new IllegalStateException("The raft client was unable to allocate a buffer for an append");
} else if (offset == Long.MAX_VALUE) {
    throw new IllegalStateException("Unable to append records since this is not the leader");
}
Member Author

This is a partial fix until we merge #10909

.max();

if (firstOffset.isPresent() && resignAfterNonAtomicCommit.getAndSet(false)) {
// Emulate losing leadering in them middle of a non-atomic append by not writing
Contributor

nit: losing leadership?

Member Author

Fix.
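
As an aside, the getAndSet(false) call in the snippet above is what makes the injected fault fire exactly once. A standalone illustration of that pattern, using only the standard library:

import java.util.concurrent.atomic.AtomicBoolean;

public class OneShotFault {
    public static void main(String[] args) {
        AtomicBoolean resignAfterNonAtomicCommit = new AtomicBoolean(false);

        resignAfterNonAtomicCommit.set(true);  // arm the fault
        // The first append after arming sees true and injects the failure ...
        boolean firstAppend = resignAfterNonAtomicCommit.getAndSet(false);
        // ... and every later append sees false and behaves normally.
        boolean secondAppend = resignAfterNonAtomicCommit.getAndSet(false);

        System.out.println(firstAppend + " " + secondAppend);  // prints "true false"
    }
}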

String topicName = "topic-name";

try (LocalLogManagerTestEnv logEnv = new LocalLogManagerTestEnv(1, Optional.empty())) {
    try (QuorumControllerTestEnv controlEnv =
Contributor

nit: could we pull this into the first try?

Member Author

Cool. I didn't know that was valid Java.
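
For context on the reply above: a single try-with-resources statement can declare several resources, separated by semicolons, and they are closed in reverse declaration order. A plain standard-library example, unrelated to the Kafka test classes:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class MultiResourceTry {
    public static void main(String[] args) throws IOException {
        // Both resources live in one try statement; 'out' is closed before
        // 'in' when the block exits, even if an exception is thrown.
        try (FileInputStream in = new FileInputStream("input.bin");
             FileOutputStream out = new FileOutputStream("copy.bin")) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        }
    }
}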

);

// Wait for the new active controller
final QuorumController newController = controlEnv.activeController();
Contributor

This confused me a little bit since we are trying to verify that the state on the original controller resets properly. That is what is happening here since there is only one controller in the test, but it is obscured a little bit by the new variable. Maybe it would be clearer to use the original reference and write this as:

assertEquals(controller, controlEnv.activeController());

Also, is there an epoch or something we can bump to ensure the transition?

Member Author

Good idea. Fixed.

Comment on lines +691 to +696
// Wait for the controller to become active again
assertSame(controller, controlEnv.activeController());
assertTrue(
    oldClaimEpoch < controller.curClaimEpoch(),
    String.format("oldClaimEpoch = %s, newClaimEpoch = %s", oldClaimEpoch, controller.curClaimEpoch())
);
Member Author

Only this should have changed. The rest are indentation changes from the previous commit.

Contributor

@hachikuji left a comment

LGTM. One minor suggestion.

…r.java

Co-authored-by: Jason Gustafson <jason@confluent.io>
@hachikuji merged commit 4efd9bf into apache:trunk on Aug 1, 2021
hachikuji pushed a commit that referenced this pull request Aug 1, 2021
RaftClient's scheduleAppend may split the list of records into multiple
batches. This means that it is possible for the active controller to
see a committed offset for which it doesn't have an in-memory snapshot.

If the active controller needs to renounce and it is missing an
in-memory snapshot, then revert the state and reregister the Raft
listener. This will cause the controller to replay the entire metadata
partition.

Reviewers: Jason Gustafson <jason@confluent.io>
@jsancio deleted the kafka-13114-cond-revert-renounce branch on August 11, 2021 17:24
xdgrulez pushed a commit to xdgrulez/kafka that referenced this pull request Dec 22, 2021