Skip to content

Conversation

@rajinisivaram
Copy link
Contributor

KIP-699 added support for batching in FindCoordinatorRequest using a new protocol that changes the wire format for both batched and unbatched requests. Clients were updated to try the new format first and switch irreversibly to the old format if the new format is not supported on one broker. During rolling upgrade (or a downgrade), it is possible that a broker doesn't support new format at some point while other brokers do at a later point. Clients end up in a bad state until restarted since use new version with old format. This PR changes FindCoordinatorRequest to set data based on actual version when a single group is used. This is always the case for consumer coordinator and transaction coordinator. For admin API, we still switch to unbatched mode on failure, but the data is set based on actual version, so we never fail even if brokers are upgraded/downgraded.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@rajinisivaram rajinisivaram requested review from dajac and mimaison July 3, 2021 10:33
@rajinisivaram rajinisivaram force-pushed the KAFKA-13029-coordinator-batch branch from ba2bee9 to 59adf86 Compare July 3, 2021 12:16
Copy link
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rajinisivaram , thanks for the PR. Nice tests! Left some comments.


public class FindCoordinatorRequest extends AbstractRequest {

public static final short MIN_BATCHED_VERSION = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactor!

if (version < MIN_BATCHED_VERSION) {
if (batchedKeys > 1)
throw new NoBatchedFindCoordinatorsException("Cannot create a v" + version + " FindCoordinator request " +
"because we require features supported only in 4 or later.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "because we require features supported only in" + MIN_BATCHED_VERSION + "or later.");

abstract AbstractRequest.Builder<?> requestBuilder();

abstract void handleResponse(AbstractResponse responseBody);
abstract void handleResponse(AbstractResponse responseBody, short requestVersion);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add a comment or java doc here to describe why requestVersion is necessary here.

}
FindCoordinatorResponse response = (FindCoordinatorResponse) abstractResponse;
if (batch) {
if (!response.data().coordinators().isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a comment here to explain what case it is (ex: batch response case)

@rajinisivaram
Copy link
Contributor Author

@showuon Thanks for the review, addressed comments.

Copy link
Member

@mimaison mimaison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rajinisivaram for this fix!

Copy link
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rajinisivaram thanks for this nice patch. This patch also resolves the slow poll when requesting to older broker (#10963). a couple of comments below. PTAL

} else {
data.setKey(this.rebalanceConfig.groupId);
}
data.setKey(this.rebalanceConfig.groupId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line can be merged. for example:

        FindCoordinatorRequestData data = new FindCoordinatorRequestData()
                .setKeyType(CoordinatorType.GROUP.id())
                .setKey(this.rebalanceConfig.groupId);

FindCoordinatorResponse response = (FindCoordinatorResponse) abstractResponse;
if (batch) {
// Handle response based on whether batched version is used by checking `coordinators()` in the response data
if (!response.data().coordinators().isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of consumer coordinator and transaction coordinator parse response according to version. By contrast, it check coordinators. Could we unify the behavior? Could consumer coordinator and transaction coordinator check coordinators instead? It seems checking coordinators can simplify the code as we don't need to pass request version to each handler.

} else {
data.setKey(coordinatorKey);
}
data.setKey(coordinatorKey);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto (code style)

Copy link
Member

@dajac dajac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rajinisivaram Thanks for the fix! I left few suggestions.

public void onSuccess(ClientResponse resp, RequestFuture<Void> future) {
log.debug("Received FindCoordinator response {}", resp);

boolean batch = resp.requestHeader().apiVersion() >= FindCoordinatorRequest.MIN_BATCHED_VERSION;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding a coordinators method to FindCoordinatorResponse which would either return the list of coordinators (data.coordinators()) if not empty or would return a list containing a Coordinator created from the top level information. That would remove all the batch checks below.

public void handleResponse(AbstractResponse response, short requestVersion) {
FindCoordinatorResponse findCoordinatorResponse = (FindCoordinatorResponse) response;
CoordinatorType coordinatorType = CoordinatorType.forId(builder.data().keyType());
boolean batchFindCoordinator = requestVersion >= FindCoordinatorRequest.MIN_BATCHED_VERSION;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above suggestion would also us to avoid having to pass the requestVersion down here.

env.kafkaClient().setNodeApiVersions(
NodeApiVersions.create(Arrays.asList(findCoordinatorV3, describeGroups)));

// dummy response for MockClient to handle the UnsupportedVersionException correctly to switch from batched to un-batched
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: dummy -> Dummy?


// dummy response for MockClient to handle the UnsupportedVersionException correctly to switch from batched to un-batched
env.kafkaClient().prepareResponse(null);
//Retriable FindCoordinatorResponse errors should be retried
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Should we add a space before Retriable?

@rajinisivaram
Copy link
Contributor Author

@chia7712 @dajac Thanks for the reviews, addressed comments.

Copy link
Member

@dajac dajac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, assuming tests pass. Thanks!

Copy link
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rajinisivaram
Copy link
Contributor Author

@showuon @chia7712 @dajac Thanks for the reviews. Test failure (RaftClusterTest.testCreateClusterAndCreateAndManyTopics) not related. Merging to trunk.

@rajinisivaram rajinisivaram merged commit 2b6365c into apache:trunk Jul 5, 2021
xdgrulez pushed a commit to xdgrulez/kafka that referenced this pull request Dec 22, 2021
…on version (apache#10965)

KIP-699 added support for batching in FindCoordinatorRequest using a new protocol that changes the wire format for both batched and unbatched requests. Clients were updated to try the new format first and switch irreversibly to the old format if the new format is not supported on one broker. During rolling upgrade (or a downgrade), it is possible that a broker doesn't support new format at some point while other brokers do at a later point. Clients end up in a bad state until restarted since they use new version with old format. This PR changes FindCoordinatorRequest to set data based on actual version when a single group is used. This is always the case for consumer coordinator and transaction manager. For admin API, we still switch to unbatched mode on failure, but the data is set based on actual version, so we never fail even if brokers are upgraded/downgraded.

Reviewers: Luke Chen <showuon@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>, David Jacot <djacot@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants