Conversation

@mimaison (Member) commented May 21, 2021

This PR implements KIP-699

It updates FindCoordinator request and response to support resolving multiple coordinators at a time. If a broker does not support the new FindCoordinator version, clients can revert to the previous behaviour and use a request for each coordinator.
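
For illustration, a minimal sketch of what the batched request looks like from the client side. This is not taken from the PR diff; it assumes the generated accessors for the new CoordinatorKeys field follow the usual naming (setCoordinatorKeys) alongside the existing setKey/setKeyType.

    // Illustrative sketch, not from the PR diff: assumes the generated accessor
    // for the new CoordinatorKeys field is named setCoordinatorKeys.
    import java.util.Arrays;

    import org.apache.kafka.common.message.FindCoordinatorRequestData;
    import org.apache.kafka.common.requests.FindCoordinatorRequest;
    import org.apache.kafka.common.requests.FindCoordinatorRequest.CoordinatorType;

    public class BatchedFindCoordinatorSketch {
        public static void main(String[] args) {
            // v4+ (KIP-699): several coordinator keys resolved in a single request.
            FindCoordinatorRequestData batched = new FindCoordinatorRequestData()
                .setKeyType(CoordinatorType.GROUP.id())
                .setCoordinatorKeys(Arrays.asList("group-a", "group-b", "group-c"));
            FindCoordinatorRequest.Builder batchedBuilder = new FindCoordinatorRequest.Builder(batched);

            // v3 and below: one key per request, so against an older broker the
            // client falls back to one FindCoordinator request per coordinator.
            FindCoordinatorRequestData single = new FindCoordinatorRequestData()
                .setKeyType(CoordinatorType.GROUP.id())
                .setKey("group-a");
            FindCoordinatorRequest.Builder singleBuilder = new FindCoordinatorRequest.Builder(single);
        }
    }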

All methods in Admin that require looking up coordinators have been updated to use the new AdminApiDriver logic.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@skaundinya15 (Contributor)

@mimaison It's worth noting that there's already a PR to simplify the ListOffsets API here: #10467, so if that gets merged soon, you may need to rebase your implementation on top of it.

@mimaison (Member Author)

@skaundinya15 Yeah, we may have to rebase, but that shouldn't be an issue. I did not touch ListOffsets in this PR; I'm only updating methods that interact with coordinators.

@mimaison mimaison force-pushed the kip-699 branch 3 times, most recently from f14f506 to fb65fd3 Compare June 4, 2021 17:30
@mimaison mimaison marked this pull request as ready for review June 4, 2021 17:33
@mimaison (Member Author) commented Jun 4, 2021

@rajinisivaram @tombentley @dajac Can you take a look? Thanks

@dajac (Member) commented Jun 5, 2021

@mimaison Could you briefly describe the core changes that you have made in the PR? That would be helpful to dive into it. Thanks!

@mimaison mimaison changed the title KIP-699: Work in progress KIP-699: Update FindCoordinator to resolve multiple Coordinators at a time Jun 6, 2021
@mimaison (Member Author) commented Jun 6, 2021

@dajac Right, I've updated the description to give some context

@tombentley (Member) left a comment

Made a first pass with a few questions.

Member

Is there some reason why we can't pass the ApiVersions into the coordinator, so we can get the batching right without needing to retry like this?

Member Author

I've only taken a very brief look, and I think this approach would work well for Connect, Producer and Consumer; however, it's a bit more complicated with Admin.

In Admin, requests are built by lookup strategies. Lookups can be sent to any broker, so knowing the max version for a specific call is not completely trivial. That said, it's not impossible either, so if there's consensus that it would be preferable, I can give that a try.

Member

This AdminApiDriver abstraction is pretty new to me, so I might be wide of the mark, but it doesn't appear to handle this very well. The lookup strategy has to build a request without knowing either the broker or the API versions. It would be possible to pass the ApiVersions to the CoordinatorStrategy, which should let you do the right thing based on the minimum of the FindCoordinator API version supported in the whole cluster. (That's not completely perfect, since really you'd want to decide on a per-broker basis, but I think it would be good enough). Sadly it's not quite enough to pass just ApiVersions, since it doesn't really know about the nodes in the cluster, so you'd need to pass Metadata too, which is quite a lot of work. So I can understand why it makes sense to do it like this, since it lets you benefit from the existing logic for figuring out request versions.
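
To make the "minimum across the whole cluster" idea concrete, here is a hypothetical sketch of that version check. ApiVersions and NodeApiVersions are the real client-side classes, but the helper class and the idea of wiring it into the lookup strategy are invented for illustration; that wiring is exactly what this PR does not attempt.

    // Hypothetical helper, not part of this PR: checks whether every known node
    // supports the batched FindCoordinator version before building a batched lookup.
    import java.util.List;

    import org.apache.kafka.clients.ApiVersions;
    import org.apache.kafka.clients.NodeApiVersions;
    import org.apache.kafka.common.Node;
    import org.apache.kafka.common.protocol.ApiKeys;

    final class FindCoordinatorVersionCheck {

        // Returns true only if every known node supports FindCoordinator v4+,
        // i.e. the batched form introduced by KIP-699.
        static boolean clusterSupportsBatching(ApiVersions apiVersions, List<Node> nodes) {
            for (Node node : nodes) {
                NodeApiVersions nodeVersions = apiVersions.get(node.idString());
                if (nodeVersions == null)
                    return false; // unknown node: assume the old, unbatched protocol
                if (nodeVersions.latestUsableVersion(ApiKeys.FIND_COORDINATOR) < 4)
                    return false;
            }
            return true;
        }
    }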

Contributor

@tombentley That's a really good point; perhaps we can file a JIRA to add support in the AdminApiDriver for passing in versions for different API calls. I'm currently working on KIP-709, where I'm trying to introduce batching to the fetch offsets API, and I'll need to do something similar to what Mickael has done here to be able to support all versions, but having native support for knowing the version beforehand would be much more desirable.

@skaundinya15 (Contributor) left a comment

Thanks for the PR @mimaison, fantastic job with the refactor - makes so much of the code much more readable! I made a pass through the non test files and left some comments. Will take another pass at the test files soon.

Comment on lines 138 to 152
Contributor

Seems like there's a copy-paste error; this should be AlterConsumerGroupOffsets. Also, when I compare it to handleError() in DescribeConsumerGroupsHandler, it seems to be slightly different:

    private void handleError(CoordinatorKey groupId, Errors error, Map<CoordinatorKey, Throwable> failed, List<CoordinatorKey> unmapped) {
        switch (error) {
            case GROUP_AUTHORIZATION_FAILED:
                log.error("Received authorization failure for group {} in `DescribeGroups` response", groupId,
                        error.exception());
                failed.put(groupId, error.exception());
                break;
            case COORDINATOR_LOAD_IN_PROGRESS:
            case COORDINATOR_NOT_AVAILABLE:
                break;
            case NOT_COORDINATOR:
                log.debug("DescribeGroups request for group {} returned error {}. Will retry",
                        groupId, error);
                unmapped.add(groupId);
                break;
            default:
                log.error("Received unexpected error for group {} in `DescribeGroups` response", 
                        groupId, error.exception());
                failed.put(groupId, error.exception(
                        "Unexpected error during DescribeGroups lookup for " + groupId));
        }
    }

In the AlterConsumerGroupOffsetsHandler we don't break on COORDINATOR_NOT_AVAILABLE, but we do in DescribeConsumerGroupsHandler - any reason for this? I'd imagine the error handling would be the same across all the consumer group handlers. Perhaps we could factor this out and put it in some generic ConsumerGroupHandler class that implements this and takes in a request name, so it can be used across all the consumer group handlers. What do you think?
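
A hypothetical sketch of what that factored-out handler could look like, adapted from the handleError() quoted above; the class and method names are invented for illustration and are not part of this PR.

    // Hypothetical shared error handling for the coordinator-based group handlers.
    import java.util.List;
    import java.util.Map;

    import org.apache.kafka.clients.admin.internals.CoordinatorKey;
    import org.apache.kafka.common.protocol.Errors;
    import org.slf4j.Logger;

    final class ConsumerGroupErrorHandling {

        static void handleError(String requestName,
                                CoordinatorKey groupId,
                                Errors error,
                                Map<CoordinatorKey, Throwable> failed,
                                List<CoordinatorKey> unmapped,
                                Logger log) {
            switch (error) {
                case GROUP_AUTHORIZATION_FAILED:
                    log.error("Received authorization failure for group {} in `{}` response",
                            groupId, requestName, error.exception());
                    failed.put(groupId, error.exception());
                    break;
                case COORDINATOR_LOAD_IN_PROGRESS:
                case COORDINATOR_NOT_AVAILABLE:
                    // Transient errors: leaving the key out of both collections
                    // makes the driver retry the lookup.
                    break;
                case NOT_COORDINATOR:
                    log.debug("{} request for group {} returned error {}. Will retry",
                            requestName, groupId, error);
                    unmapped.add(groupId);
                    break;
                default:
                    log.error("Received unexpected error for group {} in `{}` response",
                            groupId, requestName, error.exception());
                    failed.put(groupId, error.exception(
                            "Unexpected error during " + requestName + " lookup for " + groupId));
            }
        }
    }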

Member Author

Yes, some of these handlers differ in pretty subtle ways. Unfortunately, if we remove this break statement, some of the existing tests stop working. In this PR, I've aimed at not touching the existing tests to ensure my refactoring doesn't change the current behaviour.

I agree, this is very confusing, and I think we should align the behaviour of all these similar calls and have generic error handling logic. But I would prefer not to do it as part of this PR as it's already pretty big.

Contributor

@mimaison Makes sense, it would be good to file a JIRA for this so we can address it in a future PR and ensure we have consistent error handling across all the consumer group related handlers.


Contributor

Would we consider something like COORDINATOR_NOT_AVAILABLE a transient exception that we should retry on? I'm a bit confused since we don't log anything or break on this case. Either way, I think we should log it and put it in the failedKeys map if we think that kind of exception should be considered a failure, or leave it alone if we think it's a more transient error.

@mimaison (Member Author) Jun 17, 2021

This is the current behaviour: when hitting these errors, we retry the FindCoordinator request. The way to signal that we want to retry is to omit the key from the return value of handleResponse(), see https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/admin/internals/AdminApiLookupStrategy.java#L71-L74

So I think we want to keep it as it is.
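
A self-contained sketch of that retry mechanism; the real types live in AdminApiLookupStrategy (see the link above), so the simplified LookupResult and response interface below are stand-ins for illustration only.

    // Illustrative only: keys returned in neither map are retried by the driver.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    final class LookupRetrySketch {

        static final class LookupResult<K> {
            final Map<K, Throwable> failedKeys; // completed exceptionally
            final Map<K, Integer> mappedKeys;   // mapped to a coordinator (broker id)

            LookupResult(Map<K, Throwable> failedKeys, Map<K, Integer> mappedKeys) {
                this.failedKeys = failedKeys;
                this.mappedKeys = mappedKeys;
            }
        }

        interface CoordinatorAnswer {
            boolean isError(String key);
            boolean isRetriable(String key); // e.g. COORDINATOR_NOT_AVAILABLE
            Throwable error(String key);
            int nodeId(String key);
        }

        static LookupResult<String> handleResponse(Set<String> keys, CoordinatorAnswer response) {
            Map<String, Throwable> failed = new HashMap<>();
            Map<String, Integer> mapped = new HashMap<>();
            for (String key : keys) {
                if (!response.isError(key)) {
                    mapped.put(key, response.nodeId(key));
                } else if (!response.isRetriable(key)) {
                    failed.put(key, response.error(key));
                }
                // Retriable errors: the key is omitted from both maps, so the
                // driver issues another FindCoordinator request for it.
            }
            return new LookupResult<>(failed, mapped);
        }
    }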

@skaundinya15 (Contributor) Jun 21, 2021

@mimaison Gotcha, makes sense. Can we still log at DEBUG level? That way, if we need to debug, we have evidence of this in the logs.

@mimaison (Member Author)

@skaundinya15 Thanks for the review. I believe I've addressed all your comments.

@tombentley (Member) left a comment

This is looking pretty good; a few more nits and questions.

Member

Is it necessarily a broker, or could it be a KRaft controller?

Member Author

Not entirely sure about the new naming in place now, but does that still count as a broker?

Member

Well, process.roles is documented like this in config/kraft/README.md:

  • If process.roles is set to broker, the server acts as a broker in KRaft mode.
  • If process.roles is set to controller, the server acts as a controller in KRaft mode.
  • If process.roles is set to broker,controller, the server acts as both a broker and a controller in KRaft mode.

which suggests to me that we're using the term "broker" to mean "a thing which handles Produce and Fetch etc". (However, "controller" is a bit confusing there, since while we might have several servers in the "controller" role only one will be the controller at any one time. I'm not aware of a good term for "server that is participating in the raft cluster, but might not be the current controller right now").

Member Author

Ok, thanks for the clarifications. I feel like broker is still fine here. Otherwise, maybe node?

@mimaison (Member Author)

@tombentley Thanks for the review! I've addressed your findings.

@tombentley (Member) left a comment

LGTM

@dajac (Member) left a comment

Thanks for the PR. I have made a pass over it. Overall, it looks good. A small suggestion: it would be great if we could unify the formatting style. For instance, the style and indentation of the method declarations are not consistent across the board.

@mimaison (Member Author)

Thanks @dajac for the review, I've addressed your comments.

@skaundinya15 (Contributor) left a comment

Thanks for the updates @mimaison! Just left a couple minor comments, other than that LGTM! Really excited for this patch to make it in :)

Contributor

Why do we call prepareOldResponse in getErrorResponse? Should we be doing a version check here?

@dajac (Member) left a comment

@mimaison Thanks for the updates. I left a few more comments/questions.

Comment on lines 61 to 65
Member

nit: I wonder if differentiating the two is required here. It seems that we don't use them anywhere else so we could use one common REQUEST_SCOPE for both cases.

Member Author

Right, I merged both as BATCH_REQUEST_SCOPE

Member

If I remember correctly, this new exception was not specified in the KIP. Should we update it and notify the thread in the mailing list?

Member Author

Right, I'll update the KIP and thread

Member

It seems like a mistake to add this to the public errors package when it should never propagate outside the client. An alternative could be to make it a static member class of FindCoordinatorRequest, or to put it in some other package.

Member

Yeah, I was also thinking about this. It would definitely be better if we could keep it private.

Member Author

I agree, I'll move it to an inner class of FindCoordinatorRequest and update the KIP accordingly

Member

We need to discuss this further.

Member Author

I pushed a change that moves NoBatchedFindCoordinatorsException into FindCoordinatorRequest. Do you have any further concerns?

Member

Nope, that's fine. Thanks.

Member

I wonder if we should catch UnsupportedVersionException here to handle the general case. Have you considered it? Also, the name uble looks weird. Is it intentional?

Member Author

Yes, I looked into it, but I think there's one test in TransactionManagerTest that expects UnsupportedVersionException to be thrown.
MockClient works in slightly different ways than the real client; it would be good to address this, but I'd rather defer it to a follow-up PR.

Member

Understood. Should we file a Jira about this?


@mimaison (Member Author)

@tombentley @dajac @skaundinya15 Thanks for the reviews! I believe I've addressed all your comments now.

@skaundinya15 (Contributor) left a comment

Left a few very minor nit comments, other than that this looks great to me!

@dajac (Member) left a comment

Thanks for the update. I left a few more questions/comments.

Member

It feels a bit weird to handle a special case like this one in the driver. It is probably OK for the time being, but I think we don't want to add more custom cases like this in the driver. I wonder if we could delegate the decision to the handler. We could add a handleUnsupportedVersionException method to the AdminApiHandler for this purpose. That method could basically return the keys to unmap and the keys to complete with the exception.

An alternative would be to rely on the handleUnsupportedVersionException method in Call. The driver could implement it and still delegate the decision to the handler. The advantage of using this method is that downgrades would not be counted as failures and thus would not count towards the retries.

Have you considered something like this?
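
A hypothetical sketch of what that delegation could look like; nothing below exists in this PR, it only illustrates the idea of returning the keys to unmap and the keys to complete with the exception.

    // Hypothetical interface, for illustration only: the handler decides which
    // keys to retry with a downgraded request (unmap) and which to fail.
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    import org.apache.kafka.common.errors.UnsupportedVersionException;

    interface HandlesUnsupportedVersion<K> {

        final class Decision<K> {
            final Set<K> keysToUnmap;           // retried with a downgraded request
            final Map<K, Throwable> keysToFail; // completed exceptionally

            Decision(Set<K> keysToUnmap, Map<K, Throwable> keysToFail) {
                this.keysToUnmap = keysToUnmap;
                this.keysToFail = keysToFail;
            }
        }

        // Called when a request fails with UnsupportedVersionException;
        // the default is to fail every key rather than retry.
        default Decision<K> handleUnsupportedVersionException(
                Set<K> keys, UnsupportedVersionException exception) {
            Map<K, Throwable> failed = new HashMap<>();
            for (K key : keys) {
                failed.put(key, exception);
            }
            return new Decision<>(Collections.emptySet(), failed);
        }
    }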

Member Author

AdminApiDriver was still evolving rapidly while I was implementing this KIP, so I went for the straightforward approach.

But I agree, it would be best to avoid this type of logic here. The goal would be to find a mechanism that works for all clients. @tombentley suggested an alternative option in #10743 (comment).

I've not had the time to look into better alternatives yet.

Member

Should we file a Jira to not forget about improving this?


Member

nit: Indentation seems off here.

Member

We need to discuss this further.

Member

nit: Could we add the KIP number here? I would also explain the change a bit more. The comment you have in the response is much better, for instance.

Member

nit: Could we add the KIP number here?

@dajac (Member) left a comment

LGTM. @mimaison Thanks for the PR! Could you file the follow-up JIRAs that have been discussed in the comments? I have spotted two remaining ones: 1) one to address the error handling in the driver; 2) one to harmonize the error handling in the handlers.

@mimaison (Member Author)

@dajac Thanks, I opened JIRAs for all the follow-up work items we identified.

@mimaison (Member Author) commented Jul 1, 2021

I've rebased on trunk to resolve conflicts and I've had to make small changes due to updates on trunk. All new changes are in ae09d74.

I'll let Jenkins run now and I'll merge later if there are no issues.

@mimaison mimaison merged commit f5d5f65 into apache:trunk Jul 1, 2021
@mimaison mimaison deleted the kip-699 branch July 1, 2021 21:05
xdgrulez pushed a commit to xdgrulez/kafka that referenced this pull request Dec 22, 2021
KIP-699: Update FindCoordinator to resolve multiple Coordinators at a time (apache#10743)

This implements KIP-699: https://cwiki.apache.org/confluence/display/KAFKA/KIP-699%3A+Update+FindCoordinator+to+resolve+multiple+Coordinators+at+a+time

It updates FindCoordinator request and response to support resolving multiple coordinators at a time. If a broker does not support the new FindCoordinator version, clients can revert to the previous behaviour and use a request for each coordinator.

Reviewers: David Jacot <djacot@confluent.io>, Tom Bentley <tbentley@redhat.com>, Sanjana Kaundinya <skaundinya@gmail.com>