KAFKA-13059: Make DeleteConsumerGroupOffsetsHandler unmap for COORDINATOR_NOT_AVAILABLE error and fix issue #11019

showuon · 2021-07-12T06:56:43Z

Some issues found in the DeleteConsumerGroupOffsetsHandler:

If a coordinator error is put in a topic partition result alongside an Errors.NONE, we fail with IllegalArgumentException: Partition foo was not included in the original request. This is the scenario covered by the newly added test case testDeleteConsumerGroupOffsetsResponseIncludeCoordinatorErrorAndNoneError.
Not all possible exceptions were handled, so an "expected" exception could be logged as an "unexpected exception".
In DeleteConsumerGroupOffsetsHandlerTest, we built all errors into the partition results, including group errors. This PR splits the group error tests from the partition error tests.

This is the old handle response logic. FYR:

void handleResponse(AbstractResponse abstractResponse) {
    final OffsetDeleteResponse response = (OffsetDeleteResponse) abstractResponse;

    // If coordinator changed since we fetched it, retry
    // note: we use `errorCounts` to collect all errors in the response, including partition errors.
    if (ConsumerGroupOperationContext.hasCoordinatorMoved(response)) {
        Call call = getDeleteConsumerGroupOffsetsCall(context, partitions);
        rescheduleFindCoordinatorTask(context, () -> call, this);
        return;
    }

    // If the error is an error at the group level, the future is failed with it
    final Errors groupError = Errors.forCode(response.data().errorCode());
    if (handleGroupRequestError(groupError, context.future()))
        return;

    final Map<TopicPartition, Errors> partitions = new HashMap<>();
    response.data().topics().forEach(topic -> topic.partitions().forEach(partition -> partitions.put(
        new TopicPartition(topic.name(), partition.partitionIndex()),
        Errors.forCode(partition.errorCode())))
    );

    context.future().complete(partitions);
}

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

showuon · 2021-07-12T06:59:15Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

put every error into partitionResults, as the log logic did

showuon · 2021-07-12T07:00:39Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

As in #11016, we don't return any completed or failed results if we need to retry.

showuon · 2021-07-12T07:04:27Z

clients/src/test/java/org/apache/kafka/clients/admin/KafkaAdminClientTest.java

-    public void testDeleteConsumerGroupOffsets() throws Exception {
-        // Happy path
-
+    public void testDeleteConsumerGroupOffsetsResponseIncludeCoordinatorErrorAndNoneError() throws Exception {


Added a test whose partition responses include a coordinator error together with NONE errors. We should retry in that case, too.

showuon · 2021-07-12T07:05:40Z

@dajac @rajinisivaram @mimaison, please take a look. Thanks.

dajac · 2021-07-13T12:47:11Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

-                        partitions.put(new TopicPartition(topic.name(), partition.partitionIndex()), partitionError);
+            final Map<TopicPartition, Errors> partitionResults = new HashMap<>();
+            response.data().topics().forEach(topic ->
+                topic.partitions().forEach(partitionoffsetDeleteResponse -> {


nit: Should we keep partition instead of partitionoffsetDeleteResponse? It is a bit more concise.

Updated. Thanks.

dajac

@showuon Thanks. Left a few comments.

dajac · 2021-07-13T12:48:26Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

+                    Errors partitionError = Errors.forCode(partitionoffsetDeleteResponse.errorCode());
+                    TopicPartition topicPartition = new TopicPartition(topic.name(), partitionoffsetDeleteResponse.partitionIndex());
+                    if (partitionError != Errors.NONE) {
+                        handlePartitionError(groupId, partitionError, topicPartition, groupsToUnmap, groupsToRetry);


I am actually not sure about this. Looking at the code on the broker side, it seems that group errors are always returned in the top level error field. I think that we could simply return the partition errors without checking them.

Yes, I initially did it the way you suggested, but a test failed because of that change: testDeleteConsumerGroupOffsetsNumRetries in KafkaAdminClientTest. It puts NOT_COORDINATOR in a partition error and expects a retry. That's why I changed it to this.
What do you think?

I see. I think that it used to work because ConsumerGroupOperationContext.hasCoordinatorMoved relied on response.errorCounts(), which also includes the partition errors. I think that the unit test is incorrect in this case.
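To make the point above concrete, here is a minimal, self-contained sketch (not the Kafka code) of why a check built on aggregated error counts also reacts to partition-level errors. The `Err` enum and the `errorCounts` and `hasCoordinatorMoved` helpers are hypothetical stand-ins written for this illustration; they only mimic the behaviour described in the old `handleResponse` comment.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class ErrorCountsSketch {
    // Hypothetical stand-in for a few values of Kafka's Errors enum.
    enum Err { NONE, NOT_COORDINATOR, GROUP_SUBSCRIBED_TO_TOPIC }

    // Counts the top-level (group) error and every partition-level error together,
    // so the caller can no longer tell which level an error came from.
    static Map<Err, Integer> errorCounts(Err groupError, List<Err> partitionErrors) {
        Map<Err, Integer> counts = new EnumMap<>(Err.class);
        counts.merge(groupError, 1, Integer::sum);
        for (Err e : partitionErrors) {
            counts.merge(e, 1, Integer::sum);
        }
        return counts;
    }

    // Assumed shape of the old check: any NOT_COORDINATOR in the counts means "retry".
    static boolean hasCoordinatorMoved(Map<Err, Integer> counts) {
        return counts.containsKey(Err.NOT_COORDINATOR);
    }

    public static void main(String[] args) {
        // Group-level error is NONE, but one partition carries NOT_COORDINATOR.
        Map<Err, Integer> counts =
            errorCounts(Err.NONE, Arrays.asList(Err.NOT_COORDINATOR, Err.NONE));
        System.out.println(hasCoordinatorMoved(counts));
    }
}
```

Under these assumptions, a test that places NOT_COORDINATOR in a partition result would trigger the retry path even though the group-level error is NONE, which matches the behaviour the old unit test relied on.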

dajac · 2021-07-13T13:44:23Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

+                log.error("Received non retriable error for group {} in `{}` response", groupId,
+                    apiName(), error.exception());


Could we make the error messages uniform? For instance: OffsetDelete request for group id {} failed due to error {}. I would also log it at debug level, and we don't need to pass the exception to the logger. The exception doesn't add much here.

dajac · 2021-07-13T13:44:54Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

+                groupsToUnmap.add(groupId);
+                break;
+            default:
+                final String unexpectedErrorMsg = String.format("Received unexpected error for group %s in `%s` response",


unexpectedErrorMsg is not necessary as it is used only once. I would also follow the same pattern that we use for other messages.

dajac · 2021-07-13T13:47:10Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

+            case COORDINATOR_LOAD_IN_PROGRESS:
+                // If the coordinator is in the middle of loading, then we just need to retry
+                log.debug("`{}` request for group {} failed because the coordinator" +
+                    " is still in the process of loading state. Will retry.", apiName(), groupId);


I am not a fan of using apiName() here because the name offsetDelete does not start with a capital letter.

dajac · 2021-07-13T13:48:15Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

        Map<CoordinatorKey, Map<TopicPartition, Errors>> completed = new HashMap<>();
        Map<CoordinatorKey, Throwable> failed = new HashMap<>();
-        List<CoordinatorKey> unmapped = new ArrayList<>();
+        final Set<CoordinatorKey> groupsToUnmap = new HashSet<>();


Not related to this line. Is it worth verifying that groupIds only contains the expected groupId here and in buildRequest? I did it here: https://github.com/apache/kafka/pull/11016/files#diff-72f508d8e6b9b7f8fde5de8b75bedb6e7985824b71d00fb172338ec9c4782651R121.
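One possible shape of such a check, as a minimal sketch: the class name, the `String` key type (standing in for `CoordinatorKey`), and the exception message are assumptions made for this illustration, not the code from the linked PR.

```java
import java.util.Collections;
import java.util.Set;

public class ValidateKeysSketch {
    // Hypothetical handler state: the single group id this handler was created for.
    private final String groupId;

    ValidateKeysSketch(String groupId) {
        this.groupId = groupId;
    }

    // Rejects any key set other than exactly {groupId}; intended to be called
    // at the start of both buildRequest and handleResponse.
    void validateKeys(Set<String> groupIds) {
        if (!groupIds.equals(Collections.singleton(groupId))) {
            throw new IllegalArgumentException("Received unexpected group ids " + groupIds +
                " (expected only " + Collections.singleton(groupId) + ")");
        }
    }
}
```

Failing fast here keeps a mismatched key from silently ending up in the completed or failed results.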

dajac

@showuon Thanks for the update. I left a few comments.

dajac · 2021-07-14T08:21:24Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

        final Errors error = Errors.forCode(response.data().errorCode());
        if (error != Errors.NONE) {
-            handleError(groupId, error, failed, unmapped);
+            handleGroupError(groupId, error, failed, groupsToUnmap, groupsToRetry);


It seems that groupsToRetry is not really necessary in this case. Moreover, we could directly return in the branch as we don't expect errors in the partitions.

if (error != Errors.NONE) {
    final Map<CoordinatorKey, Throwable> failed = new HashMap<>();
    final Set<CoordinatorKey> groupsToUnmap = new HashSet<>();
    handleGroupError(groupId, error, failed, groupsToUnmap);
    return new ApiResult<>(Collections.emptyMap(), failed, new ArrayList<>(groupsToUnmap));
}

groupId will be either in failed or in groupsToUnmap after the call to handleGroupError.

good suggestion! Updated!
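For readers following the thread, the early-return structure discussed here can be sketched in a self-contained form. Everything below is a hypothetical stand-in for illustration: `Err` mimics a subset of `Errors`, `Result` mimics `ApiResult`, and plain strings replace `CoordinatorKey` and `TopicPartition`. The error classification follows the cases quoted in this review (retry while loading, unmap on coordinator unavailable or moved, fail otherwise).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GroupErrorSketch {
    // Hypothetical stand-in for a subset of Kafka's Errors enum.
    enum Err { NONE, COORDINATOR_LOAD_IN_PROGRESS, COORDINATOR_NOT_AVAILABLE, NOT_COORDINATOR, NON_EMPTY_GROUP }

    // Hypothetical stand-in for ApiResult: completed, failed, and unmapped keys.
    static final class Result {
        final Map<String, Map<String, Err>> completed;
        final Map<String, Throwable> failed;
        final List<String> unmapped;

        Result(Map<String, Map<String, Err>> completed, Map<String, Throwable> failed, List<String> unmapped) {
            this.completed = completed;
            this.failed = failed;
            this.unmapped = unmapped;
        }
    }

    static void handleGroupError(String groupId, Err error,
                                 Map<String, Throwable> failed, Set<String> groupsToUnmap) {
        switch (error) {
            case COORDINATOR_LOAD_IN_PROGRESS:
                // Coordinator is still loading: report nothing for the key so it is retried.
                break;
            case COORDINATOR_NOT_AVAILABLE:
            case NOT_COORDINATOR:
                // Unmap the key so that the coordinator is looked up again.
                groupsToUnmap.add(groupId);
                break;
            default:
                // Any other group-level error fails the key.
                failed.put(groupId, new RuntimeException(error.name()));
        }
    }

    static Result handleResponse(String groupId, Err groupError, Map<String, Err> partitionResults) {
        if (groupError != Err.NONE) {
            Map<String, Throwable> failed = new HashMap<>();
            Set<String> groupsToUnmap = new HashSet<>();
            handleGroupError(groupId, groupError, failed, groupsToUnmap);
            // Return early: partition results are not inspected on a group-level error.
            return new Result(Collections.emptyMap(), failed, new ArrayList<>(groupsToUnmap));
        }
        // No group-level error: partition errors are returned as-is, without further checks.
        return new Result(Collections.singletonMap(groupId, partitionResults),
            Collections.emptyMap(), Collections.emptyList());
    }
}
```

Returning from each branch means a single response can never populate `completed` alongside `failed` or `unmapped`, which is what makes the error handling more explicit.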

dajac · 2021-07-14T08:23:19Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

-            if (!partitions.isEmpty())
-                completed.put(groupId, partitions);
+
+            completed.put(groupId, partitionResults);


Could we directly return here as well?

return new ApiResult<>(Collections.singletonMap(groupId, partitionResults), Collections.emptyMap(), Collections.emptyList());

I think that it will make the error handling a bit more explicit.

dajac · 2021-07-14T08:23:43Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

-                log.error("Received non retriable error for group {} in `DeleteConsumerGroupOffsets` response", groupId,
-                        error.exception());
+            case NON_EMPTY_GROUP:
+                log.debug("`OffsetDelete` request for group id {} failed due to error {}.", groupId, error);


nit: groupId -> groupId.idValue. There are a few other cases.

Nice catch! I'll also update other PRs.

dajac · 2021-07-14T08:24:00Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

+                break;
            case COORDINATOR_LOAD_IN_PROGRESS:
+                // If the coordinator is in the middle of loading, then we just need to retry
+                log.debug("`OffsetDelete` request for group {} failed because the coordinator" +


nit: group -> group id?

Updated. I'll also update other PRs.

dajac · 2021-07-14T08:24:04Z

...rc/main/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandler.java

-                return true;
+                // If the coordinator is unavailable or there was a coordinator change, then we unmap
+                // the key so that we retry the `FindCoordinator` request
+                log.debug("`OffsetDelete` request for group {} returned error {}. " +


nit: group -> group id?

dajac · 2021-07-14T08:25:56Z

clients/src/test/java/org/apache/kafka/clients/admin/KafkaAdminClientTest.java

+                    new OffsetDeleteResponseData()
+                        .setTopics(new OffsetDeleteResponseTopicCollection(Stream.of(
+                            new OffsetDeleteResponseTopic()
+                                .setName("foo")
+                                .setPartitions(new OffsetDeleteResponsePartitionCollection(Collections.singletonList(
+                                    new OffsetDeleteResponsePartition()
+                                        .setPartitionIndex(0)
+                                        .setErrorCode(Errors.NONE.code())
+                                ).iterator())),
+                            new OffsetDeleteResponseTopic()
+                                .setName("bar")
+                                .setPartitions(new OffsetDeleteResponsePartitionCollection(Collections.singletonList(
+                                    new OffsetDeleteResponsePartition()
+                                        .setPartitionIndex(0)
+                                        .setErrorCode(Errors.GROUP_SUBSCRIBED_TO_TOPIC.code())
+                                ).iterator()))
+                        ).collect(Collectors.toList()).iterator()))


nit: Is it really better like this? Personally, I prefer the previous indentation.

Sorry, I accidentally did it.

dajac · 2021-07-14T08:27:44Z

...est/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandlerTest.java

+                .setThrottleTimeMs(0)
+                .setTopics(new OffsetDeleteResponseTopicCollection(singletonList(
+                    new OffsetDeleteResponseTopic()
+                        .setName("t0")


nit: Could we rely on t0p0 here for the name and the partition?

dajac · 2021-07-14T08:28:18Z

...est/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandlerTest.java

+                .setThrottleTimeMs(0)
+                .setTopics(new OffsetDeleteResponseTopicCollection(singletonList(
+                    new OffsetDeleteResponseTopic()
+                        .setName("t0")


dajac · 2021-07-14T08:30:53Z

...est/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandlerTest.java

+        Collection<Map<TopicPartition, Errors>> completeCollection = result.completedKeys.values();
+        assertEquals(1, completeCollection.size());
+        Map<TopicPartition, Errors> completeMap = completeCollection.iterator().next();
+        assertEquals(expectedResult, completeMap);


You already assert that completedKeys only contains key so it seems that we could just verify that result.completedKeys.get(key) is equal to expectedResult, no?

Good suggestion! Updated.

dajac · 2021-07-14T08:31:07Z

...est/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandlerTest.java

+        assertEquals(emptyList(), result.unmappedKeys);
+        assertEquals(emptySet(), result.failedKeys.keySet());
+    }
+}


nit: Could we add the empty line back?

showuon · 2021-07-15T03:49:34Z

...est/java/org/apache/kafka/clients/admin/internals/DeleteConsumerGroupOffsetsHandlerTest.java

+            new OffsetDeleteResponseData()
+                .setThrottleTimeMs(0)
+                .setTopics(new OffsetDeleteResponseTopicCollection(singletonList(
+                    new OffsetDeleteResponseTopic()
+                        .setName(t0p0.topic())
+                        .setPartitions(new OffsetDeleteResponsePartitionCollection(singletonList(
+                            new OffsetDeleteResponsePartition()
+                                .setPartitionIndex(t0p0.partition())
+                                .setErrorCode(error.code())
+                        ).iterator()))
+                ).iterator()))
+        );


dajac

LGTM, thanks for the patch!

dajac · 2021-07-15T12:14:40Z

Failures are not related:

Build / JDK 11 and Scala 2.13 / testCommitTransactionTimeout() – kafka.api.TransactionsTest
Build / JDK 11 and Scala 2.13 / shouldBeAbleToQueryFilterState – org.apache.kafka.streams.integration.QueryableStateIntegrationTest

…ATOR_NOT_AVAILABLE error (#11019) This patch improves the error handling in `DeleteConsumerGroupOffsetsHandler`. `COORDINATOR_NOT_AVAILABLE` is now unmapped to trigger a new find coordinator request to be sent out. Reviewers: David Jacot <djacot@confluent.io>

dajac · 2021-07-15T12:19:46Z

Merged to trunk and to 3.0. cc @kkonstantine

KAFKA-13059: refactor DeleteConsumerGroupOffsetsHandler and tests

68d9956

showuon force-pushed the KAFKA-13059 branch from c14c2c9 to 68d9956 Compare July 12, 2021 07:01

showuon mentioned this pull request Jul 13, 2021

KAFKA-13033: COORDINATOR_NOT_AVAILABLE should be unmapped #10973

Closed

dajac reviewed Jul 13, 2021

KAFKA-13059: rename back to 'partition'

cbc6743

dajac reviewed Jul 13, 2021

KAFKA-13059: don't handle partition error and fix tests

68a491e

showuon force-pushed the KAFKA-13059 branch from 8757207 to 68a491e Compare July 14, 2021 02:52

showuon added 2 commits July 14, 2021 15:18

Merge branch 'trunk' of https://github.com/apache/kafka into KAFKA-13059

b717f41

KAFKA-13059: refactor

7c1cd0a

dajac reviewed Jul 14, 2021

KAFKA-13059: address comments to refactor code

4bc3d9a

showuon changed the title ~~KAFKA-13059: refactor DeleteConsumerGroupOffsetsHandler and tests~~ KAFKA-13059: Make DeleteConsumerGroupOffsetsHandler unmap for COORDINATOR_NOT_AVAILABLE error and fix issue Jul 15, 2021

dajac approved these changes Jul 15, 2021

dajac merged commit 46c91f4 into apache:trunk Jul 15, 2021
