KAFKA-13059: Make DeleteConsumerGroupOffsetsHandler unmap for COORDINATOR_NOT_AVAILABLE error and fix issue #11019
Put every error into partitionResults, as the old logic did.
Refer to #11016, we don't return any completed/failed results if we need to retry.
| public void testDeleteConsumerGroupOffsets() throws Exception { | ||
| // Happy path | ||
| public void testDeleteConsumerGroupOffsetsResponseIncludeCoordinatorErrorAndNoneError() throws Exception { |
Added a test where the partition responses contain both a coordinator error and Errors.NONE entries. We should retry in that case, too.
@dajac @rajinisivaram @mimaison, please help take a look. Thanks.
| partitions.put(new TopicPartition(topic.name(), partition.partitionIndex()), partitionError); | ||
| final Map<TopicPartition, Errors> partitionResults = new HashMap<>(); | ||
| response.data().topics().forEach(topic -> | ||
| topic.partitions().forEach(partitionoffsetDeleteResponse -> { |
nit: Should we keep partition instead of partitionoffsetDeleteResponse? It is a bit more concise.
Updated. Thanks.
dajac
left a comment
@showuon Thanks. Left a few comments.
| Errors partitionError = Errors.forCode(partitionoffsetDeleteResponse.errorCode()); | ||
| TopicPartition topicPartition = new TopicPartition(topic.name(), partitionoffsetDeleteResponse.partitionIndex()); | ||
| if (partitionError != Errors.NONE) { | ||
| handlePartitionError(groupId, partitionError, topicPartition, groupsToUnmap, groupsToRetry); |
I am actually not sure about this. Looking at the code on the broker side, it seems that group errors are always returned in the top level error field. I think that we could simply return the partition errors without checking them.
Yes, I originally did it the way you suggested, but a test failed because of that change: testDeleteConsumerGroupOffsetsNumRetries in KafkaAdminClientTest. It puts NOT_COORDINATOR in the partition error and expects a retry. That's why I changed it to this.
What do you think?
I see. I think that it used to work because ConsumerGroupOperationContext.hasCoordinatorMoved relied on response.errorCount(). I think that the unit test is incorrect in this case.
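To illustrate why the old test could pass: if the coordinator-moved check looks at an aggregate of error counts that mixes the top-level error with the partition-level errors, a NOT_COORDINATOR placed on a partition is enough to trigger it. The following is a minimal sketch under that assumption. `ErrorCountSketch`, `errorCounts`, and `hasCoordinatorMoved` are hypothetical stand-ins written with plain strings, not the Kafka classes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; error names are plain strings here.
final class ErrorCountSketch {
    // Counts the top-level error together with every partition-level error.
    static Map<String, Integer> errorCounts(String topLevelError, List<String> partitionErrors) {
        Map<String, Integer> counts = new HashMap<>();
        counts.merge(topLevelError, 1, Integer::sum);
        for (String error : partitionErrors) {
            counts.merge(error, 1, Integer::sum);
        }
        return counts;
    }

    // A check built on the aggregate cannot tell which level reported the error.
    static boolean hasCoordinatorMoved(Map<String, Integer> counts) {
        return counts.containsKey("NOT_COORDINATOR");
    }
}
```

With a check like this, a response whose top-level error is NONE but whose partition carries NOT_COORDINATOR still looks like a coordinator change, which matches what the old unit test relied on.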
| log.error("Received non retriable error for group {} in `{}` response", groupId, | ||
| apiName(), error.exception()); |
Could we make the error messages uniform? For instance: `OffsetDelete` request for group id {} failed due to error {}. I would also log it at debug level, and we don't need to pass the exception to the logger; it doesn't add much here.
| groupsToUnmap.add(groupId); | ||
| break; | ||
| default: | ||
| final String unexpectedErrorMsg = String.format("Received unexpected error for group %s in `%s` response", |
unexpectedErrorMsg is not necessary as it is used only once. I would also follow the same pattern that we use for the other messages.
| case COORDINATOR_LOAD_IN_PROGRESS: | ||
| // If the coordinator is in the middle of loading, then we just need to retry | ||
| log.debug("`{}` request for group {} failed because the coordinator" + | ||
| " is still in the process of loading state. Will retry.", apiName(), groupId); |
I am not a fan of using apiName() here because the name offsetDelete does not start with a capital letter.
| Map<CoordinatorKey, Map<TopicPartition, Errors>> completed = new HashMap<>(); | ||
| Map<CoordinatorKey, Throwable> failed = new HashMap<>(); | ||
| List<CoordinatorKey> unmapped = new ArrayList<>(); | ||
| final Set<CoordinatorKey> groupsToUnmap = new HashSet<>(); |
Not related to this line. Is it worth verifying that groupIds only contains the expected groupId here and in buildRequest? I did it here: https://github.com/apache/kafka/pull/11016/files#diff-72f508d8e6b9b7f8fde5de8b75bedb6e7985824b71d00fb172338ec9c4782651R121.
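One possible shape for such a check, as a sketch only: the handler keeps the single group id it was created for and rejects any other set of keys. `GroupIdCheckSketch` and `validateKeys` are hypothetical names, and plain strings stand in for `CoordinatorKey`; the linked change in #11016 may do this differently.

```java
import java.util.Set;

// Hypothetical illustration; String stands in for CoordinatorKey.
final class GroupIdCheckSketch {
    private final String groupId;

    GroupIdCheckSketch(String groupId) {
        this.groupId = groupId;
    }

    // Throws if the keys are anything other than exactly the expected group id.
    void validateKeys(Set<String> groupIds) {
        if (groupIds.size() != 1 || !groupIds.contains(groupId)) {
            throw new IllegalArgumentException("Received unexpected group ids " + groupIds
                + " (expected only " + groupId + ")");
        }
    }
}
```

Calling the same check from both the request-building and the response-handling paths would keep the two in agreement about which group the handler serves.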
dajac
left a comment
@showuon Thanks for the update. I left a few comments.
| final Errors error = Errors.forCode(response.data().errorCode()); | ||
| if (error != Errors.NONE) { | ||
| handleError(groupId, error, failed, unmapped); | ||
| handleGroupError(groupId, error, failed, groupsToUnmap, groupsToRetry); |
It seems that groupsToRetry is not really necessary in this case. Moreover, we could directly return in the branch as we don't expect errors in the partitions.
if (error != Errors.NONE) {
    final Map<CoordinatorKey, Throwable> failed = new HashMap<>();
    final Set<CoordinatorKey> groupsToUnmap = new HashSet<>();
    handleGroupError(groupId, error, failed, groupsToUnmap);
    return new ApiResult<>(Collections.emptyMap(), failed, new ArrayList<>(groupsToUnmap));
}
groupId will be either in failed or in groupsToUnmap after the call to handleGroupError.
good suggestion! Updated!
| if (!partitions.isEmpty()) | ||
| completed.put(groupId, partitions); | ||
| completed.put(groupId, partitionResults); |
Could we directly return here as well?
return new ApiResult<>(Collections.singletonMap(groupId, partitionResults), Collections.emptyMap(), Collections.emptyList());
I think that it will make the error handling a bit more explicit.
Updated
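Putting the two early returns together, the response handling discussed in this thread could look roughly like the sketch below. It is an illustration under simplifying assumptions, not the merged code: `OffsetDeleteHandlingSketch`, `Result`, and `handle` are hypothetical names, and plain strings replace `CoordinatorKey`, `TopicPartition`, and `Errors`.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the early-return structure.
final class OffsetDeleteHandlingSketch {
    // Simplified three-part result: completed keys, failed keys, keys to look up again.
    static final class Result {
        final Map<String, Map<String, String>> completed;
        final Map<String, String> failed;
        final List<String> unmapped;

        Result(Map<String, Map<String, String>> completed, Map<String, String> failed,
               List<String> unmapped) {
            this.completed = completed;
            this.failed = failed;
            this.unmapped = unmapped;
        }
    }

    static Result handle(String groupId, String groupError, Map<String, String> partitionErrors) {
        if (!groupError.equals("NONE")) {
            switch (groupError) {
                case "COORDINATOR_LOAD_IN_PROGRESS":
                    // Retriable: report nothing so the same request is sent again.
                    return new Result(Collections.emptyMap(), Collections.emptyMap(),
                        Collections.emptyList());
                case "COORDINATOR_NOT_AVAILABLE":
                case "NOT_COORDINATOR":
                    // Unmap the key so the coordinator is looked up again.
                    return new Result(Collections.emptyMap(), Collections.emptyMap(),
                        Collections.singletonList(groupId));
                default:
                    // Any other group-level error fails the whole group.
                    return new Result(Collections.emptyMap(),
                        Collections.singletonMap(groupId, groupError), Collections.emptyList());
            }
        }
        // No group-level error: hand the partition errors back without inspecting them.
        return new Result(Collections.singletonMap(groupId, partitionErrors),
            Collections.emptyMap(), Collections.emptyList());
    }
}
```

Each branch returns its own result, so a reader can see at a glance that a group-level error never coexists with completed partition results.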
| log.error("Received non retriable error for group {} in `DeleteConsumerGroupOffsets` response", groupId, | ||
| error.exception()); | ||
| case NON_EMPTY_GROUP: | ||
| log.debug("`OffsetDelete` request for group id {} failed due to error {}.", groupId, error); |
nit: groupId -> groupId.idValue. There are a few other cases.
Nice catch! I'll also update other PRs.
| break; | ||
| case COORDINATOR_LOAD_IN_PROGRESS: | ||
| // If the coordinator is in the middle of loading, then we just need to retry | ||
| log.debug("`OffsetDelete` request for group {} failed because the coordinator" + |
nit: group -> group id?
Updated. I'll also update other PRs.
| return true; | ||
| // If the coordinator is unavailable or there was a coordinator change, then we unmap | ||
| // the key so that we retry the `FindCoordinator` request | ||
| log.debug("`OffsetDelete` request for group {} returned error {}. " + |
nit: group -> group id?
| new OffsetDeleteResponseData() | ||
| .setTopics(new OffsetDeleteResponseTopicCollection(Stream.of( | ||
| new OffsetDeleteResponseTopic() | ||
| .setName("foo") | ||
| .setPartitions(new OffsetDeleteResponsePartitionCollection(Collections.singletonList( | ||
| new OffsetDeleteResponsePartition() | ||
| .setPartitionIndex(0) | ||
| .setErrorCode(Errors.NONE.code()) | ||
| ).iterator())), | ||
| new OffsetDeleteResponseTopic() | ||
| .setName("bar") | ||
| .setPartitions(new OffsetDeleteResponsePartitionCollection(Collections.singletonList( | ||
| new OffsetDeleteResponsePartition() | ||
| .setPartitionIndex(0) | ||
| .setErrorCode(Errors.GROUP_SUBSCRIBED_TO_TOPIC.code()) | ||
| ).iterator())) | ||
| ).collect(Collectors.toList()).iterator())) |
nit: Is it really better like this? Personally, I prefer the previous indentation.
Sorry, I accidentally did it.
| .setThrottleTimeMs(0) | ||
| .setTopics(new OffsetDeleteResponseTopicCollection(singletonList( | ||
| new OffsetDeleteResponseTopic() | ||
| .setName("t0") |
nit: Could we rely on t0p0 here for the name and the partition?
| .setThrottleTimeMs(0) | ||
| .setTopics(new OffsetDeleteResponseTopicCollection(singletonList( | ||
| new OffsetDeleteResponseTopic() | ||
| .setName("t0") |
ditto.
| Collection<Map<TopicPartition, Errors>> completeCollection = result.completedKeys.values(); | ||
| assertEquals(1, completeCollection.size()); | ||
| Map<TopicPartition, Errors> completeMap = completeCollection.iterator().next(); | ||
| assertEquals(expectedResult, completeMap); |
You already assert that completedKeys only contains key so it seems that we could just verify that result.completedKeys.get(key) is equal to expectedResult, no?
Good suggestion! Updated.
| assertEquals(emptyList(), result.unmappedKeys); | ||
| assertEquals(emptySet(), result.failedKeys.keySet()); | ||
| } | ||
| } |
nit: Could we add the empty line back?
| new OffsetDeleteResponseData() | ||
| .setThrottleTimeMs(0) | ||
| .setTopics(new OffsetDeleteResponseTopicCollection(singletonList( | ||
| new OffsetDeleteResponseTopic() | ||
| .setName(t0p0.topic()) | ||
| .setPartitions(new OffsetDeleteResponsePartitionCollection(singletonList( | ||
| new OffsetDeleteResponsePartition() | ||
| .setPartitionIndex(t0p0.partition()) | ||
| .setErrorCode(error.code()) | ||
| ).iterator())) | ||
| ).iterator())) | ||
| ); |
indent fix
dajac
left a comment
LGTM, thanks for the patch!
Failures are not related:
…ATOR_NOT_AVAILABLE error (#11019) This patch improves the error handling in `DeleteConsumerGroupOffsetsHandler`. `COORDINATOR_NOT_AVAILABLE` is now unmapped to trigger a new find coordinator request to be sent out. Reviewers: David Jacot <djacot@confluent.io>
Merged to trunk and to 3.0. cc @kkonstantine
Some issues found in the DeleteConsumerGroupOffsetsHandler:
When coordinator errors are put in the topic partitions, plus an Errors.NONE, we fail with IllegalArgumentException: Partition foo was not included in the original request. This is the newly added test case scenario: testDeleteConsumerGroupOffsetsResponseIncludeCoordinatorErrorAndNoneError.
In DeleteConsumerGroupOffsetsHandlerTest, we build all errors in the partition result, including group errors. Split group error tests and partition error tests.
This is the old handle response logic. FYR: