Skip to content

Conversation

@dajac
Copy link
Member

@dajac dajac commented Jul 10, 2021

While reviewing #10973, I have noticed that AlterConsumerGroupOffsetsHandler does not handle partition errors correctly. The issue is that any partition error fails the entire future instead of being passed as an error for its corresponding partition. KafkaAdminClientTest#testOffsetCommitWithMultipleErrors reproduces the bug.

The regression was introduced by KIP-699.

Context: #10973 (comment).

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@dajac dajac changed the title [WIP] Update AlterConsumerGroupOffsetsHandler to handle errors correctly. KAFKA-13058; AlterConsumerGroupOffsetsHandler does not handle partition errors correctly. Jul 11, 2021
@dajac dajac marked this pull request as ready for review July 11, 2021 09:49
@dajac dajac requested review from mimaison and rajinisivaram July 11, 2021 09:49
@dajac
Copy link
Member Author

dajac commented Jul 11, 2021

I am reviewing the other handlers which have been introduced in KIP-699. I will follow-up with other PRs if necessary.

Copy link
Member

@showuon showuon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dajac , thanks for your nice catch and quick fix! I confirmed that the handleResponse in AlterConsumerGroupOffsetsHandler has the same logic as before. That is, we'll always return each partition result (even with error), except coordinator related errors (that needs retry or unmap).

Here is the previous code for other reviewer's reference:

 void handleResponse(AbstractResponse abstractResponse) {
      final OffsetCommitResponse response = (OffsetCommitResponse) abstractResponse;
  
      Map<Errors, Integer> errorCounts = response.errorCounts();
      // 1) If coordinator changed since we fetched it, retry
      // 2) If there is a coordinator error, retry
      if (ConsumerGroupOperationContext.hasCoordinatorMoved(errorCounts) ||
          ConsumerGroupOperationContext.shouldRefreshCoordinator(errorCounts)) {
          Call call = getAlterConsumerGroupOffsetsCall(context, offsets);
          rescheduleFindCoordinatorTask(context, () -> call, this);
          return;
      }
  
      final Map<TopicPartition, Errors> partitions = new HashMap<>();
      for (OffsetCommitResponseTopic topic : response.data().topics()) {
          for (OffsetCommitResponsePartition partition : topic.partitions()) {
              TopicPartition tp = new TopicPartition(topic.name(), partition.partitionIndex());
              Errors error = Errors.forCode(partition.errorCode());
              partitions.put(tp, error);
          }
      }
      context.future().complete(partitions);
  }

Thank you.

} else if (!topicPartitions.containsKey(partition)) {
result.completeExceptionally(new IllegalArgumentException(
"Alter offset for partition \"" + partition + "\" was not attempted"));
this.future.whenComplete((topicPartitions, throwable) -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a change to lambda expression, no content change, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

"Failed altering consumer group offsets for the following partitions: " + partitionsFailed);
}
return this.future.thenApply(topicPartitionErrorsMap -> {
List<TopicPartition> partitionsFailed = topicPartitionErrorsMap.entrySet()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to lambda expression?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

Copy link
Contributor

@rajinisivaram rajinisivaram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dajac Thanks for the PR. Left a few minor comments/questions, apart from that LGTM

topics.add(topic);
private void validateKeys(
Set<CoordinatorKey> groupIds
) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we put args on the same line unless the list is too big

case GROUP_AUTHORIZATION_FAILED:
log.debug("OffsetCommit request for group id {} failed due to error {}.",
groupId.idValue, error);
partitionResults.put(topicPartition, error);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should REBALANCE_IN_PROGRESS and GROUP_AUTHORIZATION_FAILED be added to groupsToRetry?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good question. Prior to KIP-599, we considered them as non retryable errors so I sticked to this here. I think that it might be a good idea to consider them as retryable errors but we should do it consistently for all the group handlers. How about filing a Jira for this and tackling it separately?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can open a JIRA to do it later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Group level errors.
case INVALID_GROUP_ID:
case REBALANCE_IN_PROGRESS:
case INVALID_COMMIT_OFFSET_SIZE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a group-level error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is. It basically indicate that we could write the group metadata to the log so it concerns the group. https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L448


return new ApiResult<>(completed, failed, unmapped);
if (groupsToUnmap.isEmpty() && groupsToRetry.isEmpty()) {
return new ApiResult<>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use ApiResult.completed()

Collections.emptyList()
);
} else {
return new ApiResult<>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use ApiResult.unmapped()

@dajac
Copy link
Member Author

dajac commented Jul 19, 2021

@rajinisivaram Thanks for your comments. I have replied to them add/or addressed them.

Copy link
Contributor

@rajinisivaram rajinisivaram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dajac Thanks for the updates, LGTM

@dajac
Copy link
Member Author

dajac commented Jul 19, 2021

Failed tests are not related:

Build / ARM / org.apache.kafka.clients.ClientUtilsTest.testResolveDnsLookup()
Build / ARM / org.apache.kafka.clients.ClientUtilsTest.testResolveDnsLookup()
Build / JDK 16 and Scala 2.13 / org.apache.kafka.clients.ClientUtilsTest.testResolveDnsLookup()
Build / JDK 16 and Scala 2.13 / org.apache.kafka.clients.ClientUtilsTest.testResolveDnsLookup()
Build / JDK 8 and Scala 2.12 / org.apache.kafka.clients.ClientUtilsTest.testResolveDnsLookup()
Build / JDK 8 and Scala 2.12 / org.apache.kafka.clients.ClientUtilsTest.testResolveDnsLookup()
Build / JDK 8 and Scala 2.12 / kafka.server.epoch.LeaderEpochIntegrationTest.shouldAddCurrentLeaderEpochToMessagesAsTheyAreWrittenToLeader()
Build / JDK 11 and Scala 2.13 / org.apache.kafka.clients.ClientUtilsTest.testResolveDnsLookup()
Build / JDK 11 and Scala 2.13 / org.apache.kafka.clients.ClientUtilsTest.testResolveDnsLookup()
Build / JDK 16 and Scala 2.13 / kafka.api.ConsumerBounceTest.testCloseDuringRebalance()
Build / JDK 16 and Scala 2.13 / kafka.api.ConsumerBounceTest.testCloseDuringRebalance()

@dajac dajac merged commit f7790a0 into apache:trunk Jul 19, 2021
dajac added a commit that referenced this pull request Jul 19, 2021
…on errors correctly. (#11016)

This patch updates `AlterConsumerGroupOffsetsHandler` to handle partition errors correctly. The issue is that any partition error fails the entire future instead of being passed as an error for its corresponding partition. 

Reviewers: Luke Chen <showuon@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
@dajac dajac deleted the minor-kip-699 branch July 19, 2021 14:49
@dajac
Copy link
Member Author

dajac commented Jul 19, 2021

Merged to trunk and 3.0.

xdgrulez pushed a commit to xdgrulez/kafka that referenced this pull request Dec 22, 2021
…on errors correctly. (apache#11016)

This patch updates `AlterConsumerGroupOffsetsHandler` to handle partition errors correctly. The issue is that any partition error fails the entire future instead of being passed as an error for its corresponding partition. 

Reviewers: Luke Chen <showuon@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants