Conversation

@mumrah (Member) commented Aug 19, 2019

Remember the updateVersion of the last update to Metadata so we can avoid unnecessarily checking each partition for leader epoch changes on every call to KafkaConsumer#poll
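For illustration, here is a minimal standalone sketch of the idea with purely hypothetical names (this is not the actual KafkaConsumer/Fetcher code): remember the metadata updateVersion seen on the previous poll and only re-run the per-partition validation when that version has advanced.

import java.util.Collection;
import java.util.function.Consumer;

class LeaderValidationGate<P> {
    private int lastSeenUpdateVersion = -1;

    // Run the potentially expensive per-partition validation only when the
    // cluster metadata has actually changed since the previous call.
    void maybeValidate(int currentUpdateVersion,
                       Collection<P> assignedPartitions,
                       Consumer<P> validatePosition) {
        if (currentUpdateVersion == lastSeenUpdateVersion)
            return; // metadata unchanged: skip the per-partition pass entirely
        lastSeenUpdateVersion = currentUpdateVersion;
        for (P partition : assignedPartitions)
            validatePosition.accept(partition);
    }
}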

@mumrah changed the title from "KAFKA-8806 Only check for leader changes when there is new metadata" to "KAFKA-8806 Reduce calls to validateOffsetsIfNeeded" on Aug 19, 2019
@hachikuji (Contributor) left a comment

Thanks, left a couple comments.

* Create fetch requests for all nodes for which we have assigned partitions
* that have no existing requests in flight.
*/
private Map<Node, FetchSessionHandler.FetchRequestData> prepareFetchRequests() {
@hachikuji (Contributor) commented:

The fetchablePartitions method used below is probably another nice opportunity to use something like forEachAssignedPartition.

@mumrah (Member, Author) replied:

I'm not sure it saves anything in this case. The main benefit of forEachAssignedPartition is avoiding making a copy of the assignment set. Since fetchablePartitions iterates over the internal set directly, I don't think it would help.
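For readers following along, a rough sketch of what a forEachAssignedPartition-style helper looks like, using simplified stand-in types rather than the real SubscriptionState internals: the callback is applied to the internal map in place, so no copy of the assignment set is made.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

class AssignmentSketch<S> {
    // Stand-in for the internal assignment state keyed by topic-partition name.
    private final Map<String, S> assignment = new LinkedHashMap<>();

    synchronized void assign(String partition, S state) {
        assignment.put(partition, state);
    }

    // Visits each assigned partition directly; no intermediate collection is allocated.
    synchronized void forEachAssignedPartition(BiConsumer<String, S> action) {
        assignment.forEach(action);
    }
}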

@ijuma (Member) commented Aug 20, 2019:

We can avoid a copy in the case @hachikuji mentions as well, right? See below:

synchronized List<TopicPartition> fetchablePartitions(Predicate<TopicPartition> isAvailable) {
    return assignment.stream()
            .filter(tpState -> isAvailable.test(tpState.topicPartition()) && tpState.value().isFetchable())
            .map(PartitionStates.PartitionState::topicPartition)
            .collect(Collectors.toList());
}

@mumrah (Member, Author) replied:

Ah ok, I see what he means. I'll look into this

@mumrah (Member, Author) replied:

I made a pass at this and it wasn't so simple. Deferring for now

@ijuma (Member) commented Aug 20, 2019

There are a number of test failures that could be related.

@mumrah (Member, Author) commented Aug 20, 2019

@ijuma, they are related, looking at them now.

mumrah added 4 commits August 20, 2019 10:33
Previously, this would only update the offset and rely on future calls to
Fetcher#maybeValidatePositionForCurrentLeader to get the leader information.
Now that we are only calling maybeValidatePositionForCurrentLeader when the
metadata has updated, we would get stuck after a reset.
@hachikuji (Contributor) left a comment

Thanks, LGTM. I'll merge after the build completes.

@mumrah (Member, Author) commented Aug 21, 2019

retest this please

@ijuma (Member) commented Aug 21, 2019

Two jobs timed out; one had flaky test failures:

kafka.api.GroupEndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl
kafka.api.SaslSslAdminClientIntegrationTest.testDescribeLogDirs
kafka.api.SaslSslAdminClientIntegrationTest.testAlterReplicaLogDirs
org.apache.kafka.connect.integration.RebalanceSourceConnectorsIntegrationTest.testReconfigConnector
kafka.api.DelegationTokenEndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl

@ijuma (Member) commented Aug 21, 2019

retest this please

@hachikuji (Contributor) commented

retest this please

@mumrah (Member, Author) commented Aug 22, 2019

retest this please

@hachikuji (Contributor) commented

retest this please

@ijuma (Member) commented Aug 26, 2019

retest this please

@ijuma (Member) commented Aug 26, 2019

It's a bit concerning that the tests are so flaky in this PR. Have we been checking the failures to see if they're related (before Jenkins deletes them)?

@hachikuji (Contributor) commented

retest this please

@mumrah (Member, Author) commented Aug 5, 2020

Reviving this PR cc @hachikuji @ijuma @andrewchoi5

@mumrah (Member, Author) commented Aug 5, 2020

retest this please

@mumrah (Member, Author) commented Aug 5, 2020

[image attachment]

@mumrah (Member, Author) commented Aug 5, 2020

retest this please

.filter(tpState -> isAvailable.test(tpState.topicPartition()) && tpState.value().isFetchable())
.map(PartitionStates.PartitionState::topicPartition)
.collect(Collectors.toList());
List<TopicPartition> result = new ArrayList<>();
@hachikuji (Contributor) commented:

We should add a small comment that this is in the hot path and is written the "ugly" way for a reason. It's also probably worth mentioning that we do the cheap isFetchable check first.
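A hedged sketch of how the loop-based fetchablePartitions with such a comment might read, using simplified stand-in types (String keys and a minimal TopicPartitionState) rather than the real SubscriptionState internals:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class SubscriptionStateSketch {
    static class TopicPartitionState {
        boolean fetchable;
        boolean isFetchable() { return fetchable; }
    }

    private final Map<String, TopicPartitionState> assignment = new LinkedHashMap<>();

    synchronized List<String> fetchablePartitions(Predicate<String> isAvailable) {
        // This is in the hot path of the consumer poll loop, so it is written as a
        // plain loop rather than a stream to avoid per-element allocations. The cheap
        // isFetchable() check runs before the potentially more expensive predicate.
        List<String> result = new ArrayList<>();
        assignment.forEach((partition, state) -> {
            if (state.isFetchable() && isAvailable.test(partition))
                result.add(partition);
        });
        return result;
    }
}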

@mumrah (Member, Author) commented Aug 11, 2020

Here are some results from the new JMH benchmark that was added (note that the units are milliseconds):

KAFKA-8806:

Benchmark                                                   (partitionCount)  (topicCount)  Mode  Cnt   Score   Error  Units
SubscriptionStateBenchmark.testFetchablePartitions                        50          5000  avgt   15  13.758 ± 0.477  ms/op
SubscriptionStateBenchmark.testHasAllFetchPositions                       50          5000  avgt   15   9.131 ± 0.056  ms/op
SubscriptionStateBenchmark.testPartitionsNeedingValidation                50          5000  avgt   15  11.869 ± 0.403  ms/op
JMH benchmarks done


Trunk:

Benchmark                                                   (partitionCount)  (topicCount)  Mode  Cnt   Score   Error  Units
SubscriptionStateBenchmark.testFetchablePartitions                        50          5000  avgt   15  19.249 ± 0.117  ms/op
SubscriptionStateBenchmark.testHasAllFetchPositions                       50          5000  avgt   15  17.025 ± 1.421  ms/op
SubscriptionStateBenchmark.testPartitionsNeedingValidation                50          5000  avgt   15  13.291 ± 0.646  ms/op
JMH benchmarks done

So for very high partition counts, there seems to be a decent improvement
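For context, a hedged sketch of the general shape of such a JMH benchmark (this is not the actual SubscriptionStateBenchmark from the PR; class, field, and parameter names are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 15)
public class FetchablePartitionsBenchmark {
    @Param({"5000"})
    int topicCount;

    @Param({"50"})
    int partitionCount;

    List<String> partitions;

    @Setup(Level.Trial)
    public void setup() {
        partitions = new ArrayList<>(topicCount * partitionCount);
        for (int t = 0; t < topicCount; t++)
            for (int p = 0; p < partitionCount; p++)
                partitions.add("topic-" + t + "-" + p);
    }

    @Benchmark
    public int testFetchablePartitions() {
        // Stand-in for one pass over every assigned partition, collecting the ones
        // that pass a cheap check (the real benchmark exercises
        // SubscriptionState.fetchablePartitions).
        List<String> fetchable = new ArrayList<>(partitions.size());
        for (String tp : partitions)
            if (!tp.isEmpty())
                fetchable.add(tp);
        return fetchable.size();
    }
}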

@hachikuji (Contributor) left a comment

LGTM. I did not review the benchmark in depth, but it looks reasonable. For the future, I think we can consider smarter bookkeeping to avoid the need for these loops in the first place. For example, we could create an index in SubscriptionState keyed by the state of the partition so that we do not need a pass to discover the resetting/validating partitions.

For the particular case of validation, I am also looking forward to the improvements that are possible with the Fetch changes in KIP-595. Basically validation can be piggybacked on the Fetch API and we can avoid the need for a separate validating state.
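A rough sketch of the bookkeeping idea, with illustrative names (not the real SubscriptionState API): a secondary index keyed by fetch state lets the resetting/validating partitions be looked up directly instead of scanning every assigned partition.

import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class StateIndexedAssignment {
    enum FetchState { FETCHING, AWAIT_RESET, AWAIT_VALIDATION }

    private final Map<String, FetchState> assignment = new LinkedHashMap<>();
    private final Map<FetchState, Set<String>> byState = new EnumMap<>(FetchState.class);

    StateIndexedAssignment() {
        for (FetchState s : FetchState.values())
            byState.put(s, new LinkedHashSet<>());
    }

    // Keeps the per-state index in sync on every transition.
    synchronized void transition(String partition, FetchState newState) {
        FetchState old = assignment.put(partition, newState);
        if (old != null)
            byState.get(old).remove(partition);
        byState.get(newState).add(partition);
    }

    // Direct lookup; no pass over the whole assignment is needed.
    synchronized Set<String> partitionsNeedingValidation() {
        return byState.get(FetchState.AWAIT_VALIDATION);
    }
}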

return false;
}
}
return true;
@ijuma (Member) commented:

@mumrah I now understand why the previous version was slower: it was allocating a PartitionState instance per element. But we only use the value here, so we could still use allMatch without the performance penalty.
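A small sketch of that point, with simplified stand-in types: streaming over the map values directly keeps the declarative allMatch form without allocating a wrapper object per element.

import java.util.LinkedHashMap;
import java.util.Map;

class HasAllFetchPositionsSketch {
    static class TopicPartitionState {
        boolean hasPosition;
        boolean hasValidPosition() { return hasPosition; }
    }

    private final Map<String, TopicPartitionState> assignment = new LinkedHashMap<>();

    synchronized boolean hasAllFetchPositions() {
        // Iterates the values directly, so no per-element wrapper is created.
        return assignment.values().stream().allMatch(TopicPartitionState::hasValidPosition);
    }
}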

@mumrah merged commit 1a96974 into apache:trunk on Aug 21, 2020
FrankYang0529 pushed a commit that referenced this pull request Aug 20, 2025
Remove unused PartitionState. It was unused after #7222.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, PoAn Yang <payang@apache.org>
