Change Kafka Lookup Extractor to not register consumer group #12842
Conversation
The current behaviour of the Kafka lookup extractor is to assign a unique ID to the consumer group and set `auto.offset.reset` to `earliest`, so that previously committed offsets are never reused. This does the job, but it also pollutes the Kafka broker with a bunch of "ghost" consumer groups that will never be used again. To fix this, we now set `enable.auto.commit` to `false`, which prevents the ghost consumer groups from being created in the first place.
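For concreteness, the difference amounts to something like the following consumer-properties sketch, using the standard Kafka client's `ConsumerConfig` constants. This is an illustration of the idea, not the exact Druid code:

```java
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.consumer.ConsumerConfig;

public class ConsumerConfigSketch {
    public static void main(String[] args) {
        // Before (sketch): a fresh group.id on every startup forced a full
        // re-read, but each restart left a never-reused "ghost" group behind.
        Properties before = new Properties();
        before.setProperty(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString());
        before.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        // After (sketch): offsets are simply never committed, so there is
        // nothing to "forget" and no ghost consumer group accumulates.
        Properties after = new Properties();
        after.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        after.setProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
    }
}
```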
Provide some additional detail on functionality and configuration. Hopefully this will make it clearer how the extractor works for developers who aren't so familiar with Kafka.
The extractor works by consuming the configured Kafka topic from the beginning, and appending every record to an internal map. The key of the Kafka record is used as the key of the map, and the payload of the record is used as the value. At query time, a lookup can be used to transform the key into the associated value. See [lookups](../../querying/lookups.md) for how to configure and use lookups in a query. Keys and values are both stored as strings by the lookup extractor.

The extractor remains subscribed to the topic, so new records are added to the lookup map as they appear. This allows lookup values to be updated in near-realtime. If two records are added to the topic with the same key, the record with the larger offset will replace the previous record in the lookup map. A record with a `null` payload will be treated as a tombstone record, and the associated key will be removed from the lookup map.
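For readers less familiar with Kafka, the consume-into-a-map pattern described above looks roughly like the sketch below, written against the plain Kafka consumer API. The topic name and broker address are placeholders, and this is an illustration of the pattern, not Druid's actual `KafkaLookupExtractorFactory` implementation:

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LookupMapSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Never commit offsets; with assign() below, no consumer group is registered at all.
        props.setProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        Map<String, String> lookupMap = new ConcurrentHashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign all partitions directly and rewind to the beginning,
            // so the whole topic is replayed into the map on every start.
            List<TopicPartition> partitions = consumer.partitionsFor("lookup-topic").stream()
                .map(p -> new TopicPartition(p.topic(), p.partition()))
                .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    if (record.value() == null) {
                        // Tombstone: a null payload removes the key from the map.
                        lookupMap.remove(record.key());
                    } else {
                        // Larger offsets for the same key overwrite earlier values.
                        lookupMap.put(record.key(), record.value());
                    }
                }
            }
        }
    }
}
```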
The PR that has the code to remove a payload with a `null` message is yet to be merged.
Yes, this change will need to wait for the PR to be merged. I will resolve any merge conflicts when that happens.
Thank you for these changes. Can you look into the CI failures? They look like legit failures.
Description
The Kafka lookup extractor has to consume an entire topic from the beginning in order to build the internal lookup map. Previously, the extractor would always use a randomly generated Kafka `group.id`. This meant that the service would register a new consumer group every time it started, essentially "forgetting" its previously committed consumer offsets. This guarantees that the service will always consume the entire topic.

This has the unintended side-effect of also leaving a lot of "ghost" consumers registered with the Kafka cluster. These consumer groups will never be used again, so they just hang around on the broker until Kafka decides to delete them (by default, after 2 days). This needlessly adds bloat to the Kafka broker.
This has been fixed by setting the Kafka consumer config `enable.auto.commit` to `false`. This means that the consumer never attempts to commit offsets, achieving the same result as before without leaving a bunch of "ghost" consumer groups registered on the broker.

I also took the chance to flesh out the documentation a whole bunch.
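One way to sanity-check the fix is to list the broker's consumer groups across extractor restarts; with the change in place, the list should no longer grow with randomly named groups. A sketch using the Kafka `AdminClient` (broker address is a placeholder):

```java
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class ListGroupsSketch {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Print every consumer group currently registered on the broker.
            admin.listConsumerGroups().all().get()
                 .forEach(group -> System.out.println(group.groupId()));
        }
    }
}
```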
Key changed/added classes in this PR
`org.apache.druid.query.lookup.KafkaLookupExtractorFactory`
This PR has: