feat(plugin-server): Preserve distinct ID locality on overflow rerouting #20945
Conversation
```diff
@@ -255,7 +255,7 @@ async function emitToOverflow(queue: IngestionConsumer, kafkaMessages: Message[]
     queue.pluginsServer.kafkaProducer.produce({
         topic: KAFKA_EVENTS_PLUGIN_INGESTION_OVERFLOW,
         value: message.value,
-        key: null, // No locality guarantees in overflow
+        key: message.key,
```
`message.key` can be empty if capture-side overflow detection triggers.

As we want to evaluate impact and probably don't want to invest a lot for now, I'm fine with:
- checking how much capture-side detection triggers (on both new and old capture) vs plugin-server-side
- disabling capture-side detection on both captures for now while we evaluate this

Last Monday's incident showed us that we can read & unmarshall really fast (1.6M/minute with 8 historical pods dropping on token), so capture-side might not really be necessary.

The alternative would be to re-compute a key if missing, but then that's a third copy of that code to maintain, and I'd rather avoid it.
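For context, a minimal sketch of what that "third copy" of key computation could look like. The `${token}:${distinct_id}` key format, the JSON payload shape, and the `computeOverflowKey` helper are assumptions for illustration, not the actual capture or plugin-server code:

```typescript
import { Message } from 'node-rdkafka'

// Hypothetical sketch: derive a locality key for messages that arrive without one.
// Assumes the key convention is `${token}:${distinct_id}`; the real code may differ.
function computeOverflowKey(message: Message): Buffer | string | null {
    if (message.key) {
        return message.key
    }
    if (!message.value) {
        return null
    }
    try {
        const event = JSON.parse(message.value.toString())
        if (event.token && event.distinct_id) {
            return Buffer.from(`${event.token}:${event.distinct_id}`)
        }
    } catch {
        // Unparseable payload: fall back to no key (random partitioning).
    }
    return null
}
```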
Going to hold off on merging this for a bit — I had assumed there was already a method to turn off overflow routing wholesale on both captures but it doesn't look like that was a valid assumption for me to make. Seems like it'd make sense to take care of that first.
It can be turned off in python, but you are right that it cannot be turned off in rust yet. I'd put very high thresholds in the rust config while working on a PR to add a boolean config to completely disable it.
BTW, this capture-side detection is kind of a scalability time bomb as its memory usage is O(active distinct_id), so it needs to eventually be phased out anyway.
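To illustrate the memory concern with a toy model (this is not the actual capture implementation): any per-distinct_id detector has to hold an entry for every recently active distinct_id, so memory grows linearly with active IDs.

```typescript
// Illustrative only: a naive overflow detector keyed by distinct_id.
// Memory is O(number of active distinct_ids), which is the scalability concern above.
class NaiveOverflowDetector {
    private counts = new Map<string, { windowStart: number; count: number }>()

    constructor(private limitPerWindow: number, private windowMs: number) {}

    isOverflowing(distinctId: string, now: number = Date.now()): boolean {
        const entry = this.counts.get(distinctId)
        if (!entry || now - entry.windowStart > this.windowMs) {
            this.counts.set(distinctId, { windowStart: now, count: 1 })
            return false
        }
        entry.count += 1
        return entry.count > this.limitPerWindow
    }
}
```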
> It can be turned off in python, but you are right that it cannot be turned off in rust yet.

Ah, I was thinking we'd want to bypass this entire conditional, but maybe that's overkill.

> BTW, this capture-side detection is kind of a scalability time bomb as its memory usage is O(active distinct_id), so it needs to eventually be phased out anyway.

Good point, I hadn't really considered that — thanks for mentioning it.
Good call, we'd need to skip the check against `LIKELY_ANONYMOUS_IDS` too.
Ability to turn off random partitioning completely in old capture is here: #21168 (My initial thought was to keep that PR separate, but in retrospect, I suppose this could have been part of this change too.)
Resolved review thread (outdated): plugin-server/src/main/ingestion-queues/batch-processing/each-batch-ingestion.ts
Well-written description. Seems like a win to me. In the future, with more reliability visibility, we should (auto) blackhole extremely loud distinct_ids to alleviate the risk of a single partition becoming a huge issue.
… reverse compatibility and no surprises during deploy
```diff
     ingestionOverflowingMessagesTotal.inc(kafkaMessages.length)
     await Promise.all(
         kafkaMessages.map((message) =>
             queue.pluginsServer.kafkaProducer.produce({
                 topic: KAFKA_EVENTS_PLUGIN_INGESTION_OVERFLOW,
                 value: message.value,
-                key: null, // No locality guarantees in overflow
+                key: useRandomPartitioner ? undefined : message.key,
```
There's a change here where we set it to `undefined` instead of `null`.
As far as I can tell, this was the issue.
The `produce` call here eventually forwards to a `HighLevelProducer` instance, which does some extra stuff with the `key` in `produce`. If the `key` ends up being `undefined`, it just quietly doesn't produce the message, and the callback is never called.
This is also the behavior that I saw testing this manually in the node REPL:
```js
> const { HighLevelProducer } = require('node-rdkafka')
undefined
> const p = new HighLevelProducer({'bootstrap.servers': 'kafka:9092'})
undefined
> p.connect()
// truncated
> p.produce('garbage-topic', null, Buffer.from('message'), 'key', undefined, (...a) => { console.log('callback:', a) })
undefined
> callback: [ null, 1 ]
> p.produce('garbage-topic', null, Buffer.from('message'), undefined, undefined, (...a) => { console.log('callback:', a) })
undefined
// nothing ever happens here
```
What I figure happened is that any message that should have been routed to overflow never resolved its promise, and the consumers simply stopped making forward progress once they saw a batch containing one of those messages.
The `key` property is typed as `MessageKey`, which does include `undefined`, and the `HighLevelProducer.produce` signature accepts `any`, which explains why this wasn't caught by the type checker.
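One way to guard against this would be to normalize the key before handing it to node-rdkafka. A rough sketch, assuming the six-argument `produce(topic, partition, message, key, timestamp, callback)` signature seen in the REPL session above; the `produceWithSafeKey` wrapper itself is hypothetical:

```typescript
import { HighLevelProducer, MessageKey } from 'node-rdkafka'

// Sketch of a defensive wrapper: coerce an undefined key to null so the message
// is still produced (with a random partition) instead of being silently dropped.
function produceWithSafeKey(
    producer: HighLevelProducer,
    topic: string,
    value: Buffer | null,
    key: MessageKey
): Promise<number | null | undefined> {
    return new Promise((resolve, reject) => {
        producer.produce(topic, null, value, key ?? null, Date.now(), (err, offset) => {
            if (err) {
                reject(err)
            } else {
                resolve(offset)
            }
        })
    })
}
```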
…low rerouting (#20945) Turned off by default for backwards compatibility for now.
Problem
When messages are rerouted from the main topic to the overflow topic, we've historically stripped them of their `key` so that they are uniformly distributed over all partitions within the overflow topic, rather than being targeted at a single partition. This is helpful for distributing CPU-intensive workloads or other workloads that are not reliant on shared state, as it allows the independently parallelizable aspects of the processing loop to be performed independently, resulting in overall throughput improvements.

However, the hotspots in our workload these days seem to typically be person property updates, which require row-level locks in Postgres to be held while performing row updates. This can cause problems when paired with messages that are uniformly distributed over all of the partitions in the overflow topic.
Changes
Preserve the message key when publishing to the overflow topic, so that semantic routing/locality is preserved over distinct IDs.
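For intuition on why keeping the key preserves locality: Kafka's default partitioner hashes the message key to choose a partition, so equal keys always land on the same partition. A toy illustration of that idea (the hash below is a simplified stand-in, not Kafka's actual murmur2 implementation):

```typescript
// Simplified illustration of key-based partitioning: identical keys always map
// to the same partition, so all events for a distinct_id stay together.
function partitionForKey(key: Buffer, numPartitions: number): number {
    // Toy hash; real Kafka clients use murmur2 (or similar) over the key bytes.
    let hash = 0
    for (const byte of key) {
        hash = (hash * 31 + byte) >>> 0
    }
    return hash % numPartitions
}

// All overflow messages for the same key land on one partition:
const key = Buffer.from('token:distinct_id_123')
console.log(partitionForKey(key, 64) === partitionForKey(key, 64)) // true
```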
This isn't a perfect solution; it does have its tradeoffs:
Advantages
Disadvantages
`retention.bytes` to avoid losing messages to retention if one partition is extremely overloaded)

Does this work well for both Cloud and self-hosted?
Yep!
How did you test this code?
Updated tests.