
ref(metrics): Partition and split metrics buckets just before sending #2682

Merged
Dav1dde merged 11 commits into master from ref/bucket-split on Nov 28, 2023

Conversation

Dav1dde
Member

@Dav1dde Dav1dde commented Oct 31, 2023

Moves metric partitioning and splitting into the envelope processor. Metrics are now sliced without any allocations.

Also track metric outcomes now instead of merging buckets back into the aggregator.

Tracking outcomes for metrics has the side effect that all Relay-generated metric envelopes are now properly tracked, which affects the accepted-envelope metric (relay.event.accepted). This will cause a considerable bump in that metric, but it does not indicate a change in the overall number of envelopes Relay accepts.
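
To make the "sliced without any allocations" idea concrete, here is a minimal sketch; the `BucketsView` type, its fields, and the example data are hypothetical illustrations rather than Relay's actual API. Splitting a borrowed view only moves slice boundaries, so no bucket data is cloned or re-allocated:

```rust
/// Hypothetical sketch of a borrowed view over flushed metric buckets.
/// Splitting the view only adjusts slice boundaries; the underlying
/// bucket data is never cloned or re-allocated.
struct BucketsView<'a, T> {
    inner: &'a [T],
}

impl<'a, T> BucketsView<'a, T> {
    fn new(inner: &'a [T]) -> Self {
        Self { inner }
    }

    /// Splits into two views at `index`; both halves borrow the same storage.
    fn split_at(&self, index: usize) -> (BucketsView<'a, T>, BucketsView<'a, T>) {
        let (left, right) = self.inner.split_at(index.min(self.inner.len()));
        (Self::new(left), Self::new(right))
    }

    fn len(&self) -> usize {
        self.inner.len()
    }
}

fn main() {
    let buckets = vec!["c:transactions/count", "d:spans/duration", "s:users/unique"];
    let view = BucketsView::new(&buckets);
    let (first, rest) = view.split_at(1);
    println!("{} bucket(s) in the first batch, {} remaining", first.len(), rest.len());
}
```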

@Dav1dde Dav1dde self-assigned this Oct 31, 2023
@Dav1dde Dav1dde force-pushed the ref/bucket-split branch 3 times, most recently from 5eab309 to 28e591e on November 2, 2023 15:31
@Dav1dde Dav1dde changed the title ref(metrics): Partition and split metrics buckets just before sending ref(metrics): Split metrics buckets just before sending Nov 2, 2023
@Dav1dde Dav1dde force-pushed the ref/bucket-split branch 2 times, most recently from 0ccce13 to 51e3fd3 on November 8, 2023 11:56
@Dav1dde Dav1dde force-pushed the ref/bucket-split branch 2 times, most recently from a7a6fd6 to ce5eaea on November 8, 2023 13:50
@Dav1dde Dav1dde changed the title ref(metrics): Split metrics buckets just before sending ref(metrics): Partition and split metrics buckets just before sending Nov 8, 2023
@Dav1dde Dav1dde force-pushed the ref/bucket-split branch 4 times, most recently from 1116f30 to 3a01dde on November 10, 2023 09:50
@Dav1dde Dav1dde marked this pull request as ready for review November 10, 2023 09:52
@Dav1dde Dav1dde requested a review from a team November 10, 2023 09:52
@Dav1dde Dav1dde force-pushed the ref/bucket-split branch 3 times, most recently from ea4fac9 to d8975a6 on November 10, 2023 10:22
@iker-barriocanal
Contributor

Any chance we can split this PR? 2,000+ LOC is complicated to follow in a review.

@TBS1996
Contributor

TBS1996 commented Nov 17, 2023

I tried to make a visualization.

Before:

graph TD
    Project -->|RateLimitBuckets| EnvelopeProcessorService
    Project -->|MergeBuckets| AggregatorService
    EnvelopeProcessorService -->|MergeBuckets| AggregatorService
    AggregatorService -->|FlushBuckets| ProjectCacheService
    ProjectCacheService -->|SendMetrics| EnvelopeManagerService
    EnvelopeManagerService -->|SendRequest| UpstreamRelay


After:

graph TD
    Project-->|RateLimitBuckets|EnvelopeProcessorService
    Project-->|MergeBuckets|AggregatorService
    EnvelopeProcessorService-->|MergeBuckets|AggregatorService
    AggregatorService -->|FlushBuckets| ProjectCacheService
    ProjectCacheService-->|SendMetrics|EnvelopeManagerService
    EnvelopeManagerService-->|EncodeMetrics|EnvelopeProcessor
    EnvelopeProcessor-->|SubmitEnvelope|EnvelopeManagerService



edit: your newest one:

graph TD
    Project-->|RateLimitBuckets|EnvelopeProcessorService
    Project-->|MergeBuckets|AggregatorService
    EnvelopeProcessorService-->|MergeBuckets|AggregatorService
    AggregatorService -->|FlushBuckets| ProjectCacheService
    ProjectCacheService-->|EncodeMetrics|EnvelopeProcessor
    EnvelopeProcessor-->|SubmitEnvelope|EnvelopeManagerService



Contributor

@TBS1996 TBS1996 left a comment


As iker said, if you're able to split this up, that would be greatly appreciated. It's hard to confidently approve a PR with so many changes.

@Dav1dde
Member Author

Dav1dde commented Nov 17, 2023

@TBS1996 @iker-barriocanal the PR is now 2 commits:

  1. Everything related to slicing buckets into multiple smaller buckets without re-allocating and cloning
  2. Wrapping Metrics in a ManagedEnvelope and producing outcomes (rough sketch below)
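
A rough sketch of the idea behind commit 2, using made-up types (`TrackedBuckets`, a toy `Outcome` enum) rather than Relay's actual `ManagedEnvelope` API: once buckets are handed off as an envelope, an outcome is recorded for them, so rejected buckets no longer need to be merged back into the aggregator.

```rust
/// Toy outcome type; Relay's real outcome categories are richer than this.
#[derive(Debug)]
enum Outcome {
    Accepted,
    RateLimited,
}

/// Hypothetical wrapper that tracks a batch of buckets until an outcome is
/// recorded, loosely mirroring the ManagedEnvelope idea from this PR.
struct TrackedBuckets {
    bucket_count: usize,
    outcome: Option<Outcome>,
}

impl TrackedBuckets {
    fn new(bucket_count: usize) -> Self {
        Self { bucket_count, outcome: None }
    }

    /// Record the final outcome for this batch; dropping `self` reports it.
    fn finish(mut self, outcome: Outcome) {
        self.outcome = Some(outcome);
    }
}

impl Drop for TrackedBuckets {
    fn drop(&mut self) {
        match self.outcome.take() {
            Some(outcome) => println!("outcome for {} bucket(s): {:?}", self.bucket_count, outcome),
            None => println!("{} bucket(s) dropped without an explicit outcome", self.bucket_count),
        }
    }
}

fn main() {
    TrackedBuckets::new(3).finish(Outcome::Accepted);
    // A rate-limited batch records a different outcome instead of being merged back:
    TrackedBuckets::new(5).finish(Outcome::RateLimited);
}
```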

@Dav1dde Dav1dde force-pushed the ref/bucket-split branch 2 times, most recently from e1c7b2e to da0ebea on November 17, 2023 15:30
@TBS1996
Contributor

TBS1996 commented Nov 21, 2023

I think the PR is looking very good; a lot of the code looks much cleaner now. It's difficult to keep everything in working memory, though, which makes it hard to approve. By the way, it helps if you leave some comments on the GitHub diff with short explanations, such as which functions you've changed and which you've just moved, or a small note when you've both moved and renamed a variable. It's not really necessary in smaller PRs, but here it would have been very helpful.

I'll probably approve soon. In the meantime you could ping someone else to take a look too; there should hopefully be more than one reviewer.

Contributor

@iker-barriocanal iker-barriocanal left a comment


This PR introduces some interesting logic, but I think it's trying to do too many things at once.

Slicing metrics without additional allocations is a good performance improvement. However:

  • It introduces a case that may result in panics. We should not introduce panics, and we should make our best effort to identify and mitigate them.
  • It adds a significant amount of complexity to fix a problem I'm not sure we have -- do we have a performance issue with metric bucket splitting?
  • I believe the previous implementation produces full metric buckets (except the last one), which no longer seems to be the case. I'm not sure what the impact of this change is, but sending more buckets over the wire may have higher performance implications than splitting buckets in memory.
  • The complexity (not just size) makes providing a meaningful review difficult.

Let me know if I'm missing or misunderstanding something.

Contributor

@olksdr olksdr left a comment


Overall looks good to me.

I would suggest avoiding the code which could panic; with that, I think we are good to go and can test it in production.

@Dav1dde
Member Author

Dav1dde commented Nov 23, 2023

@iker-barriocanal sorry for the late response; some points addressed:

This PR introduces some interesting logic, but I think it's trying to do too many things at once.

It's 2 things basically:

  • Make the splits not clone data anymore. For this to work, the next step must be serialization to JSON. Serialization and processing should have been in the envelope processor from the start for 2 reasons (see the sketch below):
    • Serialization is CPU heavy
    • Dropping the entire list of buckets is actually also quite CPU intensive
  • Collecting outcomes for metrics now. This is actually a late addition: we realized that merging back the buckets is not as easy anymore, because we're now sending full envelopes, and in order to merge back we would have to either keep the original metric buckets around or deserialize the envelope again; neither is desirable. The previous implementation kept a future open for each batch (!) until the envelope was sent, so this change was unfortunately a necessity.
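
To illustrate the serialization point, a toy sketch assuming `serde`/`serde_json` as dependencies and a made-up `Bucket` type: serializing a borrowed batch directly into the JSON payload keeps the CPU-heavy work in the processor and avoids cloning the buckets first.

```rust
use serde::Serialize;

/// Made-up stand-in for a metric bucket; the real type lives in relay-metrics.
#[derive(Serialize)]
struct Bucket {
    name: String,
    value: f64,
}

/// Serialize a borrowed batch of buckets straight to JSON bytes, ready to be
/// placed into an envelope item, without cloning the buckets.
fn encode_batch(batch: &[Bucket]) -> serde_json::Result<Vec<u8>> {
    serde_json::to_vec(batch)
}

fn main() -> serde_json::Result<()> {
    let buckets = vec![
        Bucket { name: "c:transactions/count".into(), value: 1.0 },
        Bucket { name: "d:spans/duration".into(), value: 17.5 },
    ];
    let payload = encode_batch(&buckets)?;
    println!("payload is {} bytes", payload.len());
    Ok(())
}
```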

There is a way to keep the splitting in the aggregator by reference counting the split buckets, but I feel like that introduces a bigger mental overhead, and it also does not logically belong in the aggregator.

  • It introduces a case that may result in panics. We should not introduce panics and do our best effort to identify and mitigate them.

Panics are gone now.

  • It has a significant amount of additional complexity to fix a problem I'm not sure we have -- do we have a performance issue on metric bucket splitting?

I do not have actual numbers on this; it all depends on the volume of data, which means that with fewer Relay instances the problem actually becomes worse.

  • I believe the previous implementation produces full metric buckets (but the last one), which seems to not be the case now. I'm not sure what the impact of this change is, but more buckets to send through the wire may have higher performance implications than splitting buckets in memory.

I am not sure what you mean by this; the logic should have stayed the same. If you split on the last bucket, there will also be a split bucket at the start of the next batch.
If there is a difference in logic, I consider it a bug.
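
As a toy, count-based illustration of that invariant (the real limit is size-based and these names are made up): every batch comes out full except possibly the last one, no matter where the splitting happens.

```rust
/// Split a slice into batches of at most `limit` items; only the final
/// batch may be partially filled.
fn split_into_batches<T>(items: &[T], limit: usize) -> Vec<&[T]> {
    assert!(limit > 0, "batch limit must be positive");
    items.chunks(limit).collect()
}

fn main() {
    let buckets: Vec<u32> = (0..10).collect();
    // Prints batch sizes 4, 4, 2: only the last batch is allowed to be partial.
    for batch in split_into_batches(&buckets, 4) {
        println!("batch of {}", batch.len());
    }
}
```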

@Dav1dde
Member Author

Dav1dde commented Nov 27, 2023

The roundtrip through the envelope manager is gone now. Metrics are sent directly to the processor instead.

@Dav1dde Dav1dde merged commit 75bf9c0 into master Nov 28, 2023
20 checks passed
@Dav1dde Dav1dde deleted the ref/bucket-split branch November 28, 2023 10:31
jan-auer added a commit that referenced this pull request Nov 29, 2023
* master: (27 commits)
  ref(metric-meta): Add metric for total incoming metric meta (#2784)
  feat(server): Return global config status for downstream requests (#2765)
  ref(processor): Create event processor sub-module with related code (#2779)
  fix(metrics): Temporarily restore previous configuration keys for bucket splitting (#2780)
  feat(metrics): Add source context to code locations (#2781)
  ref(processor): Split off profile processor code into separate sub-module (#2778)
  ref(processor): Split off replay processing code into separate sub-module (#2776)
  ref(metrics): Partition and split metrics buckets just before sending (#2682)
  feat(spans): Allow well-known path segments in resource URLs (#2770)
  ref(processor): Move user and client reports processing into separate submodule (#2772)
  ref(crons): Include message_type in kafka message (#2723)
  ref(spans): Split tag mapping in specific configs (#2773)
  release: 23.11.2
  feat(metric-meta): Normalize invalid metric names (#2769)
  feat(spans): Extract main_thread tag for spans (#2761)
  Add DE Deployments (#2746)
  ref(processor): Move sessions related code into separate sub-module (#2768)
  ref(metric-meta): Capture envelope payload in a sentry issue (#2767)
  release: 0.8.38
  ref(normalization): Restore span processing to transactionprocessor (#2764)
  ...