
ref(metrics): Partition and split metrics buckets just before sending #2682

Merged
Dav1dde merged 11 commits into master from ref/bucket-split on Nov 28, 2023

Conversation

Dav1dde
Member

@Dav1dde Dav1dde commented Oct 31, 2023

Moves metric partitioning and splitting into the envelope processor. Metrics are now sliced without any allocations.

Also track metric outcomes now instead of merging buckets back into the aggregator.

Tracking outcomes for metrics has the side effect that all Relay-generated metric envelopes are now properly tracked, which affects the accepted-envelope metric (relay.event.accepted). This will cause a considerable bump in that metric, but it does not indicate a change in the overall number of envelopes Relay accepts.
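
To make the "sliced without any allocations" idea concrete, here is a minimal sketch; the `BucketsView` type, its fields, and the example data are hypothetical illustrations rather than Relay's actual API. Splitting a borrowed view only moves slice boundaries, so no bucket data is cloned or re-allocated:

```rust
/// Hypothetical sketch of a borrowed view over flushed metric buckets.
/// Splitting the view only adjusts slice boundaries; the underlying
/// bucket data is never cloned or re-allocated.
struct BucketsView<'a, T> {
    inner: &'a [T],
}

impl<'a, T> BucketsView<'a, T> {
    fn new(inner: &'a [T]) -> Self {
        Self { inner }
    }

    /// Splits into two views at `index`; both halves borrow the same storage.
    fn split_at(&self, index: usize) -> (BucketsView<'a, T>, BucketsView<'a, T>) {
        let (left, right) = self.inner.split_at(index.min(self.inner.len()));
        (Self::new(left), Self::new(right))
    }

    fn len(&self) -> usize {
        self.inner.len()
    }
}

fn main() {
    let buckets = vec!["c:transactions/count", "d:spans/duration", "s:users/unique"];
    let view = BucketsView::new(&buckets);
    let (first, rest) = view.split_at(1);
    println!("{} bucket(s) in the first batch, {} remaining", first.len(), rest.len());
}
```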

@Dav1dde Dav1dde self-assigned this Oct 31, 2023
@Dav1dde Dav1dde force-pushed the ref/bucket-split branch 3 times, most recently from 5eab309 to 28e591e on November 2, 2023 15:31
@Dav1dde Dav1dde changed the title ref(metrics): Partition and split metrics buckets just before sending ref(metrics): Split metrics buckets just before sending Nov 2, 2023
@Dav1dde Dav1dde force-pushed the ref/bucket-split branch 2 times, most recently from 0ccce13 to 51e3fd3 on November 8, 2023 11:56
@Dav1dde Dav1dde force-pushed the ref/bucket-split branch 2 times, most recently from a7a6fd6 to ce5eaea on November 8, 2023 13:50
@Dav1dde Dav1dde changed the title ref(metrics): Split metrics buckets just before sending ref(metrics): Partition and split metrics buckets just before sending Nov 8, 2023
@Dav1dde Dav1dde force-pushed the ref/bucket-split branch 4 times, most recently from 1116f30 to 3a01dde on November 10, 2023 09:50
@Dav1dde Dav1dde marked this pull request as ready for review November 10, 2023 09:52
@Dav1dde Dav1dde requested a review from a team November 10, 2023 09:52
@Dav1dde Dav1dde force-pushed the ref/bucket-split branch 3 times, most recently from ea4fac9 to d8975a6 on November 10, 2023 10:22
@iker-barriocanal
Contributor

Any chance we can split this PR? 2,000+ LOC is complicated to follow in a review.

@TBS1996
Contributor

TBS1996 commented Nov 17, 2023

I tried to make a visualization.

Before:

graph TD
    Project -->|RateLimitBuckets| EnvelopeProcessorService
    Project -->|MergeBuckets| AggregatorService
    EnvelopeProcessorService -->|MergeBuckets| AggregatorService
    AggregatorService -->|FlushBuckets| ProjectCacheService
    ProjectCacheService -->|SendMetrics| EnvelopeManagerService
    EnvelopeManagerService -->|SendRequest| UpstreamRelay


After:

graph TD
    Project-->|RateLimitBuckets|EnvelopeProcessorService
    Project-->|MergeBuckets|AggregatorService
    EnvelopeProcessorService-->|MergeBuckets|AggregatorService
    AggregatorService -->|FlushBuckets| ProjectCacheService
    ProjectCacheService-->|SendMetrics|EnvelopeManagerService
    EnvelopeManagerService-->|EncodeMetrics|EnvelopeProcessor
    EnvelopeProcessor-->|SubmitEnvelope|EnvelopeManagerService



edit: your newest one:

graph TD
    Project-->|RateLimitBuckets|EnvelopeProcessorService
    Project-->|MergeBuckets|AggregatorService
    EnvelopeProcessorService-->|MergeBuckets|AggregatorService
    AggregatorService -->|FlushBuckets| ProjectCacheService
    ProjectCacheService-->|EncodeMetrics|EnvelopeProcessor
    EnvelopeProcessor-->|SubmitEnvelope|EnvelopeManagerService



Contributor

@TBS1996 TBS1996 left a comment


As iker said, if you're able to split this up, that would be greatly appreciated. It's hard to confidently approve a PR with so many changes.

@Dav1dde
Member Author

Dav1dde commented Nov 17, 2023

@TBS1996 @iker-barriocanal the PR is now 2 commits:

  1. Everything related to slicing buckets into multiple smaller buckets without re-allocating and cloning
  2. Wrapping Metrics in a ManagedEnvelope and producing outcomes (rough sketch below)
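
A rough sketch of the idea behind commit 2, using made-up types (`TrackedBuckets`, a toy `Outcome` enum) rather than Relay's actual `ManagedEnvelope` API: once buckets are handed off as an envelope, an outcome is recorded for them, so rejected buckets no longer need to be merged back into the aggregator.

```rust
/// Toy outcome type; Relay's real outcome categories are richer than this.
#[derive(Debug)]
enum Outcome {
    Accepted,
    RateLimited,
}

/// Hypothetical wrapper that tracks a batch of buckets until an outcome is
/// recorded, loosely mirroring the ManagedEnvelope idea from this PR.
struct TrackedBuckets {
    bucket_count: usize,
    outcome: Option<Outcome>,
}

impl TrackedBuckets {
    fn new(bucket_count: usize) -> Self {
        Self { bucket_count, outcome: None }
    }

    /// Record the final outcome for this batch; dropping `self` reports it.
    fn finish(mut self, outcome: Outcome) {
        self.outcome = Some(outcome);
    }
}

impl Drop for TrackedBuckets {
    fn drop(&mut self) {
        match self.outcome.take() {
            Some(outcome) => println!("outcome for {} bucket(s): {:?}", self.bucket_count, outcome),
            None => println!("{} bucket(s) dropped without an explicit outcome", self.bucket_count),
        }
    }
}

fn main() {
    TrackedBuckets::new(3).finish(Outcome::Accepted);
    // A rate-limited batch records a different outcome instead of being merged back:
    TrackedBuckets::new(5).finish(Outcome::RateLimited);
}
```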

@Dav1dde Dav1dde force-pushed the ref/bucket-split branch 2 times, most recently from e1c7b2e to da0ebea on November 17, 2023 15:30
@TBS1996
Contributor

TBS1996 commented Nov 21, 2023

I think the PR is looking very good; a lot of the code looks much cleaner now. It's difficult to keep everything in working memory, though, which makes it hard to approve. By the way, it helps if you leave some comments on the GitHub diff with short explanations, such as which functions you've changed and which you've just moved, or a small note when you've both moved and renamed a variable. It's not really necessary in smaller PRs, but here it would have been very helpful.

I'll probably approve soon. In the meantime you could ping someone else to take a look too; there should hopefully be more than one reviewer.

Contributor

@iker-barriocanal iker-barriocanal left a comment


This PR introduces some interesting logic, but I think it's trying to do too many things at once.

Slicing metrics without additional allocations is a good performance improvement. However:

  • It introduces a case that may result in panics. We should not introduce panics, and we should make our best effort to identify and mitigate them.
  • It adds a significant amount of complexity to fix a problem I'm not sure we have -- do we have a performance issue with metric bucket splitting?
  • I believe the previous implementation produces full metric buckets (except the last one), which no longer seems to be the case. I'm not sure what the impact of this change is, but sending more buckets over the wire may have higher performance implications than splitting buckets in memory.
  • The complexity (not just size) makes providing a meaningful review difficult.

Let me know if I'm missing or misunderstanding something.

Contributor

@olksdr olksdr left a comment


Overall looks good to me.

I would suggest avoiding the code which could panic; with that, I think we are good to go and can test it in production.

@Dav1dde
Member Author

Dav1dde commented Nov 23, 2023

@iker-barriocanal sorry for the late response; some points addressed:

This PR introduces some interesting logic, but I think it's trying to do too many things at once.

It's 2 things basically:

  • Make the splits not clone data anymore. For this to work, the next step must be serialization to JSON. Serialization and processing should have been in the envelope processor from the start for 2 reasons (see the sketch below):
    • Serialization is CPU heavy
    • Dropping the entire list of buckets is actually also quite CPU intensive
  • Collecting outcomes for metrics now. This is actually a late addition: we realized that merging back the buckets is not as easy anymore, because we're now sending full envelopes, and in order to merge back we would have to either keep the original metric buckets around or deserialize the envelope again; neither is desirable. The previous implementation kept a future open for each batch (!) until the envelope was sent, so this change was unfortunately a necessity.
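
To illustrate the serialization point, a toy sketch assuming `serde`/`serde_json` as dependencies and a made-up `Bucket` type: serializing a borrowed batch directly into the JSON payload keeps the CPU-heavy work in the processor and avoids cloning the buckets first.

```rust
use serde::Serialize;

/// Made-up stand-in for a metric bucket; the real type lives in relay-metrics.
#[derive(Serialize)]
struct Bucket {
    name: String,
    value: f64,
}

/// Serialize a borrowed batch of buckets straight to JSON bytes, ready to be
/// placed into an envelope item, without cloning the buckets.
fn encode_batch(batch: &[Bucket]) -> serde_json::Result<Vec<u8>> {
    serde_json::to_vec(batch)
}

fn main() -> serde_json::Result<()> {
    let buckets = vec![
        Bucket { name: "c:transactions/count".into(), value: 1.0 },
        Bucket { name: "d:spans/duration".into(), value: 17.5 },
    ];
    let payload = encode_batch(&buckets)?;
    println!("payload is {} bytes", payload.len());
    Ok(())
}
```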

There is a way to keep the splitting in the aggregator by reference counting the split buckets, but I feel like that introduces a bigger mental overhead, and it also does not logically belong in the aggregator.

  • It introduces a case that may result in panics. We should not introduce panics and do our best effort to identify and mitigate them.

Panics are gone now.

  • It has a significant amount of additional complexity to fix a problem I'm not sure we have -- do we have a performance issue on metric bucket splitting?

I do not have actual numbers on this; it all depends on the volume of data, which means that with fewer Relay instances the problem actually becomes worse.

  • I believe the previous implementation produces full metric buckets (but the last one), which seems to not be the case now. I'm not sure what the impact of this change is, but more buckets to send through the wire may have higher performance implications than splitting buckets in memory.

I am not sure what you mean by this; the logic should have stayed the same. If you split on the last bucket, there will also be a split bucket at the start of the next batch.
If there is a difference in logic, I consider it a bug.
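
As a toy, count-based illustration of that invariant (the real limit is size-based and these names are made up): every batch comes out full except possibly the last one, no matter where the splitting happens.

```rust
/// Split a slice into batches of at most `limit` items; only the final
/// batch may be partially filled.
fn split_into_batches<T>(items: &[T], limit: usize) -> Vec<&[T]> {
    assert!(limit > 0, "batch limit must be positive");
    items.chunks(limit).collect()
}

fn main() {
    let buckets: Vec<u32> = (0..10).collect();
    // Prints batch sizes 4, 4, 2: only the last batch is allowed to be partial.
    for batch in split_into_batches(&buckets, 4) {
        println!("batch of {}", batch.len());
    }
}
```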

@Dav1dde
Member Author

Dav1dde commented Nov 27, 2023

The roundtrip through the envelope manager is gone now. Metrics are sent directly to the processor instead.

@Dav1dde Dav1dde merged commit 75bf9c0 into master Nov 28, 2023
20 checks passed
@Dav1dde Dav1dde deleted the ref/bucket-split branch November 28, 2023 10:31
jan-auer added a commit that referenced this pull request Nov 29, 2023
* master: (27 commits)
  ref(metric-meta): Add metric for total incoming metric meta (#2784)
  feat(server): Return global config status for downstream requests (#2765)
  ref(processor): Create event processor sub-module with related code (#2779)
  fix(metrics): Temporarily restore previous configuration keys for bucket splitting (#2780)
  feat(metrics): Add source context to code locations (#2781)
  ref(processor): Split off profile processor code into separate sub-module (#2778)
  ref(processor): Split off replay processing code into separate sub-module (#2776)
  ref(metrics): Partition and split metrics buckets just before sending (#2682)
  feat(spans): Allow well-known path segments in resource URLs (#2770)
  ref(processor): Move user and client reports processing into separate submodule (#2772)
  ref(crons): Include message_type in kafka message (#2723)
  ref(spans): Split tag mapping in specific configs (#2773)
  release: 23.11.2
  feat(metric-meta): Normalize invalid metric names (#2769)
  feat(spans): Extract main_thread tag for spans (#2761)
  Add DE Deployments (#2746)
  ref(processor): Move sessions related code into separate sub-module (#2768)
  ref(metric-meta): Capture envelope payload in a sentry issue (#2767)
  release: 0.8.38
  ref(normalization): Restore span processing to transactionprocessor (#2764)
  ...