
[processor/lsminterval] Define cardinality limits and handle overflows #235

Merged
@lahsivjar merged 41 commits into elastic:main from the lsminterval-limits branch on Jan 8, 2025

Conversation

@lahsivjar (Contributor) commented on Nov 30, 2024

Related to: #141

For more details, see the discussion in #141 (comment).

@lahsivjar changed the title from "Lsminterval limits" to "[processor/lsminterval] Define cardinality limits and handle overflows" on Nov 30, 2024
processor/lsmintervalprocessor/config/config.go (two review threads, resolved)
Comment on lines 361 to 362
// Metrics doesn't have overflows (only datapoints have)
// Clone it *without* the datapoint data

@axw (Member) commented:

Doesn't this mean we risk OOM due to high cardinality metric names? Should the limit be on time series (metric + dimensions) per scope, rather than data points per scope? Or metrics per scope and time series per metric?

@axw (Member) left a comment

At a high level I think the approach looks good. As mentioned in a comment, I think we should have limits on all parts of the hierarchy, including metrics.

I ran the benchmarks, and it seems there's quite a drop in throughput.

processor/lsmintervalprocessor/processor.go (two review threads, resolved)
processor/lsmintervalprocessor/internal/merger/model.go (review thread resolved)
@lahsivjar (Contributor, Author) commented:

> I ran the benchmarks, and it seems there's quite a drop in throughput.

Yeah, there is a considerable overhead. This is what I am working on right now -- to improve performance and drop some of the unneeded complexity around attribute matching and filtering.

@lahsivjar marked this pull request as ready for review on December 10, 2024 19:26
@lahsivjar requested a review from a team as a code owner on December 10, 2024 19:26
@lahsivjar (Contributor, Author) commented on Dec 10, 2024

@axw I have addressed all of the comments other than one and am marking this ready for review now. The performance concern is still there; I have a few threads to chase down for improvements, but I expect them to give only minor gains, so if you have any ideas (even rough ones) I will be more than happy to chase them down. One of the sources of the extra overhead is creating the lookup maps when unmarshaling the binary into the Go struct -- previously I did this only when required, but now I always do it to simplify overflow handling a bit.

I have left adding metrics limits to a follow-up PR. I will also add a few more detailed overflow tests.
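
For readers following the overflow-handling discussion, here is a minimal sketch of the kind of cardinality tracking involved: keys within the configured limit are tracked individually, while anything beyond it is folded into an overflow bucket whose distinct count is estimated with HyperLogLog (the axiomhq/hyperloglog package referenced later in this thread). The `Tracker` shape and field names below are hypothetical, not the code in this PR.

```go
// Hypothetical sketch of a per-level cardinality limit tracker; not the
// actual implementation in this PR. Keys under the limit keep their own
// identity; overflowed keys are only counted (approximately) via HyperLogLog.
package main

import (
	"fmt"

	"github.com/axiomhq/hyperloglog"
)

type Tracker struct {
	maxCardinality int
	observed       map[string]struct{} // keys tracked within the limit
	overflow       *hyperloglog.Sketch // estimated cardinality of overflowed keys
}

func NewTracker(maxCardinality int) *Tracker {
	return &Tracker{
		maxCardinality: maxCardinality,
		observed:       make(map[string]struct{}),
	}
}

// CheckOverflow reports whether key must be routed to the overflow bucket
// instead of being tracked as its own entry.
func (t *Tracker) CheckOverflow(key string) bool {
	if _, ok := t.observed[key]; ok {
		return false // already tracked within the limit
	}
	if len(t.observed) < t.maxCardinality {
		t.observed[key] = struct{}{}
		return false
	}
	// Past the limit: record the key only in the HLL sketch so the overflow
	// bucket can still report an estimated count of distinct dropped entries.
	if t.overflow == nil {
		t.overflow = hyperloglog.New14()
	}
	t.overflow.Insert([]byte(key))
	return true
}

func main() {
	tr := NewTracker(2)
	for _, k := range []string{"a", "b", "c", "d", "c"} {
		fmt.Println(k, "overflow:", tr.CheckOverflow(k))
	}
	fmt.Println("estimated overflowed keys:", tr.overflow.Estimate())
}
```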

@lahsivjar requested a review from axw on December 10, 2024 20:19
@axw (Member) left a comment

Preliminary comments, ran out of time today

@axw (Member) left a comment

Looks good overall, mostly minor comments/questions

processor/lsmintervalprocessor/internal/merger/value.go (nine review threads, resolved)
@lahsivjar (Contributor, Author) commented:

Bringing this back to draft as I have a few good ideas for optimization, and combined with @axw's comments above I have a few more things to do here.

(I have also pushed a different way to encode limit trackers, independent of the pmetric data structures.)

@lahsivjar marked this pull request as ready for review on December 24, 2024 12:48
@lahsivjar requested a review from axw on December 24, 2024 12:49
@lahsivjar (Contributor, Author) commented on Dec 24, 2024

After the above changes, below is the benchmark diff from main:

Benchmark diff from `main`
name                                            old time/op    new time/op    delta
Aggregation/sum_cumulative-10                     10.0µs ± 5%    12.2µs ± 5%  +22.80%  (p=0.008 n=5+5)
Aggregation/sum_delta-10                          10.0µs ± 3%    12.3µs ± 2%  +23.51%  (p=0.008 n=5+5)
Aggregation/histogram_cumulative-10               10.1µs ± 3%    12.8µs ± 4%  +26.10%  (p=0.008 n=5+5)
Aggregation/histogram_delta-10                    10.4µs ± 2%    12.9µs ± 2%  +24.41%  (p=0.008 n=5+5)
Aggregation/exphistogram_cumulative-10            11.1µs ± 2%    12.7µs ± 1%  +14.86%  (p=0.008 n=5+5)
Aggregation/exphistogram_delta-10                 11.2µs ± 2%    13.4µs ± 8%  +19.06%  (p=0.008 n=5+5)
Aggregation/summary_enabled-10                    10.3µs ± 1%    13.2µs ± 4%  +28.69%  (p=0.008 n=5+5)
Aggregation/summary_passthrough-10                1.30µs ± 2%    1.39µs ± 6%   +6.82%  (p=0.048 n=5+5)
AggregationWithOTTL/sum_cumulative-10             12.0µs ± 3%    14.4µs ± 6%  +19.39%  (p=0.008 n=5+5)
AggregationWithOTTL/sum_delta-10                  12.2µs ± 4%    14.3µs ± 4%  +16.75%  (p=0.008 n=5+5)
AggregationWithOTTL/histogram_cumulative-10       12.5µs ± 0%    14.4µs ± 7%  +15.09%  (p=0.016 n=4+5)
AggregationWithOTTL/histogram_delta-10            11.8µs ± 7%    15.2µs ± 6%  +28.66%  (p=0.008 n=5+5)
AggregationWithOTTL/exphistogram_cumulative-10    13.2µs ± 6%    14.3µs ± 5%   +8.42%  (p=0.016 n=5+5)
AggregationWithOTTL/exphistogram_delta-10         13.2µs ± 4%    14.9µs ± 1%  +13.16%  (p=0.008 n=5+5)
AggregationWithOTTL/summary_enabled-10            12.7µs ± 5%    14.1µs ± 3%  +11.16%  (p=0.008 n=5+5)
AggregationWithOTTL/summary_passthrough-10        1.40µs ± 5%    1.32µs ± 3%   -5.87%  (p=0.032 n=5+5)

name                                            old alloc/op   new alloc/op   delta
Aggregation/sum_cumulative-10                     17.6kB ± 3%    23.8kB ± 1%  +35.18%  (p=0.016 n=5+4)
Aggregation/sum_delta-10                          17.6kB ± 1%    23.9kB ± 1%  +35.23%  (p=0.008 n=5+5)
Aggregation/histogram_cumulative-10               19.4kB ± 2%    24.7kB ± 5%  +27.10%  (p=0.008 n=5+5)
Aggregation/histogram_delta-10                    19.4kB ± 1%    24.7kB ± 5%  +27.42%  (p=0.008 n=5+5)
Aggregation/exphistogram_cumulative-10            21.7kB ± 0%    24.2kB ± 1%  +11.70%  (p=0.008 n=5+5)
Aggregation/exphistogram_delta-10                 22.2kB ± 1%    24.9kB ± 1%  +12.11%  (p=0.008 n=5+5)
Aggregation/summary_enabled-10                    18.5kB ± 1%    21.9kB ± 4%  +18.36%  (p=0.016 n=4+5)
Aggregation/summary_passthrough-10                1.38kB ± 0%    1.38kB ± 0%   +0.52%  (p=0.032 n=5+5)
AggregationWithOTTL/sum_cumulative-10             20.2kB ± 1%    23.6kB ± 2%  +16.63%  (p=0.008 n=5+5)
AggregationWithOTTL/sum_delta-10                  20.2kB ± 1%    23.7kB ± 1%  +17.35%  (p=0.008 n=5+5)
AggregationWithOTTL/histogram_cumulative-10       22.1kB ± 3%    26.9kB ± 2%  +21.82%  (p=0.008 n=5+5)
AggregationWithOTTL/histogram_delta-10            21.9kB ± 2%    26.5kB ± 0%  +20.91%  (p=0.008 n=5+5)
AggregationWithOTTL/exphistogram_cumulative-10    23.0kB ± 5%    26.8kB ± 1%  +16.58%  (p=0.008 n=5+5)
AggregationWithOTTL/exphistogram_delta-10         23.6kB ± 3%    27.5kB ± 1%  +16.38%  (p=0.008 n=5+5)
AggregationWithOTTL/summary_enabled-10            20.7kB ± 4%    24.3kB ± 2%  +17.67%  (p=0.008 n=5+5)
AggregationWithOTTL/summary_passthrough-10        1.38kB ± 1%    1.38kB ± 0%     ~     (p=0.159 n=5+5)

name                                            old allocs/op  new allocs/op  delta
Aggregation/sum_cumulative-10                        199 ± 1%       235 ± 0%  +18.41%  (p=0.008 n=5+5)
Aggregation/sum_delta-10                             200 ± 0%       237 ± 0%  +18.06%  (p=0.008 n=5+5)
Aggregation/histogram_cumulative-10                  211 ± 1%       246 ± 0%  +16.67%  (p=0.008 n=5+5)
Aggregation/histogram_delta-10                       213 ± 0%       248 ± 0%  +16.34%  (p=0.008 n=5+5)
Aggregation/exphistogram_cumulative-10               228 ± 0%       264 ± 0%  +15.79%  (p=0.008 n=5+5)
Aggregation/exphistogram_delta-10                    237 ± 0%       273 ± 0%  +15.19%  (p=0.016 n=5+4)
Aggregation/summary_enabled-10                       212 ± 0%       248 ± 0%  +16.98%  (p=0.029 n=4+4)
Aggregation/summary_passthrough-10                  37.0 ± 0%      37.0 ± 0%     ~     (all equal)
AggregationWithOTTL/sum_cumulative-10                227 ± 0%       263 ± 0%  +15.66%  (p=0.008 n=5+5)
AggregationWithOTTL/sum_delta-10                     229 ± 0%       264 ± 0%  +15.28%  (p=0.016 n=4+5)
AggregationWithOTTL/histogram_cumulative-10          244 ± 1%       281 ± 2%  +15.33%  (p=0.008 n=5+5)
AggregationWithOTTL/histogram_delta-10               246 ± 1%       279 ± 2%  +13.07%  (p=0.008 n=5+5)
AggregationWithOTTL/exphistogram_cumulative-10       256 ± 0%       292 ± 0%  +14.06%  (p=0.029 n=4+4)
AggregationWithOTTL/exphistogram_delta-10            265 ± 0%       301 ± 0%  +13.58%  (p=0.016 n=4+5)
AggregationWithOTTL/summary_enabled-10               240 ± 0%       275 ± 0%  +14.58%  (p=0.008 n=5+5)
AggregationWithOTTL/summary_passthrough-10          37.0 ± 0%      37.0 ± 0%     ~     (all equal)

TL;DR, there is still ~20% performance degradation with overflows. We could eliminate the overflow path when it is not defined but that will only give us a false sense of improvement since, for our use case, we would always have overflows defined.

I have another PR open which optimizes the Pebble options for our use case, and that will bring improvements to this PR too, but the diff between with and without overflow will still be approximately the same.
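
For context on how a table like the one above is typically produced: the format matches benchstat output (golang.org/x/perf), comparing repeated runs of the Go benchmarks on main against this branch. A minimal, hypothetical benchmark sketch follows; the benchmark name and body are illustrative, not the actual benchmarks in this repository.

```go
// Illustrative only: a Go benchmark shaped like the ones compared above,
// placed in a *_test.go file. The time/op, alloc/op, and allocs/op columns
// in the table come from repeated runs of benchmarks like this.
//
// A typical comparison workflow (shown here as comments):
//   go test -run=NONE -bench=. -count=5 ./... > old.txt   # on main
//   go test -run=NONE -bench=. -count=5 ./... > new.txt   # on this branch
//   benchstat old.txt new.txt                             # prints the diff table
package aggregation

import "testing"

func BenchmarkAggregation(b *testing.B) {
	// Sub-benchmarks produce names like "Aggregation/sum_delta" in the output.
	b.Run("sum_delta", func(b *testing.B) {
		b.ReportAllocs() // populates the alloc/op and allocs/op columns
		for i := 0; i < b.N; i++ {
			aggregateOnce() // hypothetical unit of work being measured
		}
	})
}

// aggregateOnce stands in for the aggregation work under test.
func aggregateOnce() {}
```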

@axw (Member) left a comment

Getting back up to speed after PTO - mostly minor comments

}

// Marshal marshals the tracker to a byte slice.
func (t *Tracker) Marshal() ([]byte, error) {

@axw (Member) commented:

Should this also be AppendBinary? Then we could avoid a slice allocation & copy in Value.Marshal

@lahsivjar (Contributor, Author) replied on Jan 6, 2025:

Sounds good. We won't be able to fully utilize this though since it is not possible to estimate the size of the tracker prior to marshaling, so we could still end up reallocating the slice due to axiomhq/hyperloglog#44

@lahsivjar (Contributor, Author) replied:

I tried this out, and the resulting code to encode trackers gets a bit messy because there is no way to calculate the marshaled size beforehand. Maybe we could defer this until we have AppendBinary in the HLL library (the PR you created) and a way to estimate the size?

@axw (Member) replied:

Yeah no worries, we can follow up on this
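
For reference, here is a minimal sketch of the API shape being discussed above: an append-style encoder lets the caller supply and reuse a buffer instead of allocating a fresh slice per call. The `Tracker` contents below are hypothetical, and as noted in the thread the real tracker's encoded size cannot be computed up front.

```go
// Sketch of the two encoder shapes discussed in this thread; the Tracker
// contents are hypothetical. Marshal allocates a new slice each call, while
// AppendBinary appends to a caller-provided buffer so a caller such as
// Value.Marshal can reuse one allocation across several trackers.
package tracker

import "encoding/binary"

type Tracker struct {
	observedCount uint64
}

// Marshal allocates and returns a fresh byte slice.
func (t *Tracker) Marshal() ([]byte, error) {
	return t.AppendBinary(nil)
}

// AppendBinary appends the encoded tracker to b and returns the extended
// slice, avoiding the extra allocation and copy when b already has capacity.
func (t *Tracker) AppendBinary(b []byte) ([]byte, error) {
	return binary.AppendUvarint(b, t.observedCount), nil
}
```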

@lahsivjar (Contributor, Author) commented on Jan 6, 2025

@axw I was already planning to add resource-datapoint limits, but now I am questioning whether scope limits and scope-datapoint limits are required or even useful. Maybe just having resource limits and resource-datapoint limits is enough for all use cases? Do you see a valid case where scope limits would be useful?

@axw (Member) commented on Jan 7, 2025

> @axw I was already planning to add resource-datapoint limits, but now I am questioning whether scope limits and scope-datapoint limits are required or even useful. Maybe just having resource limits and resource-datapoint limits is enough for all use cases? Do you see a valid case where scope limits would be useful?

I don't think that we would encounter high-cardinality scopes, but it may still be useful to have an explicit limit for scopes independent of metrics/datapoints, to prevent high-cardinality metric names or dimensions from causing scopes to be washed out. That's the same reason why I think we should have a metric limit.

So what I have in mind for limits is to follow the hierarchy: resource, resource-scope, resource-scope-metric, resource-scope-metric-datapoint.

Why were you planning to add a resource-datapoint limit?

@axw (Member) left a comment

Looks good now, thanks for the updates. Let's resolve the discussion about limits and then get it in :)

@lahsivjar (Contributor, Author) commented:

> So what I have in mind for limits is to follow the hierarchy: resource, resource-scope, resource-scope-metric, resource-scope-metric-datapoint.

This also sounds good to me.

> Why were you planning to add a resource-datapoint limit?

The main intention was to cater to cases where we are not sure how many scope metrics can be within a scope but don't want to waste the limit either, i.e. if we expect the number of scope metrics to have high variance, we could set just a resource-datapoint limit and avoid losing the unutilized limits for resources with a lower number of scopes... but now that I think about it, I am not sure scope metrics are meant to be high cardinality - maybe I got ahead of the problem without checking that it is actually a problem 😓.

@axw (Member) commented on Jan 8, 2025

> So what I have in mind for limits is to follow the hierarchy: resource, resource-scope, resource-scope-metric, resource-scope-metric-datapoint.
>
> This also sounds good to me.

Do you want to do that in this PR, or in a followup?

@lahsivjar (Contributor, Author) replied:

> Do you want to do that in this PR, or in a followup?

I will do it in a followup
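
For illustration, the agreed hierarchy could be expressed as nested limit configuration along the following lines. This is a hypothetical sketch of the follow-up, not the processor's actual config surface; all type and field names are assumptions.

```go
// Illustrative only: one way to express per-level cardinality limits
// following the resource -> scope -> metric -> datapoint hierarchy agreed
// above. Names and tags are hypothetical.
package config

type LimitConfig struct {
	MaxCardinality int64 `mapstructure:"max_cardinality"`
}

type Limits struct {
	Resource  LimitConfig `mapstructure:"resource"`  // distinct resources per interval
	Scope     LimitConfig `mapstructure:"scope"`     // distinct scopes per resource
	Metric    LimitConfig `mapstructure:"metric"`    // distinct metrics per scope
	Datapoint LimitConfig `mapstructure:"datapoint"` // distinct datapoints per metric
}
```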

@lahsivjar merged commit 0f49e60 into elastic:main on Jan 8, 2025
11 checks passed
@lahsivjar deleted the lsminterval-limits branch on January 8, 2025 09:53