[processor/lsminterval] Define cardinality limits and handle overflows #235
Conversation
// Metrics doesn't have overflows (only datapoints have)
// Clone it *without* the datapoint data
Doesn't this mean we risk OOM due to high cardinality metric names? Should the limit be on time series (metric + dimensions) per scope, rather than data points per scope? Or metrics per scope and time series per metric?
At a high level I think the approach looks good. As mentioned in a comment, I think we should have limits on all parts of the hierarchy, including metrics.
I ran the benchmarks, and it seems there's quite a drop in throughput.
Yeah, there is a considerable overhead. This is what I am working on right now -- to improve performance and drop some of the unneeded complexity around attribute matching and filtering.
@axw I have addressed all of the comments other than one, and I am marking this ready for review now. The performance consideration is still there, and I have a few threads to chase down for improvements, but I expect them to give only minor gains, so if you have any ideas (even rough ones) I will be more than happy to chase them down. One source of the extra overhead is creating the lookup maps when unmarshaling the binary into the Go struct -- previously I did this only when required, but now I always do it to simplify overflow handling a bit. I have left adding metric limits to a follow-up PR. I will also add a few more detailed overflow tests.
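For illustration, the eager-vs-lazy trade-off mentioned above can be sketched as follows (hypothetical types, not the processor's actual code): building the lookup map lazily on first use avoids the map-construction cost for values that are merged but never queried, at the price of a nil check on the hot path.

```go
package main

import "fmt"

// value is a hypothetical sketch of an unmarshaled merged value whose
// lookup map is built lazily on first use instead of eagerly inside
// Unmarshal.
type value struct {
	entries []string
	index   map[string]int // nil until the first lookup
}

// lookup builds the index on demand and then serves lookups from it.
func (v *value) lookup(name string) (int, bool) {
	if v.index == nil {
		v.index = make(map[string]int, len(v.entries))
		for i, e := range v.entries {
			v.index[e] = i
		}
	}
	i, ok := v.index[name]
	return i, ok
}

func main() {
	v := &value{entries: []string{"cpu", "mem"}}
	i, ok := v.lookup("mem")
	fmt.Println(i, ok) // 1 true
}
```

Always building the map (as the PR now does) simplifies overflow handling because every code path can assume the index exists, which is exactly the trade-off described above.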
Preliminary comments, ran out of time today
Looks good overall, mostly minor comments/questions
Bringing this back to draft, as I have a few good ideas for optimization, and combined with @axw's comments above I have a few more things to do here. (I have also pushed a different way to encode limit trackers, independent of the pmetric data structure.)
After the above changes, below is the benchmark diff from `main`.
TL;DR: there is still ~20% performance degradation with overflows. We could eliminate the overflow path when it is not configured, but that would only give us a false sense of improvement since, for our use case, we would always have overflows configured. I have another PR open which optimizes the pebble options for our use case, and that will bring improvements to this PR too, but the diff between with and without overflow will still be approximately the same.
Getting back up to speed after PTO - mostly minor comments
}

// Marshal marshals the tracker to a byte slice.
func (t *Tracker) Marshal() ([]byte, error) {
Should this also be AppendBinary? Then we could avoid a slice allocation & copy in Value.Marshal.
Sounds good. We won't be able to fully utilize this, though, since it is not possible to estimate the size of the tracker prior to marshaling, so we could still end up reallocating the slice due to axiomhq/hyperloglog#44.
I tried this out, and the resulting code to encode trackers gets a bit messy because there is no way to calculate the marshaled size beforehand. Maybe we could defer this until we have AppendBinary in the hll (the PR you created) and a way to estimate the size?
Yeah no worries, we can follow up on this
Co-authored-by: Andrew Wilkins <axwalk@gmail.com>
@axw I was already planning to add resource datapoint limits, but now I am questioning whether scope limits and scope datapoint limits are required or even useful. Maybe just having resource limits and resource datapoint limits is enough for all use cases? Do you see a valid case where scope limits would be useful?
I don't think we would encounter high-cardinality scopes, but it may still be useful to have an explicit limit for scopes independent of metrics/datapoints, to prevent high-cardinality metric names or dimensions from washing out scopes. That's the same reason I think we should have a metric limit. So what I have in mind for limits is to follow the hierarchy: resource, resource-scope, resource-scope-metric, resource-scope-metric-datapoint. Why were you planning to add a resource-datapoint limit?
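The proposed hierarchy could be expressed in configuration roughly as follows (hypothetical field names, a sketch only; the processor's actual config may differ). The point of one explicit limit per level is that cardinality blowing up at a lower level, such as metric names or dimensions, consumes only that level's budget and cannot wash out the levels above it.

```go
package main

import "fmt"

// LimitConfig is a hypothetical per-level cardinality limit.
type LimitConfig struct {
	MaxCardinality int64 // maximum distinct entries tracked at this level
}

// Config sketches one limit per level of the proposed hierarchy:
// resource -> resource-scope -> resource-scope-metric ->
// resource-scope-metric-datapoint.
type Config struct {
	ResourceLimit  LimitConfig // distinct resources
	ScopeLimit     LimitConfig // scopes per resource
	MetricLimit    LimitConfig // metrics per scope
	DatapointLimit LimitConfig // datapoints per metric
}

// Validate rejects negative limits.
func (c Config) Validate() error {
	for _, l := range []LimitConfig{
		c.ResourceLimit, c.ScopeLimit, c.MetricLimit, c.DatapointLimit,
	} {
		if l.MaxCardinality < 0 {
			return fmt.Errorf("max_cardinality must be non-negative, got %d", l.MaxCardinality)
		}
	}
	return nil
}

func main() {
	cfg := Config{
		ResourceLimit:  LimitConfig{MaxCardinality: 1000},
		ScopeLimit:     LimitConfig{MaxCardinality: 10},
		MetricLimit:    LimitConfig{MaxCardinality: 500},
		DatapointLimit: LimitConfig{MaxCardinality: 2000},
	}
	fmt.Println(cfg.Validate() == nil) // true
}
```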
Looks good now, thanks for the updates. Let's resolve the discussion about limits and then get it in :)
This also sounds good to me.
The main intention for adding this was to cater to cases where we are not sure how many scope metrics there can be within a scope but don't want to waste the limit either, i.e. if we expect the number of scope metrics to have high variance, we could use just a resource datapoint limit and avoid losing the unutilized limits for resources with a lower number of scopes... but now that I think about it, I am not sure scope metrics are meant to be high cardinality - maybe I got ahead of the problem without checking that it is actually a problem 😓.
Do you want to do that in this PR, or in a followup?
I will do it in a followup.
Related to: #141
For more details, check out the #141 (comment) thread.