[RFC] Pre Compute Aggregations with Star Tree index #12498
Prototype Details
For the prototype, 'DocValuesFormat' in Lucene is extended to create the Star Tree index, with support for 'Numeric' fields as the dimensions of the Star Tree. The timestamp field is rounded off to various epoch granularities such as 'minute', 'hour', 'day', etc., and these rounded values are also added as dimensions to the Star Tree (a rounding sketch follows below).
Fig: Star Tree index structure (star tree indexing and index structure)
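To illustrate the rounding step, here is a minimal sketch in plain java.time (not the POC code, which bases its rounding on the date-histogram Rounding logic discussed later in this thread; class and method names here are made up):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Illustrative only: derive minute/hour/day granularity values from a raw
// epoch-millis timestamp so they can be added as extra Star Tree dimensions.
public final class TimestampDimensions {

    static long epochMinute(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).truncatedTo(ChronoUnit.MINUTES).toEpochMilli();
    }

    static long epochHour(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).truncatedTo(ChronoUnit.HOURS).toEpochMilli();
    }

    static long epochDay(long epochMillis) {
        // Instant treats a day as 86,400 seconds, i.e. a UTC day.
        return Instant.ofEpochMilli(epochMillis).truncatedTo(ChronoUnit.DAYS).toEpochMilli();
    }

    public static void main(String[] args) {
        long ts = 1_700_000_123_456L;
        System.out.printf("minute=%d hour=%d day=%d%n",
                epochMinute(ts), epochHour(ts), epochDay(ts));
    }
}
```

Each rounded value then becomes an additional dimension alongside the configured numeric fields.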
Creation Flow
The Star Tree index is created during the Lucene 'flush' operation.
Merge Flow
Lucene index files
TODO: Create an issue with Lucene on the new formats.
Star Tree query
POC code
bharath-techie@1d44e1a - OpenSearch code changes
bharath-techie/lucene@c993b9a - Lucene code changes
Benchmark Results
OpenSearch Setup
Benchmark Setup
Workload
Star Tree config
Dimensions
Metric
Search (primarily Aggregations)
Here is a summary of how Star Tree queries fared against aggregation queries on present-day indices (mainly against the Points index).
Observations
Indexing
Given that aggregation now happens during indexing, there are tradeoffs w.r.t. storage, indexing throughput, cumulative indexing times, etc., which are captured below.
Observations
Store size with respect to cardinality
The following captures store size with respect to the cardinality of the timestamp (as the timestamp has one of the highest cardinalities and is present in all scenarios).
Hence, super-high-cardinality fields must be avoided as part of the Star Tree index.
Only a 1000x performance improvement? :) These are awesome results @bharath-techie! One minor note, I'd encourage you to link any prototype code you have on your user fork in case folks want to dive into the implementation.
I'm sure there are lots of deep technical issues to dive into here, but I just want to highlight that to my knowledge we don't actually have a true "append-only index" abstraction in OpenSearch today. Data streams are partially that: only "create" operations are allowed when referencing the data stream entity. However, the underlying indexes that make up a data stream are still exposed to users and allow updates and deletes. I'm not sure if that is more of a side effect of the implementation, or if users truly need the option to update/delete documents on a limited basis even for a log analytics-style workload.
We should be able to apply the optimization on a per-segment basis, and fall back to the classic behavior on segments with deletes, right @bharath-techie? Hopefully not all segments would have deletes. Hopefully unmerged deletes would be a transient issue (especially now that it's safe to expunge deletes). Failing that, I kicked off a thread on lucene-dev to discuss the idea of making deletes cheaply iterable. In that case, we could use the star tree to get the aggregations ignoring deletions, then step through the deleted docs to decrement them from the relevant buckets.
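To sketch that last idea in code (illustrative only: it assumes the bucket dimension value is readable from doc values, and it walks the live-docs bitset, which is exactly the O(maxDoc) part the lucene-dev thread is about making cheaper):

```java
import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.util.Bits;

// Sketch only: subtract deleted documents from precomputed star-tree bucket counts.
// Assumes one numeric dimension (e.g. status) keys the buckets and the counts were
// read from the star tree while ignoring deletions.
final class DeletedDocCorrection {

    static void subtractDeletes(LeafReader reader, String dimField,
                                Map<Long, Long> bucketCounts) throws IOException {
        Bits liveDocs = reader.getLiveDocs();
        if (liveDocs == null) {
            return; // no deletes in this segment, star-tree counts are already exact
        }
        NumericDocValues dims = DocValues.getNumeric(reader, dimField);
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (liveDocs.get(doc)) {
                continue; // live doc, already counted correctly by the star tree
            }
            if (dims.advanceExact(doc)) {
                bucketCounts.merge(dims.longValue(), -1L, Long::sum);
            }
        }
    }
}
```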
Hi @andrross @msfroh, we can consider both approaches mentioned by @msfroh to address this gap
Updated the POC code links in the prototype section.
Out of curiosity, @bharath-techie, how does this compare to using a multidimensional points field? Your first dimension could be status code and your second dimension could be the timestamp. Traversing the BKD tree, I think you could also get some counts quickly. (Not as quickly as in the 1-D case, since you'd need to visit more "partial" leaves, but you could probably fix that by quantizing/rounding the timestamp dimension as you did here.)
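For context on the BKD counting idea, a rough sketch using the Lucene 9.x PointValues.PointTree API (single dimension, partial leaves omitted; this is not code from the proposal):

```java
import java.io.IOException;

import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.PointValues.PointTree;
import org.apache.lucene.index.PointValues.Relation;
import org.apache.lucene.util.ArrayUtil;

// Sketch only: count points whose (single-dimension) value lies in [min, max]
// by walking the BKD tree. Cells fully inside the range contribute size()
// without visiting individual documents.
final class BkdRangeCount {

    static long count(PointValues values, byte[] min, byte[] max) throws IOException {
        return countNode(values.getPointTree(), min, max, values.getBytesPerDimension());
    }

    private static long countNode(PointTree node, byte[] min, byte[] max, int bytesPerDim)
            throws IOException {
        Relation rel = relate(node.getMinPackedValue(), node.getMaxPackedValue(), min, max, bytesPerDim);
        if (rel == Relation.CELL_OUTSIDE_QUERY) {
            return 0;
        }
        if (rel == Relation.CELL_INSIDE_QUERY) {
            return node.size(); // whole cell matches, no need to visit docs
        }
        long count = 0;
        if (node.moveToChild()) {
            do {
                count += countNode(node, min, max, bytesPerDim);
            } while (node.moveToSibling());
            node.moveToParent();
        }
        // else: "partial" leaf, would need node.visitDocValues(...) for an exact count
        return count;
    }

    private static Relation relate(byte[] cellMin, byte[] cellMax, byte[] min, byte[] max, int bytesPerDim) {
        ArrayUtil.ByteArrayComparator cmp = ArrayUtil.getUnsignedComparator(bytesPerDim);
        if (cmp.compare(cellMax, 0, min, 0) < 0 || cmp.compare(cellMin, 0, max, 0) > 0) {
            return Relation.CELL_OUTSIDE_QUERY;
        }
        if (cmp.compare(cellMin, 0, min, 0) >= 0 && cmp.compare(cellMax, 0, max, 0) <= 0) {
            return Relation.CELL_INSIDE_QUERY;
        }
        return Relation.CELL_CROSSES_QUERY;
    }
}
```

Fully contained cells contribute size() without visiting documents; the "partial" leaves at the range boundaries are what get more expensive with a second dimension, which is where quantizing/rounding the timestamp would help.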
That's timely! @mgodwan and I were recently discussing context-based/aware indices where a restriction of no updates can be enforced. More details can be found at #12683.
Based on tests against a 1D points field (status aggregation, for example), BKD traversal is quite fast; the high latency always comes from the number of result-set documents that OpenSearch needs to aggregate upon. So I suspect the same will apply to 2D points as well.
Segment merge for the star tree index is lightweight, since it mostly has to do summation/subtraction. For deleted documents, can we create another 'deleted star tree index'?
The cost depends on the cardinality of the dimensions and the number of deleted documents in the segment. In worst-case scenarios, 'deleted star tree index' creation might take a lot of time. Also, I'm a bit unclear on where we'd create the new ST index, as we can't modify existing segments. Deletes are also discussed here; one of the approaches has similar cons when the number of deleted docs is too high.
For 1D points, we can aggregate via
For 2D points, the difference with the star tree seems to be that the star tree is precomputed to define the boundaries of individual rectangles (e.g. quantizing by hour). I'm wondering if we could inject some quantization (i.e. potential additional levels) into the BKD structure to achieve the same thing.
Thanks @msfroh for the detailed context. I'll do one round of benchmarks with 2D points on status and a rounded-off timestamp and capture how it fares.
@bharath-techie - Thank you for writing this detailed proposal. I really like the scale of potential improvements with this index, but I'm also wondering if the use case is generic enough to warrant a separate index. I might be wrong, since I have yet to go through the paper, but some considerations:
That being said, this is really good stuff, given it is not the default (customers need to explicitly enable it) and there would be some stats/multi-dimensional aggregation workloads.
Thanks for the detailed feedback @jainankitk. Let me try to address some of the points.
We've discussed some possible approaches for this here. Each has a cost attached to it; we need to see which is better overall.
There is an RFC which discusses this.
This is just one case for the POC; we can extend it to all the buckets search currently supports. For instance, the 'hour' bucket in the POC gives the same buckets as an hourly histogram; that was the intention, and it can be extended to all the other granularities.
Good point. We've not locked in the implementation for
I did another round of benchmarks on 2.14 - date histograms are indeed blazing fast 🔥 But when we couple other aggregations with them - say status terms or sum aggs - it's slow again, as the performance is proportional to the number of documents collected/traversed via DocValues. This is discussed to some extent earlier in the same RFC (as part of the weight.count optimization).
Benchmarks were done on an Amazon AWS EC2 r5.4xlarge instance on 2.14.
Plain query
Star tree query
Config:
Our OpenSearch/Lucene queries are already well optimized, but for most of the complex aggregations, efficiency is capped by the number of docs to traverse/collect, which the star tree solves. The RFC so far doesn't capture results for terms / multi-terms aggregations (we need to address some gaps like top-n aggs, handling high-cardinality terms, etc.), the cases we found to be the most performant.
@bharath-techie - Thank you for patiently addressing my comments and sharing some numbers with multi-field aggregation.
We can't go about storing just 'minute', right? Because then it needs to be a separate bucket for every minute across different hours, days, and years. I might be wrong since my understanding of the star tree is limited.
Yeah, this is an option. But again, we need to be careful about how much we are asking customers to configure. Ideally, we would like to have workload-tailored indices for optimal performance, but we chose BKD and term dictionaries because they work for most use cases and customers need not worry about them.
+1. Iterating through each of the documents is indeed very slow! :(
Maybe we can try a POC covering some or most of the date histogram aggregations.
Not sure I fully understand, but to elaborate on the idea: during query time, we can get the docs of all the nodes of the 'minute' dimension, and during collection of the docs as part of 'aggregation' we bucketize them as per the requested histogram; a small sketch of this follows below.
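A tiny sketch of that re-bucketing step (illustrative only; it assumes the star tree hands back per-minute pre-aggregated counts and the query asks for a coarser fixed interval such as hourly):

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a star tree built on epoch-minute values can still serve a
// coarser histogram, because each pre-aggregated (minute -> count) entry maps to
// exactly one coarser bucket.
final class HistogramRebucketer {

    /** Collapse minute-level counts into buckets of intervalMillis (e.g. 3_600_000 for hourly). */
    static Map<Long, Long> rebucket(Map<Long, Long> minuteCounts, long intervalMillis) {
        Map<Long, Long> buckets = new TreeMap<>();
        for (Map.Entry<Long, Long> e : minuteCounts.entrySet()) {
            long bucketKey = Math.floorDiv(e.getKey(), intervalMillis) * intervalMillis;
            buckets.merge(bucketKey, e.getValue(), Long::sum);
        }
        return buckets;
    }
}
```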
Ya, +1. We will integrate with templates initially, and also other such mechanisms, to suggest the best configuration to users. And multiple star trees are something we will incrementally expose to customers, with the base functionality already built to support them.
Do you see any challenges? If rounding off the timestamp while building the star tree index uses the same logic as at query time, we should be able to do it, right?
I am assuming you want to have, say, only status and minute dimensions from the above example. In that case, the buckets for the minute dimension cannot be only 0-59 or 1-60, because you need to further qualify each of those minutes into hours/days. Otherwise, how can I answer a query like a minutely aggregation on a range query from 2024-01-01 to 2024-01-02 without iterating over all the documents?
Ah, I get your question. I'm not mapping the minute to 0-59, but rather rounding off the epoch millis/seconds to the epoch minute. Code reference in the star tree: the rounding is directly based on the Rounding used in the date histogram. In fact, we can try reusing that same class if possible. Apart from this, I think there is a way to give a fixed interval as well in the histogram, which we'll need to evaluate.
But doesn't that make the cardinality "too high"? Just thinking about a few months of / yearly log data, the cardinality becomes ~100k for 2 months.
Rounding is a slightly expensive operation, though still much faster than individual document iteration. Maybe we can have multi-range traversal logic on the star tree similar to apache/lucene#13335, assuming we can traverse the minutely rounded epochs in sorted order.
A lot of users with a timeseries-type workload make a per-day index, something like
I think, from where @bharath-techie is coming from, we can still sort of control the cardinality if, in the dimension split order, we keep the timestamp at a higher level than the other dimensions. Something like this: timestamp -> status -> client_ip. If we do it the other way round (timestamp last), we would see a lot more branching (a rough sketch of this effect follows below).
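A back-of-the-envelope way to see the branching effect (toy code, not from the proposal; it just counts distinct value prefixes per depth, which roughly tracks how many internal nodes a given split order produces):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only: the number of distinct value-prefixes at each depth
// roughly tracks how many internal nodes the tree needs for a given dimension
// split order. Keeping a dimension with low per-segment cardinality (e.g. a few
// hours of timestamps) at the top keeps the upper levels narrow.
final class SplitOrderBranching {

    /** Each row holds one document's dimension values, ordered by the chosen split order. */
    static long approxNodeCount(List<long[]> rows, int numDims) {
        long total = 0;
        for (int depth = 1; depth <= numDims; depth++) {
            Set<String> prefixes = new HashSet<>();
            for (long[] row : rows) {
                StringBuilder prefix = new StringBuilder();
                for (int d = 0; d < depth; d++) {
                    prefix.append(row[d]).append('|');
                }
                prefixes.add(prefix.toString());
            }
            total += prefixes.size();
        }
        return total;
    }
}
```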
Yes, +1: choosing the right dimensionSplitOrder controls the storage size. But most importantly, in a particular segment, what will the timestamp's cardinality be? For log/time-series use cases, as long as we have efficient merge policies (example: context-aware segments), we will most likely have only a few hours of data in a segment.
Right now the 'star tree' acts as the lead iterator for the query - so far in the design, we have star tree + doc values. So once the star tree gives a set of documents for the dimensions which are in the tree, we further filter with doc values if there are any other dimensions in the filter (not part of the star tree as per the maxLeafDocs threshold); a sketch of this flow follows below. Some time back, we evaluated the idea of using the point tree as well for range queries alongside the star tree, but the point tree works well as a lead iterator rather than as a follow-up filter, right?
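Roughly, that two-phase flow looks like the following (illustrative only; it assumes the star tree exposes its candidate documents as a DocIdSetIterator and the non-tree dimension is read from SortedNumericDocValues - neither is settled API):

```java
import java.io.IOException;
import java.util.function.LongPredicate;

import org.apache.lucene.index.SortedNumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch only: the star tree acts as the lead iterator for the dimensions it
// indexes; dimensions outside the tree (e.g. pruned by the maxLeafDocs
// threshold) are post-filtered with doc values.
final class StarTreeLeadIterator {

    static long filteredCount(DocIdSetIterator starTreeDocs,       // candidate docs from the star tree
                              SortedNumericDocValues extraDim,     // doc values for a non-tree dimension
                              LongPredicate extraDimFilter) throws IOException {
        long count = 0;
        for (int doc = starTreeDocs.nextDoc();
             doc != DocIdSetIterator.NO_MORE_DOCS;
             doc = starTreeDocs.nextDoc()) {
            if (extraDim.advanceExact(doc) == false) {
                continue; // doc has no value for the extra dimension
            }
            for (int i = 0; i < extraDim.docValueCount(); i++) {
                if (extraDimFilter.test(extraDim.nextValue())) {
                    count++;
                    break;
                }
            }
        }
        return count;
    }
}
```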
I am not completely sure what the right branching is, especially between timestamp/status. For example, if we have a uniform distribution of status across timestamps, won't we have an equal number of buckets irrespective of the order of timestamp/status? Also, maybe status first is a better idea because you can quickly answer count queries on just the status code? Otherwise you need to go through all the timestamp values.
Did we evaluate the case of the query and aggregation being on the same field, or even on different ones?
Sorry, my bad, I overlooked the generation being at segment level while doing my analysis.
The paper highly recommends that we order the dimensions from highest cardinality to lowest cardinality for better pruning during queries. That said, we did benchmarks with both orderings during the POC, with
And from a query perspective,
This is something we will continue to tune based on benchmarks / load testing before we roll out to production. Note: the OSB benchmarks provide out-of-order data, so the timestamp has super-high cardinality in all segments. When the timestamp has lower cardinality, similar to production time-series workloads, or when coupled with context-aware segments for example, we can ideally put the user dimensions at the top and the timestamp columns below them.
We usually evaluate with the query and aggs being on different fields, but I get where you're coming from based on the histogram improvements.
Agreed. Needs more experimentation.
Good point, I keep forgetting about that. I am assuming that is because the sorted input data is partitioned across multiple threads?
Yes, that's right. But it looks like folks run multiple instances of the benchmark with
We are starting with the implementation of this RFC, and the meta issue tracks the list of issues and milestones planned for the near future. Please feel free to contribute / review / comment on this.
Is your feature request related to a problem? Please describe
Aggregations are the most used query type in observability use cases; the aggregations are typically on metrics, request logs, etc., and spread across different fields. Heavy aggregation queries don't have an upper bound in terms of time taken and resource consumption. In general, the performance of an aggregation query is relative to the number of documents.
Aggregations are faster when they are run on rolled-up indices and transforms, as these help reduce granularity and provide materialized views. But they are generally done once ingestion is complete; similar constructs don't exist for live/active data.
Existing solutions
Aggregation framework
The existing aggregation framework in OpenSearch is quite powerful, with bucket, pipeline, and metric aggregations. It supports a wide range of aggregations since it operates on the unaltered live documents of the indices. And as it operates on live original data, deletes and updates are also supported.
While it works great, it has a few cons:
Index rollups / Index transforms
Index rollup jobs let users reduce data granularity by rolling up old data into condensed indices; transform jobs let users create a different, summarized view of their data centered around certain fields. These solutions are generally used to save storage space, and queries are also faster as the number of documents is reduced.
Cons:
Describe the solution you'd like
Star tree index
I'm exploring the use case of pre-computing aggregations using a Star Tree index while indexing the data, based on the fields (dimensions and metrics) configured during index creation. This is inspired by http://hanj.cs.illinois.edu/pdf/vldb03_starcube.pdf and Apache Pinot's Star Tree index. A Star Tree helps enforce an upper bound on aggregation queries, ensuring predictable latency and resource usage; it is also storage-space efficient and configurable.
A Star Tree index is a multi-field index, in contrast to existing index structures in Lucene and OpenSearch, which are on a single field (a toy illustration follows below).
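To make the multi-field, pre-compute idea concrete, here is a toy illustration (not the actual Lucene data structure; the real index is a tree with star nodes, built during flush/merge): metrics are rolled up per combination of the configured dimensions at index time, so an aggregation query reads a few pre-computed rows instead of scanning every matching document.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: metrics are rolled up per combination of the configured
// dimensions at index time, so aggregation queries read a few pre-computed rows
// instead of scanning every document. Field names here are hypothetical.
final class PreAggregatedTable {

    record Key(int status, long epochHour) {}        // dimension tuple
    record Metrics(long count, double sumLatency) {} // pre-computed metrics

    private final Map<Key, Metrics> rows = new HashMap<>();

    void index(int status, long epochHour, double latency) {
        rows.merge(new Key(status, epochHour),
                new Metrics(1, latency),
                (a, b) -> new Metrics(a.count() + b.count(), a.sumLatency() + b.sumLatency()));
    }

    /** e.g. "sum of latency for status=500 in a given hour" without touching documents. */
    Metrics get(int status, long epochHour) {
        return rows.getOrDefault(new Key(status, epochHour), new Metrics(0, 0));
    }
}
```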
Improvements with Star Tree index
While it provides the above improvements, it does have limitations:
Given the above details, it makes a case for why the Star Tree is valuable to OpenSearch, and I'm working on prototyping it. I'll follow up on this issue with prototype details along with the benchmark results.
Related component
Search:Aggregations
Describe alternatives you've considered
Mentioned in the Existing Solutions.
Additional context
No response