[RFC] Pre Compute Aggregations with Star Tree index #12498
Prototype Details
For the prototype, 'DocValuesFormat' in Lucene is extended to create the Star Tree index, with support for 'Numeric' fields as the dimensions of the Star Tree. The timestamp field is rounded off to various epoch granularities such as 'minute', 'hour', 'day', etc., and these rounded values are also added as dimensions to the Star Tree (a rounding sketch follows below).
Fig: Star Tree index structure (star tree indexing and index structure)
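To illustrate the rounding step, here is a minimal sketch in plain java.time (not the POC code, which bases its rounding on the date-histogram Rounding logic discussed later in this thread; class and method names here are made up):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Illustrative only: derive minute/hour/day granularity values from a raw
// epoch-millis timestamp so they can be added as extra Star Tree dimensions.
public final class TimestampDimensions {

    static long epochMinute(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).truncatedTo(ChronoUnit.MINUTES).toEpochMilli();
    }

    static long epochHour(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).truncatedTo(ChronoUnit.HOURS).toEpochMilli();
    }

    static long epochDay(long epochMillis) {
        // Instant treats a day as 86,400 seconds, i.e. a UTC day.
        return Instant.ofEpochMilli(epochMillis).truncatedTo(ChronoUnit.DAYS).toEpochMilli();
    }

    public static void main(String[] args) {
        long ts = 1_700_000_123_456L;
        System.out.printf("minute=%d hour=%d day=%d%n",
                epochMinute(ts), epochHour(ts), epochDay(ts));
    }
}
```

Each rounded value then becomes an additional dimension alongside the configured numeric fields.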
Creation Flow
The Star Tree index is created during the Lucene 'flush' operation.
Merge Flow
Lucene index files
TODO: Create an issue with Lucene on the new formats.
Star Tree query
POC code
bharath-techie@1d44e1a - OpenSearch code changes
bharath-techie/lucene@c993b9a - Lucene code changes
Benchmark Results
OpenSearch Setup
Benchmark Setup
Workload
Star Tree config
Dimensions
Metric
Search (primarily Aggregations)
Here is a summary of how Star Tree queries fared against aggregation queries on present-day indices (mainly against the Points index).
Observations
Indexing
Given that aggregation now happens during indexing, there are tradeoffs w.r.t. storage, indexing throughput, cumulative indexing times, etc., which are captured below.
Observations
Store size with respect to cardinality
The following captures store size with respect to the cardinality of the timestamp (as the timestamp has one of the highest cardinalities and is present in all scenarios).
Hence, super-high-cardinality fields must be avoided as part of the Star Tree index.
Only a 1000x performance improvement? :) These are awesome results @bharath-techie! One minor note, I'd encourage you to link any prototype code you have on your user fork in case folks want to dive into the implementation.
I'm sure there are lots of deep technical issues to dive into here, but I just want to highlight that to my knowledge we don't actually have a true "append-only index" abstraction in OpenSearch today. Data streams are partially that: only "create" operations are allowed when referencing the data stream entity. However, the underlying indexes that make up a data stream are still exposed to users and allow updates and deletes. I'm not sure if that is more of a side effect of the implementation, or if users truly need the option to update/delete documents on a limited basis even for a log analytics-style workload.
We should be able to apply the optimization on a per-segment basis, and fall back to the classic behavior on segments with deletes, right @bharath-techie? Hopefully not all segments would have deletes. Hopefully unmerged deletes would be a transient issue (especially now that it's safe to expunge deletes). Failing that, I kicked off a thread on lucene-dev to discuss the idea of making deletes cheaply iterable. In that case, we could use the star tree to get the aggregations ignoring deletions, then step through the deleted docs to decrement them from the relevant buckets.
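To sketch that last idea in code (illustrative only: it assumes the bucket dimension value is readable from doc values, and it walks the live-docs bitset, which is exactly the O(maxDoc) part the lucene-dev thread is about making cheaper):

```java
import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.util.Bits;

// Sketch only: subtract deleted documents from precomputed star-tree bucket counts.
// Assumes one numeric dimension (e.g. status) keys the buckets and the counts were
// read from the star tree while ignoring deletions.
final class DeletedDocCorrection {

    static void subtractDeletes(LeafReader reader, String dimField,
                                Map<Long, Long> bucketCounts) throws IOException {
        Bits liveDocs = reader.getLiveDocs();
        if (liveDocs == null) {
            return; // no deletes in this segment, star-tree counts are already exact
        }
        NumericDocValues dims = DocValues.getNumeric(reader, dimField);
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (liveDocs.get(doc)) {
                continue; // live doc, already counted correctly by the star tree
            }
            if (dims.advanceExact(doc)) {
                bucketCounts.merge(dims.longValue(), -1L, Long::sum);
            }
        }
    }
}
```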
Hi @andrross @msfroh, we can consider both approaches mentioned by @msfroh to address this gap
Updated the POC code links in the prototype section.
Out of curiosity, @bharath-techie, how does this compare to using a multidimensional points field? Your first dimension could be status code and your second dimension could be the timestamp. Traversing the BKD tree, I think you could also get some counts quickly. (Not as quickly as in the 1-D case, since you'd need to visit more "partial" leaves, but you could probably fix that by quantizing/rounding the timestamp dimension as you did here.)
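For context on the BKD counting idea, a rough sketch using the Lucene 9.x PointValues.PointTree API (single dimension, partial leaves omitted; this is not code from the proposal):

```java
import java.io.IOException;

import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.PointValues.PointTree;
import org.apache.lucene.index.PointValues.Relation;
import org.apache.lucene.util.ArrayUtil;

// Sketch only: count points whose (single-dimension) value lies in [min, max]
// by walking the BKD tree. Cells fully inside the range contribute size()
// without visiting individual documents.
final class BkdRangeCount {

    static long count(PointValues values, byte[] min, byte[] max) throws IOException {
        return countNode(values.getPointTree(), min, max, values.getBytesPerDimension());
    }

    private static long countNode(PointTree node, byte[] min, byte[] max, int bytesPerDim)
            throws IOException {
        Relation rel = relate(node.getMinPackedValue(), node.getMaxPackedValue(), min, max, bytesPerDim);
        if (rel == Relation.CELL_OUTSIDE_QUERY) {
            return 0;
        }
        if (rel == Relation.CELL_INSIDE_QUERY) {
            return node.size(); // whole cell matches, no need to visit docs
        }
        long count = 0;
        if (node.moveToChild()) {
            do {
                count += countNode(node, min, max, bytesPerDim);
            } while (node.moveToSibling());
            node.moveToParent();
        }
        // else: "partial" leaf, would need node.visitDocValues(...) for an exact count
        return count;
    }

    private static Relation relate(byte[] cellMin, byte[] cellMax, byte[] min, byte[] max, int bytesPerDim) {
        ArrayUtil.ByteArrayComparator cmp = ArrayUtil.getUnsignedComparator(bytesPerDim);
        if (cmp.compare(cellMax, 0, min, 0) < 0 || cmp.compare(cellMin, 0, max, 0) > 0) {
            return Relation.CELL_OUTSIDE_QUERY;
        }
        if (cmp.compare(cellMin, 0, min, 0) >= 0 && cmp.compare(cellMax, 0, max, 0) <= 0) {
            return Relation.CELL_INSIDE_QUERY;
        }
        return Relation.CELL_CROSSES_QUERY;
    }
}
```

Fully contained cells contribute size() without visiting documents; the "partial" leaves at the range boundaries are what get more expensive with a second dimension, which is where quantizing/rounding the timestamp would help.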
That's timely! @mgodwan and I were recently discussing context-based/aware indices where a restriction of no updates can be enforced. More details can be found at #12683.
Based on tests against a 1D points field (status aggregation, for example), BKD traversal is quite fast; the high latency always comes from the number of result-set documents that OpenSearch needs to aggregate upon. So I suspect the same will apply to 2D points as well.
Segment merge for the star tree index is lightweight, since it mostly has to do summation/subtraction. For deleted documents, can we create another 'deleted star tree index'?
The cost depends on the cardinality of the dimensions and the number of deleted documents in the segment. In worst-case scenarios, 'deleted star tree index' creation might take a lot of time. Also, I'm a bit unclear on where we'd create the new ST index, as we can't modify existing segments. Deletes are also discussed here; one of the approaches has similar cons when the number of deleted docs is too high.
For 1D points, we can aggregate via
For 2D points, the difference with the star tree seems to be that the star tree is precomputed to define the boundaries of individual rectangles (e.g. quantizing by hour). I'm wondering if we could inject some quantization (i.e. potential additional levels) into the BKD structure to achieve the same thing.
Thanks @msfroh for the detailed context. I'll do one round of benchmarks with 2D points on status and a rounded-off timestamp and capture how it fares.
@bharath-techie - Thank you for writing this detailed proposal. I really like the scale of potential improvements with this index, but I'm also wondering if the use case is generic enough to warrant a separate index. I might be wrong, since I have yet to go through the paper, but some considerations:
That being said, this is really good stuff, given it is not the default (customers need to explicitly enable it) and there would be some stats/multi-dimensional aggregation workloads.
Thanks for the detailed feedback @jainankitk. Let me try to address some of the points.
We've discussed some possible approaches for this here. Each has a cost attached to it; we need to see which is better overall.
There is an RFC which discusses this.
This is just one case for the POC; we can extend it to all the buckets search currently supports. For instance, the 'hour' bucket in the POC gives the same buckets as an hourly histogram; that was the intention, and it can be extended to all the other granularities.
Good point. We've not locked in the implementation for
I did another round of benchmarks on 2.14 - date histograms are indeed blazing fast 🔥 But when we couple other aggregations with them - say status terms or sum aggs - it's slow again, as the performance is proportional to the number of documents collected/traversed via DocValues. This is discussed to some extent earlier in the same RFC (as part of the weight.count optimization).
Benchmarks were done on an Amazon AWS EC2 r5.4xlarge instance on 2.14.
Plain query
Star tree query
Config:
Our OpenSearch/Lucene queries are already well optimized, but for most of the complex aggregations, efficiency is capped by the number of docs to traverse/collect, which the star tree solves. The RFC so far doesn't capture results for terms / multi-terms aggregations (we need to address some gaps like top-n aggs, handling high-cardinality terms, etc.), the cases we found to be the most performant.
@bharath-techie - Thank you for patiently addressing my comments and sharing some numbers with multi-field aggregation.
We can't go about storing just 'minute', right? Because then it needs to be a separate bucket for every minute across different hours, days, and years. I might be wrong since my understanding of the star tree is limited.
Yeah, this is an option. But again, we need to be careful about how much we are asking customers to configure. Ideally, we would like to have workload-tailored indices for optimal performance, but we chose BKD and term dictionaries because they work for most use cases and customers need not worry about them.
+1. Iterating through each of the documents is indeed very slow! :(
Maybe we can try a POC covering some or most of the date histogram aggregations.
Not sure I fully understand, but to elaborate on the idea: during query time, we can get the docs of all the nodes of the 'minute' dimension, and during collection of the docs as part of 'aggregation' we bucketize them as per the requested histogram; a small sketch of this follows below.
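A tiny sketch of that re-bucketing step (illustrative only; it assumes the star tree hands back per-minute pre-aggregated counts and the query asks for a coarser fixed interval such as hourly):

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a star tree built on epoch-minute values can still serve a
// coarser histogram, because each pre-aggregated (minute -> count) entry maps to
// exactly one coarser bucket.
final class HistogramRebucketer {

    /** Collapse minute-level counts into buckets of intervalMillis (e.g. 3_600_000 for hourly). */
    static Map<Long, Long> rebucket(Map<Long, Long> minuteCounts, long intervalMillis) {
        Map<Long, Long> buckets = new TreeMap<>();
        for (Map.Entry<Long, Long> e : minuteCounts.entrySet()) {
            long bucketKey = Math.floorDiv(e.getKey(), intervalMillis) * intervalMillis;
            buckets.merge(bucketKey, e.getValue(), Long::sum);
        }
        return buckets;
    }
}
```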
Ya, +1. We will integrate with templates initially, and also other such mechanisms, to suggest the best configuration to users. And multiple star trees are something we will incrementally expose to customers, with the base functionality already built to support them.
Do you see any challenges? If rounding off the timestamp while building the star tree index uses the same logic as at query time, we should be able to do it, right?
I am assuming you want to have, say, only status and minute dimensions from the above example. In that case, the buckets for the minute dimension cannot be only 0-59 or 1-60, because you need to further qualify each of those minutes into hours/days. Otherwise, how can I answer a query like a minutely aggregation on a range query from 2024-01-01 to 2024-01-02 without iterating over all the documents?
Ah, I get your question. I'm not mapping the minute to 0-59, but rather rounding off the epoch millis/seconds to the epoch minute. Code reference in the star tree: the rounding is directly based on the Rounding used in the date histogram. In fact, we can try reusing that same class if possible. Apart from this, I think there is a way to give a fixed interval as well in the histogram, which we'll need to evaluate.
But doesn't that make the cardinality "too high"? Just thinking about a few months of / yearly log data, the cardinality becomes ~100k for 2 months.
Rounding is a slightly expensive operation, though still much faster than individual document iteration. Maybe we can have multi-range traversal logic on the star tree similar to apache/lucene#13335, assuming we can traverse the minutely rounded epochs in sorted order.
A lot of users with a timeseries-type workload make a per-day index, something like
I think, from where @bharath-techie is coming from, we can still sort of control the cardinality if, in the dimension split order, we keep the timestamp at a higher level than the other dimensions. Something like this: timestamp -> status -> client_ip. If we do it the other way round (timestamp last), we would see a lot more branching (a rough sketch of this effect follows below).
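A back-of-the-envelope way to see the branching effect (toy code, not from the proposal; it just counts distinct value prefixes per depth, which roughly tracks how many internal nodes a given split order produces):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only: the number of distinct value-prefixes at each depth
// roughly tracks how many internal nodes the tree needs for a given dimension
// split order. Keeping a dimension with low per-segment cardinality (e.g. a few
// hours of timestamps) at the top keeps the upper levels narrow.
final class SplitOrderBranching {

    /** Each row holds one document's dimension values, ordered by the chosen split order. */
    static long approxNodeCount(List<long[]> rows, int numDims) {
        long total = 0;
        for (int depth = 1; depth <= numDims; depth++) {
            Set<String> prefixes = new HashSet<>();
            for (long[] row : rows) {
                StringBuilder prefix = new StringBuilder();
                for (int d = 0; d < depth; d++) {
                    prefix.append(row[d]).append('|');
                }
                prefixes.add(prefix.toString());
            }
            total += prefixes.size();
        }
        return total;
    }
}
```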
Yes, +1: choosing the right dimensionSplitOrder controls the storage size. But most importantly, in a particular segment, what will the timestamp's cardinality be? For log/time-series use cases, as long as we have efficient merge policies (example: context-aware segments), we will most likely have only a few hours of data in a segment.
Right now the 'star tree' acts as the lead iterator for the query - so far in the design, we have star tree + doc values. So once the star tree gives a set of documents for the dimensions which are in the tree, we further filter with doc values if there are any other dimensions in the filter (not part of the star tree as per the maxLeafDocs threshold); a sketch of this flow follows below. Some time back, we evaluated the idea of using the point tree as well for range queries alongside the star tree, but the point tree works well as a lead iterator rather than as a follow-up filter, right?
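Roughly, that two-phase flow looks like the following (illustrative only; it assumes the star tree exposes its candidate documents as a DocIdSetIterator and the non-tree dimension is read from SortedNumericDocValues - neither is settled API):

```java
import java.io.IOException;
import java.util.function.LongPredicate;

import org.apache.lucene.index.SortedNumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch only: the star tree acts as the lead iterator for the dimensions it
// indexes; dimensions outside the tree (e.g. pruned by the maxLeafDocs
// threshold) are post-filtered with doc values.
final class StarTreeLeadIterator {

    static long filteredCount(DocIdSetIterator starTreeDocs,       // candidate docs from the star tree
                              SortedNumericDocValues extraDim,     // doc values for a non-tree dimension
                              LongPredicate extraDimFilter) throws IOException {
        long count = 0;
        for (int doc = starTreeDocs.nextDoc();
             doc != DocIdSetIterator.NO_MORE_DOCS;
             doc = starTreeDocs.nextDoc()) {
            if (extraDim.advanceExact(doc) == false) {
                continue; // doc has no value for the extra dimension
            }
            for (int i = 0; i < extraDim.docValueCount(); i++) {
                if (extraDimFilter.test(extraDim.nextValue())) {
                    count++;
                    break;
                }
            }
        }
        return count;
    }
}
```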
I am not completely sure what the right branching is, especially between timestamp/status. For example, if we have a uniform distribution of status across timestamps, won't we have an equal number of buckets irrespective of the order of timestamp/status? Also, maybe status first is a better idea because you can quickly answer count queries on just the status code? Otherwise you need to go through all the timestamp values.
Did we evaluate the case of the query and aggregation being on the same field, or even on different ones?
Sorry, my bad, I overlooked the generation being at segment level while doing my analysis.
The paper highly recommends that we order the dimensions from highest cardinality to lowest cardinality for better pruning during queries. That said, we did benchmarks with both orderings during the POC, with
And from a query perspective,
This is something we will continue to tune based on benchmarks / load testing before we roll out to production. Note: the OSB benchmarks provide out-of-order data, so the timestamp has super-high cardinality in all segments. When the timestamp has lower cardinality, similar to production time-series workloads, or when coupled with context-aware segments for example, we can ideally put the user dimensions at the top and the timestamp columns below them.
We usually evaluate with the query and aggs being on different fields, but I get where you're coming from based on the histogram improvements.
Agreed. Needs more experimentation.
Good point, I keep forgetting about that. I am assuming that is because the sorted input data is partitioned across multiple threads?
Yes, that's right. But it looks like folks run multiple instances of the benchmark with
We are starting with the implementation of this RFC, and the meta issue tracks the list of issues and milestones planned for the near future. Please feel free to contribute / review / comment on this.
Is your feature request related to a problem? Please describe
Aggregations are the most used query type in observability use cases; the aggregations are typically on metrics, request logs, etc., and spread across different fields. Heavy aggregation queries don't have an upper bound in terms of time taken and resource consumption. In general, the performance of an aggregation query is relative to the number of documents.
Aggregations are faster when they are run on rolled-up indices and transforms, as these help reduce granularity and provide materialized views. But they are generally done once ingestion is complete; similar constructs don't exist for live/active data.
Existing solutions
Aggregation framework
The existing aggregation framework in OpenSearch is quite powerful, with bucket, pipeline, and metric aggregations. It supports a wide range of aggregations since it operates on the unaltered live documents of the indices. And as it operates on live original data, deletes and updates are also supported.
While it works great, it has a few cons:
Index rollups / Index transforms
Index rollup jobs let users reduce data granularity by rolling up old data into condensed indices; transform jobs let users create a different, summarized view of their data centered around certain fields. These solutions are generally used to save storage space, and queries are also faster as the number of documents is reduced.
Cons:
Describe the solution you'd like
Star tree index
I'm exploring the use case of pre-computing aggregations using a Star Tree index while indexing the data, based on the fields (dimensions and metrics) configured during index creation. This is inspired by http://hanj.cs.illinois.edu/pdf/vldb03_starcube.pdf and Apache Pinot's Star Tree index. A Star Tree helps enforce an upper bound on aggregation queries, ensuring predictable latency and resource usage; it is also storage-space efficient and configurable.
A Star Tree index is a multi-field index, in contrast to existing index structures in Lucene and OpenSearch, which are on a single field (a toy illustration follows below).
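To make the multi-field, pre-compute idea concrete, here is a toy illustration (not the actual Lucene data structure; the real index is a tree with star nodes, built during flush/merge): metrics are rolled up per combination of the configured dimensions at index time, so an aggregation query reads a few pre-computed rows instead of scanning every matching document.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: metrics are rolled up per combination of the configured
// dimensions at index time, so aggregation queries read a few pre-computed rows
// instead of scanning every document. Field names here are hypothetical.
final class PreAggregatedTable {

    record Key(int status, long epochHour) {}        // dimension tuple
    record Metrics(long count, double sumLatency) {} // pre-computed metrics

    private final Map<Key, Metrics> rows = new HashMap<>();

    void index(int status, long epochHour, double latency) {
        rows.merge(new Key(status, epochHour),
                new Metrics(1, latency),
                (a, b) -> new Metrics(a.count() + b.count(), a.sumLatency() + b.sumLatency()));
    }

    /** e.g. "sum of latency for status=500 in a given hour" without touching documents. */
    Metrics get(int status, long epochHour) {
        return rows.getOrDefault(new Key(status, epochHour), new Metrics(0, 0));
    }
}
```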
Improvements with Star Tree index
While it provides the above improvements, it does have limitations:
Given the above details, it makes a case for why the Star Tree is valuable to OpenSearch, and I'm working on prototyping it. I'll follow up on this issue with prototype details along with the benchmark results.
Related component
Search:Aggregations
Describe alternatives you've considered
Mentioned in the Existing Solutions.
Additional context
No response