[RFC] Pre Compute Aggregations with Star Tree index #12498

Open
bharath-techie opened this issue Feb 29, 2024 · 37 comments
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing Performance This is for any performance related enhancements or bugs RFC Issues requesting major changes Roadmap:Cost/Performance/Scale Project-wide roadmap label Search:Aggregations v2.17.0

Comments

@bharath-techie
Contributor

bharath-techie commented Feb 29, 2024

Is your feature request related to a problem? Please describe

Aggregations are the most used query type in observability use cases; they typically run over metrics, request logs, etc., and span different fields. Heavy aggregation queries don't have an upper bound in terms of time taken or resources consumed. In general, the performance of an aggregation query is proportional to the number of documents.

Aggregations are faster when they are run on rolled-up indices and transforms, as these reduce granularity and provide materialized views. But they are generally built once ingestion is complete; similar constructs don't exist for live/active data.

Existing solutions

Aggregation framework

The existing aggregation framework in OpenSearch is quite powerful, with bucket, pipeline and metric aggregations. It supports a wide range of aggregations since it operates on the unaltered live documents of the indices. And because it operates on live, original data, deletes and updates are also supported.

While it works great, it has a few cons:

  • Query latency scales linearly with the number of documents.
    • For example, an aggregation which measures the count of ‘status’ in the HTTP logs workload in OSB takes a lot longer for the '200' status [~200M documents] compared to the '400' status [~3000 documents]
  • If there are multiple aggregations on different fields, each field / aggregation is processed separately, which generally leads to higher query latencies.
    • The following example captures query time as the user adds more aggregations:
      • if the user queries on status = 200 + aggregation of sum(size) = ~4s
      • if the user queries on status = 200 + aggregations of sum(size), avg(size) = ~6s
      • if the user queries on status = 200 + aggregations of sum(size), avg(size), max(size) = ~7.5s

Index rollups / Index transforms

Index rollup jobs let users reduce data granularity by rolling up old data into condensed indices, while transform jobs let users create a different, summarized view of their data centered around certain fields. These solutions are generally used to save storage space, and queries are faster because the number of documents is reduced.

Cons:

  • These solutions are generally used to reduce the granularity of older data and are not used on real-time data.
  • Configuring the jobs and managing the indices is cumbersome.
    • Cannot search a mix of rollup and non-rollup indexes with the same query
    • Initial rollup job can become expensive depending on the size of the data

Describe the solution you'd like

Star tree index

I’m exploring the use case of pre-computing aggregations using a Star Tree index while indexing the data, based on the fields (dimensions and metrics) configured during index creation. This is inspired by http://hanj.cs.illinois.edu/pdf/vldb03_starcube.pdf and Apache Pinot’s Star Tree index. The Star Tree helps enforce an upper bound on aggregation queries, ensuring predictable latency and resource usage; it is also storage-space efficient and configurable.

The Star Tree index is a multi-field index, contrary to existing index structures in Lucene and OpenSearch which are built on a single field.
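To make the pre-computation idea concrete, here is a minimal, illustrative Java sketch (not the actual OpenSearch/Lucene implementation; the class and field names are hypothetical): documents are grouped by the configured dimensions (e.g. status, hour) and the configured metrics (e.g. sum and count of size) are folded into one aggregated row per dimension combination.

```java
import java.util.*;

// Illustrative only: a "document" with two dimensions and one metric field.
record LogDoc(int status, long hour, long size) {}

// One pre-aggregated row: running metric aggregations for a dimension combination.
final class AggregatedRow {
    long sumSize = 0;
    long count = 0;
}

public class PreAggregationSketch {
    public static void main(String[] args) {
        List<LogDoc> docs = List.of(
            new LogDoc(200, 11, 512),
            new LogDoc(200, 11, 1024),
            new LogDoc(404, 12, 256));

        // Group by the (status, hour) dimension tuple and fold the metrics.
        Map<List<Long>, AggregatedRow> rows = new HashMap<>();
        for (LogDoc doc : docs) {
            List<Long> key = List.of((long) doc.status(), doc.hour());
            AggregatedRow row = rows.computeIfAbsent(key, k -> new AggregatedRow());
            row.sumSize += doc.size();
            row.count++;
        }

        // A query such as avg(size) for status=200, hour=11 now reads one row
        // instead of scanning every matching document.
        AggregatedRow r = rows.get(List.of(200L, 11L));
        System.out.println("avg(size) for status=200, hour=11 = " + (double) r.sumSize / r.count);
    }
}
```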

Improvements with Star Tree index

  • Users can perform faster aggregations with a constant upper bound on query latency.
  • Star tree natively supports multi-field aggregations.
  • The Star Tree index is created in real time as part of regular indexing, so the data in the star tree is always up to date with the live data.
  • The Star Tree index consolidates the data and is a storage-efficient index, and hence brings the following benefits:
    • Complex aggregation queries consume a fraction of the IO compared to present solutions.
    • Paging is also more efficient, as less data gets paged in for every star tree query.

While it provides the above improvements, it does have limitations:

  • The Star Tree index will be created only on append-only indices; updates or deletes will not be supported.
  • The Star Tree index will be used for aggregation queries only if the query input is a subset of the star tree's configured dimensions and metrics.
  • The Star Tree index will support a limited set of aggregations such as MIN, MAX, COUNT, SUM and AVG. Others are yet to be explored.
  • The cardinality of the dimensions should not be very high (like _id fields), otherwise it leads to storage explosion and higher query latencies.
  • Changing the Star Tree config (dimensions or metrics) will require a re-index operation.

Given the above details, it makes a case for why Star Tree is valuable to OpenSearch. I’m working on prototyping it and will follow up on this issue with prototype details along with the benchmark results.

Related component

Search:Aggregations

Describe alternatives you've considered

Mentioned in the Existing Solutions.

Additional context

No response

@bharath-techie bharath-techie added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 29, 2024
@bharath-techie bharath-techie self-assigned this Feb 29, 2024
@backslasht backslasht added the Indexing Indexing, Bulk Indexing and anything related to indexing label Feb 29, 2024
@bharath-techie
Contributor Author

bharath-techie commented Feb 29, 2024

Prototype Details

For the prototype, ‘DocValuesFormat’ in Lucene is extended to create the Star Tree index with support for ‘Numeric’ fields as the dimensions of the Star Tree. The timestamp field is rounded off to various epoch granularities such as ‘minute’, ‘hour’, ‘day’ etc., and these are also added as dimensions to the Star Tree.
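As a rough illustration of the rounding described above (not the prototype's code; roundTo is a hypothetical helper), an epoch timestamp can be truncated to several granularities, each of which becomes its own dimension:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class TimestampDimensions {
    // Round an epoch-millis timestamp down to the given unit (supported up to DAYS here;
    // coarser units like month/year need calendar-aware rounding).
    static long roundTo(long epochMillis, ChronoUnit unit) {
        return Instant.ofEpochMilli(epochMillis).truncatedTo(unit).toEpochMilli();
    }

    public static void main(String[] args) {
        long ts = 1_709_211_000_123L; // an arbitrary epoch-millis value
        System.out.println("minute dimension: " + roundTo(ts, ChronoUnit.MINUTES));
        System.out.println("hour dimension:   " + roundTo(ts, ChronoUnit.HOURS));
        System.out.println("day dimension:    " + roundTo(ts, ChronoUnit.DAYS));
    }
}
```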

Fig: Star Tree index structure

Star Tree indexing

Star tree index structure

  • The Star Tree index structure, as portrayed in the figure above, consists of two main parts: the Star Tree itself, and sorted + aggregated documents backed by doc values indices.

  • Each node in the Star Tree points to a range of documents.

    • A node is further split into child nodes based on the maxLeafDocs configuration.
    • The number of documents a leaf node points to should be less than or equal to maxLeafDocs. This ensures that the number of documents traversed to get to the aggregated value is at most maxLeafDocs, thus providing predictable latencies.
  • There are special nodes called ‘star nodes’ which help in skipping non-competitive nodes and in fetching the aggregated document wherever applicable during query time (see the sketch after this list).

  • The figure contains three examples explaining star tree traversal during a query: i) average request size for all requests with hour 11 and status 200, ii) count of requests for status = 200 (across all hours, hence hour is set to *), and iii) average request size of all requests for hour 12 (status is set to *). Examples (ii) and (iii) use star nodes.
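The sketch below is a toy, in-memory rendering of the node layout described above - with children keyed by dimension value and a separate star ('*') child - purely to illustrate the structure; the actual prototype encodes this into Lucene segment files, and all names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory star tree node (illustrative; the prototype serializes this into segment files).
final class StarTreeNode {
    final int dimensionId;          // which dimension this node splits on
    final long startDocId;          // first pre-aggregated doc covered by this node
    final long endDocId;            // last pre-aggregated doc (exclusive)
    final Map<Long, StarTreeNode> children = new HashMap<>(); // dimension value -> child
    StarTreeNode starChild;         // the '*' child: aggregates over all values of this dimension

    StarTreeNode(int dimensionId, long startDocId, long endDocId) {
        this.dimensionId = dimensionId;
        this.startDocId = startDocId;
        this.endDocId = endDocId;
    }

    boolean isLeaf() {
        return children.isEmpty() && starChild == null;
    }
}
```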

Creation Flow

The Star Tree index is created during the Lucene ‘flush’ operation.

  1. The dimension and metric values are fetched from the segment using the corresponding doc values indices. These records are then sorted by the dimension split order and the metrics are aggregated.
  2. An initial Star Tree is built over this sorted, summarized data (a simplified sketch follows this list). Further hierarchical splitting and aggregation creates the specialized star nodes, enabling pruning and aggregation shortcuts for queries.
  3. The Star Tree is then expanded further for dimensions based on the ‘maxLeafDocs’ configuration set during index creation.
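A simplified sketch of steps 1-2, assuming an in-memory list of segment records (the real implementation works off heap against doc values, and the names below are hypothetical): records are sorted by the dimension split order, and adjacent records with identical dimension values are collapsed by folding their metrics.

```java
import java.util.*;

public class StarTreeFlushSketch {
    // One segment record: dimension values in split order, plus metric values.
    record SegmentRecord(long[] dimensions, double[] metrics) {}

    // Sort by the dimension split order, then merge adjacent records with equal
    // dimensions by summing their metrics (SUM/COUNT-style aggregation).
    static List<SegmentRecord> sortAndAggregate(List<SegmentRecord> records) {
        List<SegmentRecord> sorted = new ArrayList<>(records);
        sorted.sort((a, b) -> Arrays.compare(a.dimensions(), b.dimensions()));

        List<SegmentRecord> aggregated = new ArrayList<>();
        for (SegmentRecord rec : sorted) {
            SegmentRecord last = aggregated.isEmpty() ? null : aggregated.get(aggregated.size() - 1);
            if (last != null && Arrays.equals(last.dimensions(), rec.dimensions())) {
                for (int m = 0; m < last.metrics().length; m++) {
                    last.metrics()[m] += rec.metrics()[m];   // fold the metric into the existing row
                }
            } else {
                aggregated.add(new SegmentRecord(rec.dimensions().clone(), rec.metrics().clone()));
            }
        }
        return aggregated;
    }
}
```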

Merge Flow

  1. The merge flow reuses the pre-aggregated Star Tree docs that were created during the flush operation.
  2. After that, the merge flow repeats the star tree construction process [from step 2 of the creation flow], as sketched below.
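A hedged sketch of the merge idea, reusing the hypothetical sortAndAggregate helper from the flush sketch above (assumed to live in the same package): because each segment already holds pre-aggregated rows, merging is mostly a concatenate-and-re-aggregate over far fewer rows than the original documents.

```java
import java.util.ArrayList;
import java.util.List;

public class StarTreeMergeSketch {
    // Reuse the pre-aggregated rows of each segment: concatenate them and run the
    // same sort-and-aggregate step used at flush time (step 2 of the creation flow).
    static List<StarTreeFlushSketch.SegmentRecord> merge(
            List<List<StarTreeFlushSketch.SegmentRecord>> segments) {
        List<StarTreeFlushSketch.SegmentRecord> all = new ArrayList<>();
        for (List<StarTreeFlushSketch.SegmentRecord> segment : segments) {
            all.addAll(segment);
        }
        // Far fewer rows than the original documents, so this is cheaper than a fresh flush.
        return StarTreeFlushSketch.sortAndAggregate(all);
    }
}
```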

Lucene index files

  • The Star Tree index is stored under a (new) extension ‘.sti’ as part of the Lucene segment.
  • Individual doc values indices for all the dimensions and metrics of the Star Tree are saved under (new) extensions ‘.std’ and ‘.stm’.

TODO: Create an issue with Lucene on the new formats.

Star Tree query

  • A new query and aggregation spec is introduced for Star Tree queries. With this new query, OpenSearch traverses the Star Tree index and the corresponding (Star Tree) doc values to get the results (a toy traversal sketch follows this list).
  • In Lucene, ‘DocValuesReader’ is extended to read the Star Tree indices, and a new method ‘getAggregatedValues’ is exposed in LeafReader to read the Star Tree indices at query time. (TODO: Open issue with Lucene.)
  • For the prototype, only ‘Numeric’ fields are supported; actual term queries on ‘text’, which involve the postings format, will be compared against later when text field support is introduced in Star Tree.
  • Star Tree query support can be extended for:
    • Term aggregations and multi-term aggregations
    • Date histogram aggregations
    • Range query aggregations
    • Filter aggregations - filters can be on any number of dimensions in a single query. But across dimensions, only ‘MUST’ will be supported, as filters are applied on one dimension after the next as per the dimension split order.
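To illustrate how a query maps onto the tree (assuming the hypothetical StarTreeNode sketch from earlier, in the same package), dimensions that carry a filter value descend into the matching child, while dimensions without a filter take the star child so that their values are skipped entirely:

```java
import java.util.Map;

public class StarTreeQuerySketch {
    // Walk the tree following the dimension split order. `filters` maps dimensionId to the
    // required value; dimensions absent from the map take the star child. Returns the node
    // whose [startDocId, endDocId) range of pre-aggregated documents should be read, or
    // null if no documents match.
    static StarTreeNode resolve(StarTreeNode root, Map<Integer, Long> filters) {
        StarTreeNode node = root;
        while (node != null && !node.isLeaf()) {
            Long requiredValue = filters.get(node.dimensionId);
            node = (requiredValue == null) ? node.starChild : node.children.get(requiredValue);
        }
        return node;
    }
}
```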

POC code

bharath-techie@1d44e1a - OpenSearch code changes

bharath-techie/lucene@c993b9a - Lucene code changes

@bharath-techie
Contributor Author

Benchmark Results

OpenSearch Setup

  • Instance type: Amazon EC2 r5.4xlarge [16 vCPU and 128 GB of RAM]
  • Disk: Amazon EBS volume with 300 GB storage, 125 MiB/s throughput and 3000 IOPS
  • Number of nodes: 1
  • JVM heap: 32 GB

Benchmark Setup

  • Instance type: Amazon EC2 r5.4xlarge
  • Number of nodes generating load: 1

Workload

  • 180 GB of HTTP logs indexed data [around 4 months of data]
  • Document distribution based on status: 140k documents of 400s compared to 1 billion documents of 200s.

Star Tree config

Dimensions

  1. Status
  2. Minute
  3. Hour
  4. Day
  5. Year

Metric

  • Count of requests

Search (primarily Aggregations)

Here is a summary of how Star Tree queries fared against the aggregation queries on present-day indices [mainly against the Points index].

| Query (aggregation) | Median throughput | 50th percentile service time | Comments |
| --- | --- | --- | --- |
| 200s in range count | 8x better performance | 5x faster | 3M matching documents |
| 400s in range count | 40% improvement | 40% | 100 matching documents |
| Total 200s count | 1000x better performance (best case scenario) | 1000x | Best case scenario [1 billion docs] |
| Total 400s count | 1.8x to 2x improvement | 2x improvement | 100k matching docs |
| Hourly histogram | 15x improvement | 15x improvement | |

Observations

  • Queries in Star Tree have bigger performance gains compared to existing solutions when the number of docs matching the query is large.
  • In addition to query performance improvement, there was also significant reduction in page cache utilization for Star Tree queries (30x for 400 count query for example).

Indexing

Given that the aggregation now happens during the indexing, there are tradeoffs w.r.t. storage, indexing throughput, cumulative times during indexing etc. which are captured below.

| Metric | Plain indexing (baseline) | Star Tree indexing (minutely) | Percentage |
| --- | --- | --- | --- |
| Store size | 18.47 GB | 18.55 GB | 0.45% |
| Cumulative indexing time of primary shards | 129.13 min | 131.58 min | -1.82% |
| Median throughput [index append] | 348765.32 docs/s | 342881.75 docs/s | -1.71% |
| 50% latency [index append] | 98.18 ms | 98.24 ms | +0.06% |
| 90% latency [index append] | 131.40 ms | 131.17 ms | -0.17% |
| 99% latency [index append] | 513.44 ms | 569.45 ms | -9% |
| Cumulative merge time of primary shards | 34.18 min | 35.59 min | -3.96% |
| Cumulative refresh time of primary shards | 4.00 min | 4.61 min | -13% |
| Cumulative flush time of primary shards | 6.71 min | 7.31 min | -8.25% |

Observations

  • The write threads were occupied more on flush flows during indexing with Star Tree enabled, so there is some impact on indexing throughput and latency (mainly tail latency).
  • The sort and aggregation of segment records is quite heavy, hence it is done off heap, and it scales linearly with the number of segment records during flush.
  • During merge, the pre-aggregated docs are reused to reconstruct the star tree instead of the original docs, hence it is less expensive compared to flush.

Store size with respect to cardinality

The following captures store size with respect to the cardinality of the timestamp (as the timestamp has one of the highest cardinalities and is present in all scenarios):

  • Store size without Star Tree - 16.6 GB
  • Star Tree index size (sti, std and stm combined)
    • +2 MB - Hourly rounded timestamp [0.01%]
    • +60 MB - Minutely rounded timestamp [0.4 % ]
    • +800 MB - Timestamp ( second ) [ 5% ]

Hence, very high cardinality fields should be avoided as Star Tree dimensions.

@andrross
Member

andrross commented Feb 29, 2024

Only a 1000x performance improvement? :) These are awesome results @bharath-techie! One minor note, I'd encourage you to link any prototype code you have on your user fork in case folks want to dive into the implementation.

Star tree index will be created only on append-only indices, updates or deletes will not be supported.

I'm sure there are lots of deep technical issues to dive into here, but I just want to highlight that to my knowledge we don't actually have a true "append-only index" abstraction in OpenSearch today. Data streams are partially that: only "create" operations are allowed when referencing the data stream entity. However, the underlying indexes that make up a data stream are still exposed to users and allow updates and deletes. I'm not sure if that is more of a side effect of the implementation, or if users truly need the option to update/delete documents on a limited basis even for a log analytics-style workload.

@msfroh
Collaborator

msfroh commented Feb 29, 2024

I'm sure there are lots of deep technical issues to dive into here, but I just want to highlight that to my knowledge we don't actually have a true "append-only index" abstraction in OpenSearch today. Data streams are partially that: only "create" operations are allowed when referencing the data stream entity. However, the underlying indexes that make up a data stream are still exposed to users and allow updates and deletes. I'm not sure if that is more of a side effect of the implementation, or if users truly need the option to update/delete documents on a limited basis even for a log analytics-style workload.

We should be able to apply the optimization on a per-segment basis, and fall back to the classic behavior on segments with deletes, right @bharath-techie? Hopefully not all segments would have deletes. Hopefully unmerged deletes would be a transient issue (especially now that it's safe to expunge deletes).

Failing that, I kicked off a thread on lucene-dev to discuss the idea of making deletes cheaply iterable. In that case, we could use the star tree to get the aggregations ignoring deletions, then step through the deleted docs to decrement them from the relevant buckets.

@bharath-techie
Contributor Author

bharath-techie commented Mar 1, 2024

Hi @andrross @msfroh ,
Thanks for the comments.
Once the star tree has pre-computed the aggregations, we go through the star tree index to fetch results for the relevant queries. Document deletions after that point are not accounted for, hence results will be inaccurate; that's the current limitation.

We can consider both approaches mentioned by @msfroh to address this gap

  • Use traditional aggregations if the segment has deletes, and use star tree aggregations otherwise.
    • This will work, but it might not be a trivial implementation, so we can probably take this up as a follow-up after the initial implementation.
  • Iterate through deleted docs and negate them in the relevant buckets (see the sketch below).
    • This too works well; the only issue I see is that the cost of recomputing is directly relative to the number of deleted docs. If the number is quite high, then it'll cause slowdowns.
    • We also need to check whether certain types of aggregations can be derived from the negated data. The initial supported set (sum, avg, count, max, min) can be derived; other to-be-explored aggregations need to be checked on a case-by-case basis.
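For the second approach, a minimal sketch of what 'negating' a deleted document could look like for sum/count-style buckets (avg follows as sum/count); this is illustrative only and does not cover min/max:

```java
public class DeleteAdjustmentSketch {
    // One pre-aggregated bucket for a dimension combination: sum and count of a metric.
    static final class Bucket {
        double sum;
        long count;
    }

    // Subtract a deleted document's contribution from its bucket; avg can then be
    // derived as sum / count. Min/max are intentionally not handled here.
    static void applyDelete(Bucket bucket, double deletedMetricValue) {
        bucket.sum -= deletedMetricValue;
        bucket.count -= 1;
    }
}
```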

Updated the POC code links in prototype section.

@msfroh
Collaborator

msfroh commented Mar 1, 2024

Out of curiosity, @bharath-techie, how does this compare to using a multidimensional points field?

Your first dimension could be status code and your second dimension could be the timestamp. Traversing the BKD tree, I think you could also get some counts quickly. (Not as quickly as in the 1-D case, since you'd need to visit more "partial" leaves, but you could probably fix that by quantizing/rounding the timestamp dimension as you did here.)

@backslasht
Contributor

backslasht commented Mar 4, 2024

I'm sure there are lots of deep technical issues to dive into here, but I just want to highlight that to my knowledge we don't actually have a true "append-only index" abstraction in OpenSearch today. Data streams are partially that: only "create" operations are allowed when referencing the data stream entity. However, the underlying indexes that make up a data stream are still exposed to users and allow updates and deletes. I'm not sure if that is more of a side effect of the implementation, or if users truly need the option to update/delete documents on a limited basis even for a log analytics-style workload.

That's timely! @mgodwan and I were recently discussing about context based/aware indices where the restriction of no updates can be enforced. More details can be found at #12683.

@bharath-techie
Contributor Author

Out of curiosity, @bharath-techie, how does this compare to using a multidimensional points field?

Based on tests against a 1D points field [status aggregation, for example], BKD traversal is quite fast; the high latency is always because of the number of result-set documents that OpenSearch needs to aggregate upon.

So I suspect that the same will be applicable for the 2d points as well.

@gashutos
Contributor

gashutos commented Mar 5, 2024

That's timely! @mgodwan and I were recently discussing about context based/aware indices where the restriction of no updates can be enforced. More details will be added in the RFC (TODO: link it here once created).

Segment merge for the star tree index is lightweight, since it mostly has to do summation/subtraction. For deleted documents, can we create another 'deleted' star tree index? We can do frequent merges on that index every couple of minutes since it is lightweight. @backslasht @mgodwan @bharath-techie what do you think on that?
I am strictly speaking from the view of the star tree here.

@bharath-techie
Contributor Author

Segment merge for the star tree index is lightweight, since it mostly has to do summation/subtraction. For deleted documents, can we create another deleted star tree index?

The cost depends on the cardinality of the dimensions and the number of deleted documents in the segment. In worst-case scenarios, 'deleted star tree index' creation might take a lot of time. Also, I'm a bit unclear on where we'd create the new ST index, as we can't modify the existing segments?

Deletes are also discussed here; one of the approaches has similar cons when the number of deleted docs is too high.

@msfroh
Collaborator

msfroh commented Mar 20, 2024

Based on tests against 1d points field [ status aggregation for example ] , BKD traversal is quite fast , the high latency is always because of the number of resultset documents that OpenSearch needs to aggregate upon.

For 1D points, we can aggregate via Weight.count(), which is what we've done to improve performance of date histograms (see #9310). It doesn't need to collect the individual documents (except over the two "edge leaves"). So, it mostly does reduce to a BKD traversal (plus up to 2 leaf traversals, visiting at most 1024 docs).

So I suspect that the same will be applicable for the 2d points as well.

For 2D points, Weight.count() can't guarantee O(1) computation; there's no guarantee that the number of "edge leaves" is small (since you're not talking about leaves that intersect the endpoints of your range, but rather the number of rectangles that intersect the perimeter of your rectangle). Still, we might get good performance by summing up counts from the middle of the rectangle and then visiting the leaves that overlap the perimeter.

The difference with the star tree seems to be that the star tree is precomputed to define the boundaries of individual rectangles (e.g. quantizing by hour). I'm wondering if we could inject some quantization (i.e. potential additional levels) into the BKD structure to achieve the same thing.

@bharath-techie
Contributor Author

For 2D points, Weight.count() can't guarantee O(1) computation, there's no guarantee that the number of "edge leaves" is small (since you're not talking about leaves who intersect the endpoints of your range, but rather the number of rectangles that intersect the perimeter of your rectangle). Still, we might get good performance by summing up counts from the middle of the rectangle and then visiting the leaves that overlap the perimeter.
I'm wondering if we could inject some quantization (i.e. potential additional levels) into the BKD structure to achieve the same thing.

Thanks @msfroh for detailed context. I'll do one round of benchmarks with 2d points with status and rounded off timestamp and capture how it fares.

@jainankitk
Collaborator

jainankitk commented May 17, 2024

@bharath-techie - Thank you for writing this detailed proposal. I really like the scale of potential improvements with this index, but I'm also wondering if the use case is generic enough to warrant a separate index. I might be wrong, since I'm yet to go through the paper, but some considerations:

  • While the current limitation around deleted documents is okay, the star tree index cannot have any deletes over its lifetime. Because once a segment has a deletion, the star tree is invalid, and there is no good merge path to fix it. Some of the other optimizations for date histogram / terms aggregation also don't work for segments with deleted documents, but if the deleted documents are expunged upon merge, they can again be leveraged. Isn't it possible to account for deleted documents within the star tree index during segment merge? Also, are we planning to have a setting for disabling any updates/deletes completely if the customer has the star tree index enabled?
  • OpenSearch has many different ways of slicing the buckets for aggregations (weekly, quarterly, auto date histogram where the boundary might not be at minute/hourly granularity), hence I'm not sure if minute/hour/day/year suffice. Also, we have already done a lot of improvements for aggregation use cases. For example - hourly aggregation is already 100x faster. So, I would love to see the benchmark with the latest version 2.14 if possible.
  • We already have 4 dimensions for date, which is a common field in most documents; the moment I add a few other fields, the size of the tree blows up and the gains start reducing. Also, if the top level is status code and I want to do an aggregation on just year, I need to aggregate multiple tree branches, which might still be less than iterating over all the documents, but a lot. Probably the reason the hourly histogram is only 15x (lol! :P); daily and yearly histograms would be even less performant.

That being said, this is really good stuff, given it is not the default (customers need to explicitly enable it), and there would be some stats / multi-dimensional aggregation workloads.

@bharath-techie
Contributor Author

Thanks for the detailed feedback @jainankitk. Let me try to address some of them.

Isn't it possible to account for deleted document within the star tree index during segment merge?

We've discussed some possible approaches for this here. Each has a cost attached to it; we need to see which is better overall.

Also, are we planning to have setting for disabling any updates/deletes completely if the customer has star tree index enabled?

There is an RFC which discusses this.

I'm not sure if minute/hour/day/year suffice

This is just a case for the POC; we can extend it to all the buckets search currently supports. For instance, the Hour bucket in the POC gives the same buckets as the hourly histogram - that was the intention, which can be extended to all the other granularities.

We already have 4 dimensions for date which is common field in most documents, the moment I add few other fields the size of tree blows up and gains start reducing

Good point. We've not locked in the implementation for date - we could also go with just supporting, say, 'minute' and then deriving higher time aggregations during query time if it's performant, and we can also have multiple star trees for different time histograms.
The size/performance of the tree depends more on the cardinality of the fields than on the number of fields (more fields with low cardinality still worked well during our benchmarks).
But your point is valid; we will try to come up with approximate estimates of the resultant size and performance of the star tree based on cardinality and the number of fields. (This is something I got as feedback during multiple discussions as well.)
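As an illustration of the kind of rough estimate mentioned above (not the project's sizing model), the number of pre-aggregated star tree documents in a segment is bounded both by the segment's document count and by the product of the per-segment dimension cardinalities:

```java
public class StarTreeSizeEstimateSketch {
    // Illustrative upper bound only: the number of pre-aggregated star tree documents
    // in a segment can exceed neither the segment's document count nor the product of
    // the per-segment dimension cardinalities.
    static long maxAggregatedDocs(long segmentDocCount, long[] dimensionCardinalities) {
        long product = 1;
        for (long cardinality : dimensionCardinalities) {
            try {
                product = Math.multiplyExact(product, cardinality);
            } catch (ArithmeticException overflow) {
                return segmentDocCount; // product overflowed, so the doc count dominates
            }
        }
        return Math.min(segmentDocCount, product);
    }
}
```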

For example - hourly aggregation is already 100x faster. So, would love to see the benchmark with latest version 2.14 is possible.

I did another round of benchmarks on 2.14 - date histograms are indeed blazing fast 🔥 But when we couple other aggregations together - say with status terms or sum aggs etc. - then it's again slow, as the performance is proportional to the number of documents collected/traversed via DocValues. This is discussed to some extent previously in the same RFC (as part of the weight.count optimization).

Benchmarks done on Amazon AWS EC2 instance of r5.4x large on 2.14

Plain query

| Query (aggregation) | Median throughput | 90th percentile latency |
| --- | --- | --- |
| Hourly histogram with status terms | 0.21 ops/s | 4.75 seconds |
| Monthly histogram with status terms | 0.28 ops/s | 3.65 seconds |
| Yearly histogram with status terms | 0.28 ops/s | 3.65 seconds |

Star tree query

Config:
minute, hour, month, year, clientIp (1.1 million cardinality), status

| Query (aggregation) | Median throughput | 90th percentile latency |
| --- | --- | --- |
| Hourly histogram with status terms | 3.18 ops/s | 329 ms |
| Monthly histogram with status terms | 3.46 ops/s | 306 ms |
| Yearly histogram with status terms | 3.51 ops/s | 299 ms |

Our OpenSearch/Lucene queries are already very optimized, but for most complex aggregations, efficiency will be capped by the number of docs to traverse/collect, which the star tree solves.

The RFC so far doesn't capture results for terms / multi-terms aggregations (we need to address some gaps like top-n aggs, handling high-cardinality terms, etc.), the cases we found to be the most performant.

@jainankitk
Collaborator

@bharath-techie - Thank you for patiently addressing my comments and sharing some numbers with multi-field aggregation.

Good point. We've not locked in the implementation for date - we could also go with just supporting say 'minute' and then deriving higher time aggregations during query time if its performant

We can't go about storing just 'minute', right? Because then it needs to be a separate bucket for every minute of every different hour, day and year. I might be wrong since my understanding of the star tree is limited.

and also we can have multiple star trees for different time histograms.

Yeah, this is an option. But again, we need to be careful about how much we are asking the customers to configure. Ideally, we would like to have workload-tailored indices for the most optimal performance, but we chose BKD and term dictionaries because they work for most use cases and customers need not worry about them.

Our opensearch/lucene queries are already so optimized, but for most of the complex aggregations, efficiency will be capped based on the amount of docs to traverse/collect, which the star tree solves.

+1. Iterating through each of the documents is indeed very slow! :(

This is a just a case for POC, we can extend it to all buckets search currently supports. For instance, Hour bucket in POC gives same buckets as hourly histogram, that was the intention which can be extended to all the other granularities.

Maybe we can try a POC covering some or most of the date histogram aggregations.

@bharath-techie
Contributor Author

We can't go about storing just 'minute' right? Because then it needs to be separate bucket for every minute for different hour, day, year.

Not sure I fully understand, but to elaborate on the idea - during query time, we can get the docs of all the nodes of the 'minute' dimension, and during collection of the docs as part of the 'aggregation' we bucketize them as per the histogram requested.
The tradeoff is higher query latency, but the user can save storage space (as we need to go through all 'minute' dimension nodes and hence collect/traverse a higher number of docs).

we need to be careful about how much are we asking the customers to configure

Yeah, +1. We will integrate with templates initially, and also other such mechanisms, to suggest the best configuration to users. And multiple star trees is something we will incrementally expose to customers, with the base functionality already built to support them.

we can try POC covering some or most of the date histogram aggregations.

Do you see any challenges? If rounding off the timestamp while building the star tree index uses the same logic as the query, we should be able to do it, right?

@jainankitk
Collaborator

jainankitk commented May 20, 2024

Not sure I fully understand but to elaborate on the idea - during query time, we can get docs of all the nodes of 'minute' dimension and during collection of the docs as part of 'aggregation' we bucketize it as per the histogram requested.

I am assuming you want to have, say, only status and minute dimensions from the above example. In that case, the buckets for the minute dimension cannot be only 0-59 or 1-60, because you need to further qualify each of those minutes into hours/days. Else, how can I answer a query like a minutely aggregation on a range query from 2024-01-01 to 2024-01-02 without iterating over all the documents?

@bharath-techie
Contributor Author

In that case, the buckets for minute dimension cannot be only 0-59 or 1-60.

Ah, I get your question. I'm not mapping the minute to 0-59, but rather rounding off the epoch millis/seconds to the epoch minute.
We'll then be able to round off the epoch minute to the epoch hour or higher time dimensions during aggregation and bucketize as per the query, as sketched below.
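Since an epoch value rounded to the minute can always be rounded again to a coarser unit, minute-level star tree buckets can be re-bucketed into the interval requested at query time. A rough sketch (hypothetical helper; pre-aggregated counts keyed by minute-rounded epoch millis):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.HashMap;
import java.util.Map;

public class ReBucketSketch {
    // Roll minute-level buckets (key: minute-rounded epoch millis, value: pre-aggregated
    // count) up into hourly buckets by rounding each key once more and summing counts.
    static Map<Long, Long> minuteToHourBuckets(Map<Long, Long> minuteCounts) {
        Map<Long, Long> hourCounts = new HashMap<>();
        for (Map.Entry<Long, Long> entry : minuteCounts.entrySet()) {
            long hourKey = Instant.ofEpochMilli(entry.getKey())
                    .truncatedTo(ChronoUnit.HOURS)
                    .toEpochMilli();
            hourCounts.merge(hourKey, entry.getValue(), Long::sum);
        }
        return hourCounts;
    }
}
```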

Code reference in star tree

Rounding is directly based on this in date histogram. In fact, we can try reusing this same class if possible.

Apart from this, I think there is a way to give a fixed interval as well in the histogram, which we'll need to evaluate.

@jainankitk
Collaborator

Ah I get your question , I'm not making the minute to 0-59 but rather I'm rounding off the epoch millis/seconds to epoch minute.

But doesn't that make the cardinality "too high"? Just thinking about a few months of / a year's worth of log data, the cardinality becomes ~100k for 2 months.

Rounding directly based on this in datehistogram . In fact we can try reusing this same class if possible.

Rounding is a slightly expensive operation, though still much faster than individual document iteration. Maybe we can have multi-range traversal logic on the star tree similar to apache/lucene#13335, assuming we can traverse the minutely rounded epochs in sorted order.

@sarthakaggarwal97
Contributor

sarthakaggarwal97 commented May 21, 2024

But doesn't make the cardinality "too high". Just thinking about few months/yearly log data, the cardinality becomes ~100k for 2 months

A lot of users with time-series-type workloads make a per-day index, something like logs-24-11-2024. Since each index will have its own star tree for its set of dimensions/metrics, it will come down to the query part to handle these aggregations. Also, I feel minutes is one option, but we can explore different ideas to handle queries at different time granularities.

I think, from where @bharath-techie is coming from, we can still sort of control the cardinality if, in the dimension split order, we keep the timestamp at a higher level than the other dimensions.

Something like this: timestamp -> status -> client_ip. If we do it the other way round (timestamp last), we would see a lot more branching.
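One way to compare candidate split orders is to count the distinct dimension-value prefixes each ordering produces over the same records, which approximates the number of tree nodes (and hence branching) it would create; a hypothetical helper:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitOrderSketch {
    // Count the distinct prefixes, level by level, that a given split order produces.
    // `records` holds one long[] per document, with dimension values laid out in the
    // chosen split order; a larger total means more nodes / more branching.
    static long countPrefixes(List<long[]> records) {
        if (records.isEmpty()) {
            return 0;
        }
        int depth = records.get(0).length;
        long total = 0;
        for (int level = 1; level <= depth; level++) {
            Set<List<Long>> prefixes = new HashSet<>();
            for (long[] rec : records) {
                List<Long> prefix = new ArrayList<>();
                for (int i = 0; i < level; i++) {
                    prefix.add(rec[i]);
                }
                prefixes.add(prefix);
            }
            total += prefixes.size();
        }
        return total;
    }
}
```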

@bharath-techie
Contributor Author

bharath-techie commented May 21, 2024

Yes, +1 - choosing the right dimensionSplitOrder controls the storage size.

But most importantly, in a particular segment, what will the timestamp's cardinality be? For log/time-series use cases, as long as we have efficient merge policies (example: context-aware segments), most likely we will only have a few hours of data in a segment.
So even if an index spans several months, in each segment the cardinality of the timestamp field will be quite low.

Maybe we can have multi range traversal logic on star tree similar to apache/lucene#13335,

Right now the 'star tree' will act as the lead iterator for the query - so far in the design, we have star tree + doc values. So once the star tree gives a set of documents for the dimensions which are in the tree, we further filter in doc values if there are any other dimensions in the filter (not part of the star tree as per the maxLeafDocs threshold).

Some time back, we evaluated the idea of using the point tree as well for range queries alongside the star tree, but the point tree works better as a lead iterator, right?

@jainankitk
Collaborator

Something like this: timestamp -> status -> client_ip. If we do it the other way round (timestamp last), we would see a lot more branching.
Yes +1 , choosing the right dimensionSplitOrder controls the storage size.

I am not completely sure what the right branching is, especially between timestamp/status. For example, if we have a uniform distribution of status across timestamps, won't we have an equal number of buckets irrespective of the order of timestamp/status? Also, maybe status first is a better idea because you can quickly answer count queries on just the status code? Else you need to go through all the timestamp values?

Sometime back, we evaluated the idea of using point tree as well for range queries alongside star tree but point tree works well as a lead iterator rather right ?

Did we evaluate the case of the query and aggregation being on the same field, or even different ones?

But most importantly, in a particular segment - what will be the timestamp's cardinality ? For log/time series use cases, as long as we have efficient merge policies ( #13183 ) , most likely we will only have few hours of data in a segment.

Sorry, my bad - I overlooked the generation being at the segment level while doing my analysis.

@bharath-techie
Contributor Author

bharath-techie commented May 24, 2024

Also, maybe status first is better idea because you can quickly answer the count queries on just status code? Else you need to go through all the timestamp values?

The paper highly recommends ordering the dimensions from highest cardinality to lowest cardinality for better pruning during queries.

That said, we did benchmarks with both during the POC; with status first we noticed a significantly higher storage size (> 70% if I remember correctly, since the timestamp, which has high cardinality, is present across each status).

And from a query perspective, status first indeed performed better for status-specific aggs [without any range filter], but for range queries and date histograms we either found timestamp first to be faster or within 5-10% of status-first performance.

| Query | Status first throughput | Timestamp first throughput |
| --- | --- | --- |
| 200s in range | 91.74 ops/s | 134.45 ops/s |
| 400s in range | 154.12 ops/s | 139.19 ops/s |
| 400s-agg | 186 ops/s | 112.82 ops/s |
| 200s-agg | 177 ops/s | 85.45 ops/s |

This is something we will continue to tune based on benchmarks / load testing before we roll out to production.

Note: the OSB benchmarks provide out-of-order data, so the timestamp has super high cardinality in all segments. When the timestamp has lower cardinality, similar to production time-series workloads, or when coupled with context-aware segments for example, we can ideally put the user dimensions at the top and the timestamp columns below them.

Did we evaluate for the case of query and aggregation being on same field or even different ones?

We usually evaluate with the query and aggs being on different fields, but I get where you're coming from based on the histogram improvements.
We just bounced around ideas and didn't get around to any implementation, because we found star tree traversal to be decently fast - especially on a query over multiple dimensions - and the latency again mainly came from doc values collection.

@jainankitk
Collaborator

This is something we will continue to tune based on benchmarks / load testing before we rollout to production.

Agreed. Needs more experimentation.

Note : the OSB benchmarks provide out of order data , so timestamp has super high cardinality in all segments. Probably when timestamp has lower cardinality similar to production time-series workloads or when coupled with context aware segments for example, we ideally can put the user dimensions at top and timestamp columns below it.

Good point, I keep forgetting about that. I am assuming that is because the sorted input data is partitioned across multiple threads?

@bharath-techie
Contributor Author

Good point, I keep forgetting about that. I am assuming that is because the sorted input data is partitioned to multiple threads?

Yes, that's right. But it looks like folks run multiple instances of the benchmark with numclients:1 to bypass this limitation. So far, all benchmarks were done without such modifications.

@bharath-techie
Contributor Author

We are starting with the implementation of this RFC, and the meta issue tracks the list of issues and milestones planned for the near future.

Please feel free to contribute / review / comment on this.
