
Possible regression of date histogram starting from 2.12 #13171

Closed · bowenlan-amzn opened this issue Apr 11, 2024 · 1 comment · Fixed by #13317
Labels: bug, Performance, Search:Aggregations
bowenlan-amzn commented Apr 11, 2024

GitHub issue: #13087
pmc workload: https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/pmc

Regression query
Note: by default, the request cache is disabled on the target index.

```json
{
  "name": "articles_monthly_agg_uncached",
  "operation-type": "search",
  "body": {
    "size": 0,
    "aggs": {
      "articles_over_time": {
        "date_histogram": {
          "field": "timestamp",
          "calendar_interval": "month"
        }
      }
    }
  }
}
```

Reproduce the regression

1 r6g.xlarge node, 5 primary shards, 0 replicas (5p0r)
Using the OpenSearch Benchmark tool

Reran 3 times with relatively consistent results. One example result (2.12):

| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Min Throughput | articles_monthly_agg_uncached | 14.77 | ops/s |
| Mean Throughput | articles_monthly_agg_uncached | 14.79 | ops/s |
| Median Throughput | articles_monthly_agg_uncached | 14.79 | ops/s |
| Max Throughput | articles_monthly_agg_uncached | 14.8 | ops/s |
| 50th percentile latency | articles_monthly_agg_uncached | 10604.5 | ms |
| 90th percentile latency | articles_monthly_agg_uncached | 12003.5 | ms |
| 99th percentile latency | articles_monthly_agg_uncached | 12364 | ms |
| 100th percentile latency | articles_monthly_agg_uncached | 12400.4 | ms |
| 50th percentile service time | articles_monthly_agg_uncached | 67.1243 | ms |
| 90th percentile service time | articles_monthly_agg_uncached | 71.8665 | ms |
| 99th percentile service time | articles_monthly_agg_uncached | 76.3119 | ms |
| 100th percentile service time | articles_monthly_agg_uncached | 78.9056 | ms |
| error rate | articles_monthly_agg_uncached | 0 | % |

For 2.11, the result is:

| Metric | Task | Value | Unit |
| --- | --- | --- | --- |
| Min Throughput | articles_monthly_agg_uncached | 20.01 | ops/s |
| Mean Throughput | articles_monthly_agg_uncached | 20.01 | ops/s |
| Median Throughput | articles_monthly_agg_uncached | 20.01 | ops/s |
| Max Throughput | articles_monthly_agg_uncached | 20.01 | ops/s |
| 50th percentile latency | articles_monthly_agg_uncached | 18.1556 | ms |
| 90th percentile latency | articles_monthly_agg_uncached | 18.5772 | ms |
| 99th percentile latency | articles_monthly_agg_uncached | 18.6992 | ms |
| 100th percentile latency | articles_monthly_agg_uncached | 18.8191 | ms |
| 50th percentile service time | articles_monthly_agg_uncached | 17.4722 | ms |
| 90th percentile service time | articles_monthly_agg_uncached | 17.5867 | ms |
| 99th percentile service time | articles_monthly_agg_uncached | 17.7151 | ms |
| 100th percentile service time | articles_monthly_agg_uncached | 17.9092 | ms |

Note: the service time corresponds to the query `took` time.

Impact

The code for the optimization was released in 2.12.
The pre-conditions for this optimization are:

  • the date histogram has no parent or sub-aggregation
  • the query is effectively a match-all, or a range query on the same date field
  • the total bucket count is < 1024

Profiling results

Using async-profiler, profiled for 60s while sending the query in a `while true` loop.

2.11

https://github.com/bowenlan-amzn/file-share/blob/75bc67384e4c9fe83792ed1f41b7d49e95768281/211flamegraph.html (GitHub issues don't support HTML uploads; you may need to download the file to view it)

2.12

https://github.com/bowenlan-amzn/file-share/blob/75bc67384e4c9fe83792ed1f41b7d49e95768281/212flamegraph.html

Qualitative Analysis

By default, the aggregation iterates over the matching documents, reads the value of the aggregated field from DocValues, and bucketizes it. From the 2.11 profiling result, the time-consuming parts are reading the doc value, rounding the value, and adding it to the result.

In our optimization, we don't iterate doc values. Instead, we create the bucket ranges first, then use a PointRangeQuery per bucket to get its doc count. From the 2.12 profiling result, the time-consuming part is the point counting of PointRangeQuery, and inside it the visitDocValues method takes most of the time. visitDocValues iterates the points of a leaf node in the BKD tree, and it is the heaviest part of using the BKD tree.
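For reference, here is a minimal, illustrative sketch (not the actual OpenSearch code) of what this per-bucket counting boils down to with Lucene's PointValues API. The visitDocValues hotspot in the flame graph corresponds to the `visit(docID, packedValue)` callback, which runs for every point in a leaf cell that only partially overlaps the range:

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PointValues;

// Illustrative sketch only: count the documents whose 1-D point value
// falls in [min, max] by intersecting the BKD tree.
final class PointRangeCounter {

    static long countPointsInRange(LeafReader reader, String field,
                                   byte[] min, byte[] max) throws IOException {
        PointValues points = reader.getPointValues(field);
        if (points == null) {
            return 0;
        }
        final long[] count = new long[1];
        final int len = min.length; // bytes per dimension (8 for a long field)
        points.intersect(new PointValues.IntersectVisitor() {
            @Override
            public void visit(int docID) {
                // Leaf cell fully inside the range: counted without reading values (cheap).
                count[0]++;
            }

            @Override
            public void visit(int docID, byte[] packedValue) {
                // Leaf cell crosses a range boundary: every point in the cell must be
                // read and compared -- this is the expensive visitDocValues path.
                if (Arrays.compareUnsigned(packedValue, 0, len, min, 0, len) >= 0
                        && Arrays.compareUnsigned(packedValue, 0, len, max, 0, len) <= 0) {
                    count[0]++;
                }
            }

            @Override
            public PointValues.Relation compare(byte[] minPacked, byte[] maxPacked) {
                if (Arrays.compareUnsigned(maxPacked, 0, len, min, 0, len) < 0
                        || Arrays.compareUnsigned(minPacked, 0, len, max, 0, len) > 0) {
                    return PointValues.Relation.CELL_OUTSIDE_QUERY;
                }
                if (Arrays.compareUnsigned(minPacked, 0, len, min, 0, len) >= 0
                        && Arrays.compareUnsigned(maxPacked, 0, len, max, 0, len) <= 0) {
                    return PointValues.Relation.CELL_INSIDE_QUERY;
                }
                return PointValues.Relation.CELL_CROSSES_QUERY;
            }
        });
        return count[0];
    }
}
```

Running this once per bucket range is what makes the number of boundary (crossing) leaf cells the dominant cost factor.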

Considering the aggregation over a single segment, first define some terms for the derivation below:

N_R: number of ranges (buckets)
N_C: number of leaf nodes that need visitDocValues
N_Leaf: total number of leaf nodes that contain matching documents
N_Doc: total number of matching documents

The approximate time cost of the default method:

T_default = C_default × N_Doc

where C_default is the cost of handling one document in the default method (DocValues iteration); the heavy part is reading the doc value from the file and adding it to the result.

The approximate time cost of the optimized method:

T_optim = C_optim × (N_C / N_Leaf) × N_Doc

where C_optim is the cost of handling one document in the optimized method (BKD traversal); the heavy part is reading point values from the leaf nodes and counting them into the result efficiently.

The problem is to decide when to use the optimized method.
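Comparing the two cost formulas gives a simple break-even condition (a direct rearrangement of the expressions above, not an exact model):

```math
T_{optim} < T_{default} \iff C_{optim} \cdot \frac{N_C}{N_{Leaf}} \cdot N_{Doc} < C_{default} \cdot N_{Doc} \iff \frac{N_C}{N_{Leaf}} < \frac{C_{default}}{C_{optim}}
```

So the optimized method should win whenever the fraction of leaf nodes that need visitDocValues stays below the DocValues/BKD cost ratio. The paragraphs below look at each side of this inequality in turn.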

The first important element is N_C. N_C is related to N_R: generally, more ranges with smaller intervals lead to more leaf-node visitDocValues calls. Since the ranges rewritten from the date histogram are connected to each other, we probably traverse many leaf nodes twice, while ideally we should traverse each only once. I am looking into a way to avoid running the range queries one by one and instead do a single BKD traversal for all ranges (a rough sketch of the idea follows). We cannot avoid visiting some leaf nodes when using the optimized method, so estimating N_C/N_Leaf is the key. In the 1-D scenario, assuming a multi-range traversal, N_C is bounded by N_R + 1.
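To make the single-traversal idea concrete, here is a rough, hypothetical sketch (not the eventual #13317 implementation) of a visitor that bucketizes all connected ranges in one pass. It assumes the 1-D BKD traversal encounters cells and leaf points in ascending value order, which holds for Lucene's 1-D layout; a real implementation would also advance the bucket pointer inside compare():

```java
import java.util.Arrays;

import org.apache.lucene.index.PointValues;

// Hypothetical sketch: single-pass multi-range counting for a 1-D field.
// Ranges are the sorted, connected [lower, upper) buckets produced by the
// date histogram rewrite, so one traversal can replace one PointRangeQuery
// per bucket (which visits boundary leaf nodes twice).
final class MultiRangeVisitor implements PointValues.IntersectVisitor {

    private final byte[][] lowers; // sorted; lowers[i + 1] == uppers[i]
    private final byte[][] uppers;
    private final long[] counts;   // counts[i] = matching docs in bucket i
    private final int len;         // bytes per dimension
    private int current = 0;       // bucket currently being filled

    MultiRangeVisitor(byte[][] lowers, byte[][] uppers, int bytesPerDim) {
        this.lowers = lowers;
        this.uppers = uppers;
        this.counts = new long[lowers.length];
        this.len = bytesPerDim;
    }

    @Override
    public void visit(int docID) {
        // Whole leaf cell inside the current bucket: no value reads needed.
        counts[current]++;
    }

    @Override
    public void visit(int docID, byte[] packedValue) {
        // Values arrive in ascending order in the 1-D case, so the bucket
        // pointer only ever moves forward.
        while (current < uppers.length - 1 && cmp(packedValue, uppers[current]) >= 0) {
            current++;
        }
        if (cmp(packedValue, lowers[current]) >= 0 && cmp(packedValue, uppers[current]) < 0) {
            counts[current]++;
        }
    }

    @Override
    public PointValues.Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
        if (cmp(maxPackedValue, lowers[0]) < 0
                || cmp(minPackedValue, uppers[uppers.length - 1]) >= 0) {
            return PointValues.Relation.CELL_OUTSIDE_QUERY;
        }
        // Simplified: only recognizes cells fully inside the current bucket;
        // anything else falls back to the per-point path above.
        if (cmp(minPackedValue, lowers[current]) >= 0
                && cmp(maxPackedValue, uppers[current]) < 0) {
            return PointValues.Relation.CELL_INSIDE_QUERY;
        }
        return PointValues.Relation.CELL_CROSSES_QUERY;
    }

    private int cmp(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, 0, len, b, 0, len);
    }

    long[] counts() {
        return counts;
    }
}
```

With this scheme, only the leaf cells sitting on bucket boundaries need visitDocValues, which is where the N_C ≤ N_R + 1 bound comes from.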

Now the question becomes how to estimate N_Leaf. In a segment-level match-all scenario, N_Leaf is about maxDoc/512 for the segment. If the top-level query is a range query, the N_Leaf calculation is not as straightforward. There is an existing lightweight method on PointValues, estimatePointCount, that tells how many points fall in a range. Similarly, we could write a custom intersector that tells how many leaf nodes fall into the top-level range query. Since this only involves tree traversal, not leaf-node visitDocValues, I don't see a performance issue here.
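A minimal sketch of both estimates, assuming the Lucene default of 512 points per BKD leaf node (BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE); `estimateNLeafForRange` is a hypothetical helper, not existing OpenSearch code:

```java
import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PointValues;

final class LeafEstimates {

    // Lucene's default max points per BKD leaf node.
    private static final int POINTS_PER_LEAF = 512;

    // Match-all case: every leaf node holds matching docs.
    // E.g. a segment with maxDoc = 1,000,000 gives about 1,954 leaf nodes.
    static long estimateNLeafMatchAll(LeafReader reader) {
        return (reader.maxDoc() + POINTS_PER_LEAF - 1) / POINTS_PER_LEAF;
    }

    // Range-query case: reuse Lucene's estimatePointCount, which walks the
    // tree without visiting leaf values, then convert points to leaf nodes.
    static long estimateNLeafForRange(LeafReader reader, String field,
                                      PointValues.IntersectVisitor rangeVisitor) throws IOException {
        PointValues points = reader.getPointValues(field);
        if (points == null) {
            return 0;
        }
        long matchingPoints = points.estimatePointCount(rangeVisitor);
        return (matchingPoints + POINTS_PER_LEAF - 1) / POINTS_PER_LEAF;
    }
}
```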

On the other hand, I distinguish C_default and C_optim because they may differ noticeably. Currently, IndexOrDocValuesQuery assumes C_default = 8 × C_optim, meaning doc values are considered 8 times as costly as the BKD tree when verifying the matched documents of the lead iterator in a conjunction. The ratio is probably different in our case, since the logic here is very different; we will run experiments to measure it.

First Experiment

The pmc dataset has 123 months (2006/2/1 ~ 2016/4/1) of data.
Based on the approximate cost formulas, the monthly aggregation gives the optimized method a high N_R. Since the range queries run one by one and the time field is 1-dimensional, an approximation of N_C is 2 × N_R (each range query can visit both of its boundary leaf nodes). So if we decrease N_R by increasing the interval of the aggregation, the performance of the optimized method should improve.

I increased the interval of the aggregation from month to 180d, year, and 1000d, and manually ran each query multiple times against the cluster. Query took time in ms:

| took (ms) | month | 180d | year | 1000d |
| --- | --- | --- | --- | --- |
| default | 16 | 11 | 12 | 10 |
| optimized | 63 | 16 | 8 | 4 |
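As a sanity check against the cost model, the bucket counts for these intervals over the ~3,700-day (123-month) pmc time span are roughly:

```math
N_R(\text{month}) = 123, \quad N_R(\text{180d}) \approx 21, \quad N_R(\text{year}) = 11, \quad N_R(\text{1000d}) \approx 4
```

The crossover where the optimized method starts to win sits somewhere between ~21 and 11 buckets for this workload, which helps explain the choice of 24 as a relatively safer threshold in the short-term fix below.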

Fix Plan

  1. Short Term

Update the current threshold 1024 to 24, which is a relatively safer number for the pmc workload.
Add a cluster setting that can be used to easily turn the optimization off (a sketch of such a setting follows this list). #13179

  2. Long Term

Develop a smarter BKD traversal that traverses the tree only once for any number of bucket ranges and counts the points in a lightweight way. #13317
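For the cluster setting, a minimal sketch of what a dynamic toggle could look like using OpenSearch's Setting API; the setting name here is made up for illustration (see #13179 for the actual change):

```java
import org.opensearch.common.settings.Setting;

// Sketch of a dynamic cluster setting to toggle the optimization.
public final class AggregationSettings {

    public static final Setting<Boolean> DATE_HISTOGRAM_REWRITE_ENABLED =
        Setting.boolSetting(
            "search.aggregations.date_histogram.rewrite_filter.enabled", // hypothetical key
            true,                     // optimization enabled by default
            Setting.Property.Dynamic, // updatable without a node restart
            Setting.Property.NodeScope
        );
}
```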

@bowenlan-amzn bowenlan-amzn converted this from a draft issue Apr 11, 2024
@bowenlan-amzn bowenlan-amzn self-assigned this Apr 11, 2024
@bowenlan-amzn bowenlan-amzn added bug Something isn't working Search:Aggregations Performance This is for any performance related enhancements or bugs and removed untriaged labels Apr 11, 2024
bowenlan-amzn added a commit to bowenlan-amzn/OpenSearch that referenced this issue Apr 13, 2024
Short term solution for pmc workload regression opensearch-project#13171

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
@rishabh6788 (Contributor) commented:

As mentioned, we are also seeing a huge spike in the date_histogram_hourly_agg query in the big5 workload.
[Screenshot attached: 2024-04-18]
