
[BUG] Performance regression in top_hits aggregation #1647

Closed
andrross opened this issue Dec 2, 2021 · 21 comments
Labels: bug, Search:Aggregations


andrross commented Dec 2, 2021

I believe I have isolated a regression in performance of the top_hits aggregation between ES 7.9 and ES 7.10.2 (which means the regression is present in OpenSearch 1.0 as well).

To Reproduce
I added the following operation to the default nyc_taxis track in Rally (esrally):

{
  "name": "top_hits_agg",
  "warmup-iterations": 10,
  "iterations": 50,
  "clients": 2,
  "operation": {
    "operation-type": "search",
    "body": {
      "size": 0,
      "query": {
        "range": {
          "dropoff_datetime": {
            "from": "2015-01-01 00:00:00",
            "to": "2015-01-01 12:00:00"
          }
        }
      },
      "aggs": {
        "2": {
          "terms": {
            "field": "payment_type",
            "size": 1000
          },
          "aggregations": {
            "3": {
              "date_histogram": {
                "field": "dropoff_datetime",
                "fixed_interval": "5s"
              },
              "aggregations": {
                "1": {
                  "top_hits": {
                    "size": 1,
                    "sort": [
                      {
                        "dropoff_datetime": {
                          "order": "desc"
                        }
                      }
                    ]
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

After populating both a 7.9 and 7.10.2 cluster with the nyc_taxis dataset, I then ran a race with the following commands:

esrally race --pipeline=benchmark-only --distribution-version=7.9.0 --track=nyc_taxis --target-host=127.0.0.1:9790 --report-file=es_790_report_$(date +%s) --user-tag='version:7.9,race:top_hits_and_date' --include-tasks=top_hits_agg,date_histogram_agg

esrally race --pipeline=benchmark-only --distribution-version=7.10.2 --track=nyc_taxis --target-host=127.0.0.1:9710 --report-file=es_710_report_$(date +%s) --user-tag='version:7.10.2,race:top_hits_and_date' --include-tasks=top_hits_agg,date_histogram_agg

Expected behavior
The expected behavior is very similar performance for the top_hits aggregation between the two versions. The following data compares ES 7.9 as the baseline and ES 7.10.2 as the contender. As expected, the date_histogram_agg operation is more or less at parity between the two; however, the top_hits aggregation is consistently 20-30% slower in 7.10.2 than in 7.9:

| Metric | Task | Baseline | Contender | Diff | Unit |
|---|---|---|---|---|---|
| Cumulative indexing time of primary shards | | 125.509 | 110.375 | -15.1333 | min |
| Min cumulative indexing time across primary shard | | 125.509 | 110.375 | -15.1333 | min |
| Median cumulative indexing time across primary shard | | 125.509 | 110.375 | -15.1333 | min |
| Max cumulative indexing time across primary shard | | 125.509 | 110.375 | -15.1333 | min |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Cumulative merge time of primary shards | | 41.7306 | 47.5379 | 5.80733 | min |
| Cumulative merge count of primary shards | | 93 | 89 | -4 | |
| Min cumulative merge time across primary shard | | 41.7306 | 47.5379 | 5.80733 | min |
| Median cumulative merge time across primary shard | | 41.7306 | 47.5379 | 5.80733 | min |
| Max cumulative merge time across primary shard | | 41.7306 | 47.5379 | 5.80733 | min |
| Cumulative merge throttle time of primary shards | | 4.12925 | 4.25008 | 0.12083 | min |
| Min cumulative merge throttle time across primary shard | | 4.12925 | 4.25008 | 0.12083 | min |
| Median cumulative merge throttle time across primary shard | | 4.12925 | 4.25008 | 0.12083 | min |
| Max cumulative merge throttle time across primary shard | | 4.12925 | 4.25008 | 0.12083 | min |
| Cumulative refresh time of primary shards | | 2.60575 | 2.36 | -0.24575 | min |
| Cumulative refresh count of primary shards | | 82 | 79 | -3 | |
| Min cumulative refresh time across primary shard | | 2.60575 | 2.36 | -0.24575 | min |
| Median cumulative refresh time across primary shard | | 2.60575 | 2.36 | -0.24575 | min |
| Max cumulative refresh time across primary shard | | 2.60575 | 2.36 | -0.24575 | min |
| Cumulative flush time of primary shards | | 5.45558 | 5.48922 | 0.03363 | min |
| Cumulative flush count of primary shards | | 35 | 34 | -1 | |
| Min cumulative flush time across primary shard | | 5.45558 | 5.48922 | 0.03363 | min |
| Median cumulative flush time across primary shard | | 5.45558 | 5.48922 | 0.03363 | min |
| Max cumulative flush time across primary shard | | 5.45558 | 5.48922 | 0.03363 | min |
| Total Young Gen GC time | | 0.258 | 1.24 | 0.982 | s |
| Total Young Gen GC count | | 32 | 136 | 104 | |
| Total Old Gen GC time | | 0 | 0 | 0 | s |
| Total Old Gen GC count | | 0 | 0 | 0 | |
| Store size | | 25.0414 | 24.3655 | -0.67586 | GB |
| Translog size | | 5.12227e-08 | 5.12227e-08 | 0 | GB |
| Heap used for segments | | 0.193222 | 0.0988235 | -0.0944 | MB |
| Heap used for doc values | | 0.0354309 | 0.034874 | -0.00056 | MB |
| Heap used for terms | | 0.0408325 | 0.0370789 | -0.00375 | MB |
| Heap used for norms | | 0 | 0 | 0 | MB |
| Heap used for points | | 0 | 0 | 0 | MB |
| Heap used for stored fields | | 0.116959 | 0.0268707 | -0.09009 | MB |
| Segment count | | 32 | 30 | -2 | |
| Min Throughput | date_histogram_agg | 1.50023 | 1.50023 | -1e-05 | ops/s |
| Median Throughput | date_histogram_agg | 1.50035 | 1.50035 | -0 | ops/s |
| Max Throughput | date_histogram_agg | 1.50054 | 1.50054 | 0 | ops/s |
| 50th percentile latency | date_histogram_agg | 548.563 | 549.301 | 0.73863 | ms |
| 90th percentile latency | date_histogram_agg | 557.053 | 559.417 | 2.36497 | ms |
| 99th percentile latency | date_histogram_agg | 571.816 | 569.581 | -2.23542 | ms |
| 100th percentile latency | date_histogram_agg | 578.69 | 575.319 | -3.37139 | ms |
| 50th percentile service time | date_histogram_agg | 547.429 | 548.28 | 0.85141 | ms |
| 90th percentile service time | date_histogram_agg | 555.608 | 558.266 | 2.65809 | ms |
| 99th percentile service time | date_histogram_agg | 570.955 | 568.385 | -2.57038 | ms |
| 100th percentile service time | date_histogram_agg | 577.466 | 573.309 | -4.1576 | ms |
| error rate | date_histogram_agg | 0 | 0 | 0 | % |
| Min Throughput | top_hits_agg | 0.720015 | 0.559013 | -0.161 | ops/s |
| Median Throughput | top_hits_agg | 0.747501 | 0.577278 | -0.17022 | ops/s |
| Max Throughput | top_hits_agg | 0.759315 | 0.582273 | -0.17704 | ops/s |
| 50th percentile latency | top_hits_agg | 2569.6 | 3373.35 | 803.747 | ms |
| 90th percentile latency | top_hits_agg | 2609.98 | 3455.39 | 845.411 | ms |
| 99th percentile latency | top_hits_agg | 2622.45 | 3502.29 | 879.836 | ms |
| 100th percentile latency | top_hits_agg | 2694.13 | 3503.48 | 809.352 | ms |
| 50th percentile service time | top_hits_agg | 2569.6 | 3373.35 | 803.747 | ms |
| 90th percentile service time | top_hits_agg | 2609.98 | 3455.39 | 845.411 | ms |
| 99th percentile service time | top_hits_agg | 2622.45 | 3502.29 | 879.836 | ms |
| 100th percentile service time | top_hits_agg | 2694.13 | 3503.48 | 809.352 | ms |
| error rate | top_hits_agg | 0 | 0 | 0 | % |

Plugins
None

Host/Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • Version ES 7.9 and ES 7.10.2
  • Host type: c6i.8xlarge

Additional context
The Elasticsearch distributions are the stock versions downloaded as tar files, extracted, and started with the following command:

ES_JAVA_OPTS="-Xms8g -Xmx8g" ./bin/elasticsearch

The only non-default setting in the yml configuration file is the port number. Both instances are running on the same machine, though load is driven against only one at a time (not concurrently). The service is bound to localhost and Rally is running on the same machine.
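
For reference, a rough sketch of that setup; the port values are assumptions inferred from the --target-host flags in the esrally commands above, and http.port is the standard setting for changing the HTTP port:

# Hypothetical recreation of the setup: stock tarballs, default config except the
# HTTP port. Port numbers are taken from the esrally --target-host flags above.
echo "http.port: 9790" >> elasticsearch-7.9.0/config/elasticsearch.yml
echo "http.port: 9710" >> elasticsearch-7.10.2/config/elasticsearch.yml

# Start each node with the same heap settings as above (-d runs it as a daemon).
ES_JAVA_OPTS="-Xms8g -Xmx8g" ./elasticsearch-7.9.0/bin/elasticsearch -d
ES_JAVA_OPTS="-Xms8g -Xmx8g" ./elasticsearch-7.10.2/bin/elasticsearch -d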

andrross added the bug and untriaged labels on Dec 2, 2021

reta commented Dec 2, 2021

@andrross do you mind checking the results returned by top_hits_agg for both distributions? I suspect we may be running into elastic/elasticsearch#65624 or elastic/elasticsearch#67043, thank you.
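
A minimal sketch of that check, assuming the request body from the issue description is saved as top_hits_query.json and the two clusters listen on ports 9790 and 9710 as above ("took" always differs between runs, so it is stripped before diffing):

# Compare the top_hits_agg response from the 7.9 and 7.10.2 clusters (ports assumed).
diff \
  <(curl -s -H 'Content-Type: application/json' \
      'localhost:9790/nyc_taxis/_search?pretty' -d @top_hits_query.json | grep -v '"took"') \
  <(curl -s -H 'Content-Type: application/json' \
      'localhost:9710/nyc_taxis/_search?pretty' -d @top_hits_query.json | grep -v '"took"')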


andrross commented Dec 3, 2021

I do indeed get different results for the query in question here, so it looks like the referenced bug may be a factor.


nknize commented Dec 3, 2021

I'm curious about the referenced bug fixes. While they are something we should consider looking at, those fixes are related to nested documents, and it looks like that doesn't apply to the nyc_taxis data?


reta commented Dec 3, 2021

@nknize I was just checking the recent changes (and bug fixes) related to aggregations. You may be right, but the difference in aggregation results returned between these two runs (as confirmed by @andrross) is exactly what was observed in elastic/elasticsearch#65624.


andrross commented Dec 3, 2021

I think I spoke too soon about the results being wrong. The results are indeed different, but in this case the query is looking for the "top hit" on the dropoff_datetime field. In the case producing the difference, the data contains multiple entries with equivalent dropoff_datetime values and the different versions of the software end up returning a different document. I don't think that is necessarily indicative of a problem though.


reta commented Dec 3, 2021

Ah ... thanks @andrross. We could probably just profile this particular aggregation on 7.9 and 7.10 via curl or similar; maybe something obvious will surface right away.
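
A minimal sketch of such a profiling call, assuming jq is available and the same saved request body and ports as above; "profile": true is the standard search option and reports per-query and per-aggregation timings (in these versions it does not break down the fetch phase):

# Add "profile": true to the saved request body and send it to the 7.9 node;
# filter_path trims the response down to the profile section.
jq '. + {"profile": true}' top_hits_query.json |
  curl -s -H 'Content-Type: application/json' \
    'localhost:9790/nyc_taxis/_search?pretty&filter_path=profile' -d @-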


andrross commented Dec 6, 2021

I profiled ES7.9 versus OpenSearch 1.2. It looks like OS1.2 is possibly spending more time doing decompression.

ES 7.9: [flame graph attachment]

OpenSearch 1.2: [flame graph attachment]


andrross commented Dec 6, 2021

Also, I've tested this "top hits" query against ES 7.11 and ES 7.15, and both appear to exhibit the regression relative to ES 7.9.


reta commented Dec 6, 2021

Interesting, this seems to be https://issues.apache.org/jira/browse/LUCENE-9486: ES 7.9 uses Lucene 8.6, whereas 7.10 is on 8.7, where this feature was delivered.

Update: Basically, it comes down to LZ4 (16kB) (the current BEST_SPEED) vs Deflate with a preset dictionary (the new BEST_COMPRESSION), and the benchmarks on the ticket do show a significant impact on fetch time.
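
As a quick check of which stored-fields codec an index is actually using, the index.codec setting can be inspected; a sketch, with the port assumed as above (include_defaults surfaces the implicit value when nothing was set explicitly):

# "default" corresponds to BEST_SPEED (LZ4), "best_compression" to BEST_COMPRESSION (DEFLATE).
curl -s 'localhost:9790/nyc_taxis/_settings?include_defaults=true&filter_path=**.codec&pretty'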

dblock changed the title from "[BUG] Preformance regression in top_hits aggregation" to "[BUG] Performance regression in top_hits aggregation" on Dec 6, 2021

andrross commented Dec 6, 2021

@reta Is there something peculiar about the top hits query that causes the higher fetch times to manifest as significantly slower queries? Why are the other query-focused performance tests not also impacted?


reta commented Dec 6, 2021

@andrross very good question, but to be honest I don't know. What is interesting is that the aggregation uses global ordinals, which kick in particularly for keyword fields. Would you mind sharing the complete query snippet behind top_hits? Thank you. (Sorry, I missed that it is already in the description.)


reta commented Dec 6, 2021

Relevant discussions on mailing lists and tickets:

@andrross ES 7.15.x still uses Lucene 8.9; the issue seems to be addressed (or at least mitigated) in Lucene 8.10. I think we could benchmark against 2.0.0-SNAPSHOT (which uses Lucene 8.10.1), wdyt?


andrross commented Dec 6, 2021

@reta I've been benchmarking against OpenSearch 1.2, which uses Lucene 8.10.1, and it shows performance comparable to all the ES versions newer than 7.9.

Here's the quick test I'm running against the 4 versions on my machine (all populated with the nyc_taxi dataset but otherwise idle). It consistently produces results like the following:

es7.9      "took" : 2396,
es7.10.2   "took" : 3013,
es7.15.2   "took" : 3068,
os1.2      "took" : 3007,
------
es7.9      "took" : 2393,
es7.10.2   "took" : 3012,
es7.15.2   "took" : 3049,
os1.2      "took" : 2999,
------
es7.9      "took" : 2386,
es7.10.2   "took" : 3009,
es7.15.2   "took" : 3055,
os1.2      "took" : 3009,
------
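
The script itself isn't included here; a sketch of what such a loop might look like, assuming each cluster listens on its own local port (the 7.15.2 and OS 1.2 ports below are made up) and the request body is saved as top_hits_query.json:

#!/usr/bin/env bash
# Print the "took" value reported by each cluster for the same top_hits request.
for target in es7.9:9790 es7.10.2:9710 es7.15.2:9715 os1.2:9200; do
  name=${target%%:*} port=${target##*:}
  printf '%-10s %s\n' "$name" \
    "$(curl -s -H 'Content-Type: application/json' \
         "localhost:${port}/nyc_taxis/_search?pretty" \
         -d @top_hits_query.json | grep '"took"')"
done
echo '------'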


reta commented Dec 7, 2021

@andrross so the change came with 31c026f, where the upgrade to Lucene 8.7.0 happened and basically switched to the new compression.

I experimented and rebuilt an OpenSearch 2.0 SNAPSHOT with Lucene86Codec; the latencies went back to Elasticsearch 7.9.x levels:

Elasticsearch 7.9.x

| Metric | Task | Run #1 | Run #2 | Unit |
|---|---|---|---|---|
| Min Throughput | top_hits_agg | 0.36 | 0.36 | ops/s |
| Mean Throughput | top_hits_agg | 0.38 | 0.38 | ops/s |
| Median Throughput | top_hits_agg | 0.38 | 0.39 | ops/s |
| Max Throughput | top_hits_agg | 0.38 | 0.39 | ops/s |
| 50th percentile latency | top_hits_agg | 69333.1 | 67921.7 | ms |
| 90th percentile latency | top_hits_agg | 107202 | 105145 | ms |
| 100th percentile latency | top_hits_agg | 116503 | 114215 | ms |
| 50th percentile service time | top_hits_agg | 2507.79 | 2469.64 | ms |
| 90th percentile service time | top_hits_agg | 2541.37 | 2520.94 | ms |
| 100th percentile service time | top_hits_agg | 2661.41 | 2541.38 | ms |
| error rate | top_hits_agg | 0 | 0 | % |

OpenSearch 2.0 (SNAPSHOT)

| Metric | Task | Run #1 | Run #2 | Unit |
|---|---|---|---|---|
| Min Throughput | top_hits_agg | 0.27 | 0.28 | ops/s |
| Mean Throughput | top_hits_agg | 0.29 | 0.29 | ops/s |
| Median Throughput | top_hits_agg | 0.29 | 0.29 | ops/s |
| Max Throughput | top_hits_agg | 0.29 | 0.3 | ops/s |
| 50th percentile latency | top_hits_agg | 97998.2 | 96045.2 | ms |
| 90th percentile latency | top_hits_agg | 150750 | 151238 | ms |
| 100th percentile latency | top_hits_agg | 164118 | 164771 | ms |
| 50th percentile service time | top_hits_agg | 3238.32 | 3251.69 | ms |
| 90th percentile service time | top_hits_agg | 3284.75 | 3449.24 | ms |
| 100th percentile service time | top_hits_agg | 3309.06 | 3542.48 | ms |
| error rate | top_hits_agg | 0 | 0 | % |

OpenSearch 2.0 (SNAPSHOT) with Lucene86Codec

| Metric | Task | Run #1 | Run #2 | Unit |
|---|---|---|---|---|
| Min Throughput | top_hits_agg | 0.33 | 0.29 | ops/s |
| Mean Throughput | top_hits_agg | 0.35 | 0.33 | ops/s |
| Median Throughput | top_hits_agg | 0.35 | 0.34 | ops/s |
| Max Throughput | top_hits_agg | 0.35 | 0.34 | ops/s |
| 50th percentile latency | top_hits_agg | 77534.8 | 79022.5 | ms |
| 90th percentile latency | top_hits_agg | 118762 | 120599 | ms |
| 100th percentile latency | top_hits_agg | 129576 | 131161 | ms |
| 50th percentile service time | top_hits_agg | 2679.91 | 2672.98 | ms |
| 90th percentile service time | top_hits_agg | 2759 | 2714.66 | ms |
| 100th percentile service time | top_hits_agg | 2869.71 | 2828.59 | ms |
| error rate | top_hits_agg | 0 | 0 | % |

I actually don't know what we should do in this case; the change seems to be related to the Lucene87Codec implementation.


andrross commented Dec 8, 2021

Thanks @reta ! I was just working on setting up a similar experiment this morning :)

I'm also not sure what we should do here. It looks like there was an intentional decision to improve compression at the cost of slower reads in some cases, and for whatever reason this case seems to be hit particularly hard by the new codec.


reta commented Dec 8, 2021

I think the one thing we could certainly do is document our findings in the official documentation. The other thing we could do is benchmark/profile Lucene87Codec to understand where this overhead is coming from and why. Also, another option to consider could be to allow tuning of the codec on the OpenSearch side, though I'm not sure how useful it would be in practice. @nknize @dblock do you guys have some thoughts on the matter? Thank you!


dblock commented Dec 9, 2021

Is there an option to make compression level configurable at node/cluster level?

Generally, documenting is always a good idea. Other ideas include a blog post that publishes the above benchmarks and explains the trade-off and why it was a good one.


reta commented Dec 9, 2021

thanks @dblock

Is there an option to make compression level configurable at node/cluster level?

Not at the moment, OpenSearch uses the latest Lucene codec: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/codec/CodecService.java#L63

Generally, documenting is always a good idea. Other ideas include a blog post that posts the above benchmarks, explains the trade-off and why it was a good one.

👍 I think we need a bit of work here on the OpenSearch side to match the Apache Lucene findings.

@andrross

I believe this Lucene issue actually does fix the regression. I didn't realize that the testing tool was setting BEST_COMPRESSION by default for the given workload, but when I run the test against an index that has been configured with index.codec=default, the regression nearly completely disappears.

Test script:

$ ./top-hits-test.sh
es7.9      "took" : 737,
os1.0      "took" : 2707,
os1.2      "took" : 961,
------
es7.9      "took" : 744,
os1.0      "took" : 2700,
os1.2      "took" : 933,
------
es7.9      "took" : 742,
os1.0      "took" : 2703,
os1.2      "took" : 915,
------

Note that BEST_COMPRESSION is in fact quite a bit slower on the newer versions compared to ES 7.9, but that's because the new Lucene codec was changed to significantly improve compression at the cost of slower reads. The performance change is minimal when comparing the versions using the default codec (BEST_SPEED).
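
For anyone reproducing this, a sketch of how the two codecs can be compared side by side: index.codec is a static index setting, so it has to be set when the index is created (or on a closed index) and the data reindexed into it. The index names and port below are made up:

# "default" = BEST_SPEED (LZ4), "best_compression" = BEST_COMPRESSION (DEFLATE).
curl -s -X PUT -H 'Content-Type: application/json' \
  'localhost:9200/nyc_taxis_best_speed' \
  -d '{"settings": {"index": {"codec": "default"}}}'

curl -s -X PUT -H 'Content-Type: application/json' \
  'localhost:9200/nyc_taxis_best_compression' \
  -d '{"settings": {"index": {"codec": "best_compression"}}}'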

@andrross

To be clear, the fact that the Lucene change mitigates the regression means that OpenSearch 1.2 already has the fix, because it uses Lucene 8.10.1, and no action is needed to address this within OpenSearch. I think we can close this issue. However, there are a few things we should consider:

  • Document the change in performance of BEST_COMPRESSION (better compression ratio, slower fetch) so anyone coming from ES 7.9 or older to OpenSearch can be aware of the tradeoffs.
  • Ensure that we have performance tests that stress the performance of fetching documents from source so that we can detect changes in performance in this area in future releases and Lucene upgrades.
  • Add profiling to the fetch phase of searches. The flame graph of the JVM profile taken using Java Flight Recorder pointed me to the fetch phase and decompression for this issue, but if the OpenSearch profile data itself had detailed stats on the fetch phase, it would have been obvious much more quickly without needing to profile the JVM itself (a sketch of capturing such a recording follows this list).
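
A sketch of how such a recording can be captured; the Flight Recorder flags are standard JDK options, not taken from this issue, and the resulting .jfr file can then be opened in JDK Mission Control to produce the flame graph:

# Start the node with Flight Recorder enabled, then drive the top_hits load at it.
# For an OpenSearch node the equivalent should be OPENSEARCH_JAVA_OPTS and ./bin/opensearch.
ES_JAVA_OPTS="-Xms8g -Xmx8g -XX:StartFlightRecording=duration=120s,filename=top_hits.jfr" \
  ./bin/elasticsearch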

I'll close this issue and create separate issues for the above items unless anyone has objections or additional comments.


reta commented Dec 16, 2021

Document the change in performance of BEST_COMPRESSION (better compression ratio, slower fetch) so anyone coming from ES 7.9 or older to OpenSearch can be aware of the tradeoffs.

👍

Ensure that we have performance tests that stress the performance of fetching documents from source so that we can detect changes in performance in this area in future releases and Lucene upgrades.

Should be part of #1276; I really like the way the Apache Lucene and Elasticsearch folks publish benchmarks every day.

Add profiling to the fetch phase of searches. The flame graph of the JVM profile taken using Java Flight Recorder pointed me to the fetch phase and decompression for this issue, but if the OpenSearch profile data itself had detailed stats on the fetch phase, then it would have been obvious much more quickly without needing to profile the JVM itself.

👍 this is interesting and would certainly help to detect issues like this.
