
[BUG] Performance regression in top_hits aggregation #1647

Closed
andrross opened this issue Dec 2, 2021 · 21 comments
Labels: bug, Search:Aggregations


andrross commented Dec 2, 2021

I believe I have isolated a regression in performance of the top_hits aggregation between ES 7.9 and ES 7.10.2 (which means the regression is present in OpenSearch 1.0 as well).

To Reproduce
I added the following operation to the default nyc_taxis track in Rally (esrally):

{
  "name": "top_hits_agg",
  "warmup-iterations": 10,
  "iterations": 50,
  "clients": 2,
  "operation": {
    "operation-type": "search",
    "body": {
      "size": 0,
      "query": {
        "range": {
          "dropoff_datetime": {
            "from": "2015-01-01 00:00:00",
            "to": "2015-01-01 12:00:00"
          }
        }
      },
      "aggs": {
        "2": {
          "terms": {
            "field": "payment_type",
            "size": 1000
          },
          "aggregations": {
            "3": {
              "date_histogram": {
                "field": "dropoff_datetime",
                "fixed_interval": "5s"
              },
              "aggregations": {
                "1": {
                  "top_hits": {
                    "size": 1,
                    "sort": [
                      {
                        "dropoff_datetime": {
                          "order": "desc"
                        }
                      }
                    ]
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

After populating both a 7.9 and 7.10.2 cluster with the nyc_taxis dataset, I then ran a race with the following commands:

esrally race --pipeline=benchmark-only --distribution-version=7.9.0 --track=nyc_taxis --target-host=127.0.0.1:9790 --report-file=es_790_report_$(date +%s) --user-tag='version:7.9,race:top_hits_and_date' --include-tasks=top_hits_agg,date_histogram_agg

esrally race --pipeline=benchmark-only --distribution-version=7.10.2 --track=nyc_taxis --target-host=127.0.0.1:9710 --report-file=es_710_report_$(date +%s) --user-tag='version:7.10.2,race:top_hits_and_date' --include-tasks=top_hits_agg,date_histogram_agg

Expected behavior
The expected behavior is very similar performance for the top_hits aggregation between the two versions. The following data compares ES 7.9 as the baseline and ES 7.10.2 as the contender. As expected, the date_histogram_agg operation is more or less at parity between the two; however, the top_hits aggregation is consistently 20-30% slower in 7.10.2 than in 7.9:

| Metric | Task | Baseline | Contender | Diff | Unit |
|---|---|---|---|---|---|
| Cumulative indexing time of primary shards | | 125.509 | 110.375 | -15.1333 | min |
| Min cumulative indexing time across primary shard | | 125.509 | 110.375 | -15.1333 | min |
| Median cumulative indexing time across primary shard | | 125.509 | 110.375 | -15.1333 | min |
| Max cumulative indexing time across primary shard | | 125.509 | 110.375 | -15.1333 | min |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min |
| Cumulative merge time of primary shards | | 41.7306 | 47.5379 | 5.80733 | min |
| Cumulative merge count of primary shards | | 93 | 89 | -4 | |
| Min cumulative merge time across primary shard | | 41.7306 | 47.5379 | 5.80733 | min |
| Median cumulative merge time across primary shard | | 41.7306 | 47.5379 | 5.80733 | min |
| Max cumulative merge time across primary shard | | 41.7306 | 47.5379 | 5.80733 | min |
| Cumulative merge throttle time of primary shards | | 4.12925 | 4.25008 | 0.12083 | min |
| Min cumulative merge throttle time across primary shard | | 4.12925 | 4.25008 | 0.12083 | min |
| Median cumulative merge throttle time across primary shard | | 4.12925 | 4.25008 | 0.12083 | min |
| Max cumulative merge throttle time across primary shard | | 4.12925 | 4.25008 | 0.12083 | min |
| Cumulative refresh time of primary shards | | 2.60575 | 2.36 | -0.24575 | min |
| Cumulative refresh count of primary shards | | 82 | 79 | -3 | |
| Min cumulative refresh time across primary shard | | 2.60575 | 2.36 | -0.24575 | min |
| Median cumulative refresh time across primary shard | | 2.60575 | 2.36 | -0.24575 | min |
| Max cumulative refresh time across primary shard | | 2.60575 | 2.36 | -0.24575 | min |
| Cumulative flush time of primary shards | | 5.45558 | 5.48922 | 0.03363 | min |
| Cumulative flush count of primary shards | | 35 | 34 | -1 | |
| Min cumulative flush time across primary shard | | 5.45558 | 5.48922 | 0.03363 | min |
| Median cumulative flush time across primary shard | | 5.45558 | 5.48922 | 0.03363 | min |
| Max cumulative flush time across primary shard | | 5.45558 | 5.48922 | 0.03363 | min |
| Total Young Gen GC time | | 0.258 | 1.24 | 0.982 | s |
| Total Young Gen GC count | | 32 | 136 | 104 | |
| Total Old Gen GC time | | 0 | 0 | 0 | s |
| Total Old Gen GC count | | 0 | 0 | 0 | |
| Store size | | 25.0414 | 24.3655 | -0.67586 | GB |
| Translog size | | 5.12227e-08 | 5.12227e-08 | 0 | GB |
| Heap used for segments | | 0.193222 | 0.0988235 | -0.0944 | MB |
| Heap used for doc values | | 0.0354309 | 0.034874 | -0.00056 | MB |
| Heap used for terms | | 0.0408325 | 0.0370789 | -0.00375 | MB |
| Heap used for norms | | 0 | 0 | 0 | MB |
| Heap used for points | | 0 | 0 | 0 | MB |
| Heap used for stored fields | | 0.116959 | 0.0268707 | -0.09009 | MB |
| Segment count | | 32 | 30 | -2 | |
| Min Throughput | date_histogram_agg | 1.50023 | 1.50023 | -1e-05 | ops/s |
| Median Throughput | date_histogram_agg | 1.50035 | 1.50035 | -0 | ops/s |
| Max Throughput | date_histogram_agg | 1.50054 | 1.50054 | 0 | ops/s |
| 50th percentile latency | date_histogram_agg | 548.563 | 549.301 | 0.73863 | ms |
| 90th percentile latency | date_histogram_agg | 557.053 | 559.417 | 2.36497 | ms |
| 99th percentile latency | date_histogram_agg | 571.816 | 569.581 | -2.23542 | ms |
| 100th percentile latency | date_histogram_agg | 578.69 | 575.319 | -3.37139 | ms |
| 50th percentile service time | date_histogram_agg | 547.429 | 548.28 | 0.85141 | ms |
| 90th percentile service time | date_histogram_agg | 555.608 | 558.266 | 2.65809 | ms |
| 99th percentile service time | date_histogram_agg | 570.955 | 568.385 | -2.57038 | ms |
| 100th percentile service time | date_histogram_agg | 577.466 | 573.309 | -4.1576 | ms |
| error rate | date_histogram_agg | 0 | 0 | 0 | % |
| Min Throughput | top_hits_agg | 0.720015 | 0.559013 | -0.161 | ops/s |
| Median Throughput | top_hits_agg | 0.747501 | 0.577278 | -0.17022 | ops/s |
| Max Throughput | top_hits_agg | 0.759315 | 0.582273 | -0.17704 | ops/s |
| 50th percentile latency | top_hits_agg | 2569.6 | 3373.35 | 803.747 | ms |
| 90th percentile latency | top_hits_agg | 2609.98 | 3455.39 | 845.411 | ms |
| 99th percentile latency | top_hits_agg | 2622.45 | 3502.29 | 879.836 | ms |
| 100th percentile latency | top_hits_agg | 2694.13 | 3503.48 | 809.352 | ms |
| 50th percentile service time | top_hits_agg | 2569.6 | 3373.35 | 803.747 | ms |
| 90th percentile service time | top_hits_agg | 2609.98 | 3455.39 | 845.411 | ms |
| 99th percentile service time | top_hits_agg | 2622.45 | 3502.29 | 879.836 | ms |
| 100th percentile service time | top_hits_agg | 2694.13 | 3503.48 | 809.352 | ms |
| error rate | top_hits_agg | 0 | 0 | 0 | % |

Plugins
None

Host/Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • Version ES 7.9 and ES 7.10.2
  • Host type: c6i.8xlarge

Additional context
The Elasticsearch distributions are the stock versions downloaded as tar files, extracted, and started with the following command:

ES_JAVA_OPTS="-Xms8g -Xmx8g" ./bin/elasticsearch

The only non-default setting in the yml configuration file is the port number. Both instances are running on the same machine, though load is driven against only one at a time (not concurrently). The service is bound to localhost and Rally is running on the same machine.
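
For reference, a rough sketch of that setup; the port values are assumptions inferred from the --target-host flags in the esrally commands above, and http.port is the standard setting for changing the HTTP port:

# Hypothetical recreation of the setup: stock tarballs, default config except the
# HTTP port. Port numbers are taken from the esrally --target-host flags above.
echo "http.port: 9790" >> elasticsearch-7.9.0/config/elasticsearch.yml
echo "http.port: 9710" >> elasticsearch-7.10.2/config/elasticsearch.yml

# Start each node with the same heap settings as above (-d runs it as a daemon).
ES_JAVA_OPTS="-Xms8g -Xmx8g" ./elasticsearch-7.9.0/bin/elasticsearch -d
ES_JAVA_OPTS="-Xms8g -Xmx8g" ./elasticsearch-7.10.2/bin/elasticsearch -d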

andrross added the bug and untriaged labels on Dec 2, 2021

reta commented Dec 2, 2021

@andrross do you mind checking the results returned by top_hits_agg for both distributions? I suspect we may be running into elastic/elasticsearch#65624 or elastic/elasticsearch#67043, thank you.
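
A minimal sketch of that check, assuming the request body from the issue description is saved as top_hits_query.json and the two clusters listen on ports 9790 and 9710 as above ("took" always differs between runs, so it is stripped before diffing):

# Compare the top_hits_agg response from the 7.9 and 7.10.2 clusters (ports assumed).
diff \
  <(curl -s -H 'Content-Type: application/json' \
      'localhost:9790/nyc_taxis/_search?pretty' -d @top_hits_query.json | grep -v '"took"') \
  <(curl -s -H 'Content-Type: application/json' \
      'localhost:9710/nyc_taxis/_search?pretty' -d @top_hits_query.json | grep -v '"took"')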


andrross commented Dec 3, 2021

I do indeed get different results for the query in question here, so it looks like the referenced bug may be a factor.


nknize commented Dec 3, 2021

I'm curious about the referenced bug fixes. While they are something we should consider looking at, those fixes are related to nested documents, and it looks like that doesn't apply to the nyc_taxis data?


reta commented Dec 3, 2021

@nknize I was just checking the recent changes (and bug fixes) related to aggregations. You may be right, but the difference in aggregation results returned between these two runs (as confirmed by @andrross) is exactly what was observed in elastic/elasticsearch#65624.


andrross commented Dec 3, 2021

I think I spoke too soon about the results being wrong. The results are indeed different, but in this case the query is looking for the "top hit" on the dropoff_datetime field. In the case producing the difference, the data contains multiple entries with equivalent dropoff_datetime values and the different versions of the software end up returning a different document. I don't think that is necessarily indicative of a problem though.


reta commented Dec 3, 2021

Ah ... thanks @andrross. We could probably just profile this particular aggregation on 7.9 and 7.10 via curl or similar; maybe something obvious will surface right away.
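
A minimal sketch of such a profiling call, assuming jq is available and the same saved request body and ports as above; "profile": true is the standard search option and reports per-query and per-aggregation timings (in these versions it does not break down the fetch phase):

# Add "profile": true to the saved request body and send it to the 7.9 node;
# filter_path trims the response down to the profile section.
jq '. + {"profile": true}' top_hits_query.json |
  curl -s -H 'Content-Type: application/json' \
    'localhost:9790/nyc_taxis/_search?pretty&filter_path=profile' -d @-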


andrross commented Dec 6, 2021

I profiled ES7.9 versus OpenSearch 1.2. It looks like OS1.2 is possibly spending more time doing decompression.

ES 7.9: [flame graph attachment]

OpenSearch 1.2: [flame graph attachment]


andrross commented Dec 6, 2021

Also, I've tested this "top hits" query against ES 7.11 and ES 7.15, and both appear to exhibit the regression relative to ES 7.9.


reta commented Dec 6, 2021

Interesting, this seems to be https://issues.apache.org/jira/browse/LUCENE-9486: ES 7.9 uses Lucene 8.6, whereas 7.10 is on 8.7, where this feature was delivered.

Update: Basically, it comes down to LZ4 (16kB) (the current BEST_SPEED) vs Deflate with a preset dictionary (the new BEST_COMPRESSION), and the benchmarks on the ticket do show a significant impact on fetch time.
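
As a quick check of which stored-fields codec an index is actually using, the index.codec setting can be inspected; a sketch, with the port assumed as above (include_defaults surfaces the implicit value when nothing was set explicitly):

# "default" corresponds to BEST_SPEED (LZ4), "best_compression" to BEST_COMPRESSION (DEFLATE).
curl -s 'localhost:9790/nyc_taxis/_settings?include_defaults=true&filter_path=**.codec&pretty'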

dblock changed the title from "[BUG] Preformance regression in top_hits aggregation" to "[BUG] Performance regression in top_hits aggregation" on Dec 6, 2021

andrross commented Dec 6, 2021

@reta Is there something peculiar about the top hits query that causes the higher fetch times to manifest as significantly slower queries? Why are the other query-focused performance tests not also impacted?


reta commented Dec 6, 2021

@andrross very good question, but to be honest I don't know. What is interesting is that the aggregation uses global ordinals, which kick in particularly for keyword fields. Would you mind sharing the complete query snippet behind top_hits? Thank you. (Sorry, I missed that it is already in the description.)


reta commented Dec 6, 2021

Relevant discussions on mailing lists and tickets:

@andrross ES 7.15.x still uses Lucene 8.9; the issue seems to be addressed (or at least mitigated) in Lucene 8.10. I think we could benchmark against 2.0.0-SNAPSHOT (which uses Lucene 8.10.1), wdyt?


andrross commented Dec 6, 2021

@reta I've been benchmarking against OpenSearch 1.2, which uses Lucene 8.10.1, and it shows performance comparable to all the ES versions newer than 7.9.

Here's the quick test I'm running against the 4 versions on my machine (all populated with the nyc_taxi dataset but otherwise idle). It consistently produces results like the following:

es7.9      "took" : 2396,
es7.10.2   "took" : 3013,
es7.15.2   "took" : 3068,
os1.2      "took" : 3007,
------
es7.9      "took" : 2393,
es7.10.2   "took" : 3012,
es7.15.2   "took" : 3049,
os1.2      "took" : 2999,
------
es7.9      "took" : 2386,
es7.10.2   "took" : 3009,
es7.15.2   "took" : 3055,
os1.2      "took" : 3009,
------
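
The script itself isn't included here; a sketch of what such a loop might look like, assuming each cluster listens on its own local port (the 7.15.2 and OS 1.2 ports below are made up) and the request body is saved as top_hits_query.json:

#!/usr/bin/env bash
# Print the "took" value reported by each cluster for the same top_hits request.
for target in es7.9:9790 es7.10.2:9710 es7.15.2:9715 os1.2:9200; do
  name=${target%%:*} port=${target##*:}
  printf '%-10s %s\n' "$name" \
    "$(curl -s -H 'Content-Type: application/json' \
         "localhost:${port}/nyc_taxis/_search?pretty" \
         -d @top_hits_query.json | grep '"took"')"
done
echo '------'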


reta commented Dec 7, 2021

@andrross so the change came with 31c026f, where the upgrade to Lucene 8.7.0 happened and basically switched to the new compression.

I experimented and rebuilt an OpenSearch 2.0 SNAPSHOT with Lucene86Codec; the latencies went back to Elasticsearch 7.9.x levels:

Elasticsearch 7.9.x

| Metric | Task | Run #1 | Run #2 | Unit |
|---|---|---|---|---|
| Min Throughput | top_hits_agg | 0.36 | 0.36 | ops/s |
| Mean Throughput | top_hits_agg | 0.38 | 0.38 | ops/s |
| Median Throughput | top_hits_agg | 0.38 | 0.39 | ops/s |
| Max Throughput | top_hits_agg | 0.38 | 0.39 | ops/s |
| 50th percentile latency | top_hits_agg | 69333.1 | 67921.7 | ms |
| 90th percentile latency | top_hits_agg | 107202 | 105145 | ms |
| 100th percentile latency | top_hits_agg | 116503 | 114215 | ms |
| 50th percentile service time | top_hits_agg | 2507.79 | 2469.64 | ms |
| 90th percentile service time | top_hits_agg | 2541.37 | 2520.94 | ms |
| 100th percentile service time | top_hits_agg | 2661.41 | 2541.38 | ms |
| error rate | top_hits_agg | 0 | 0 | % |

OpenSearch 2.0 (SNAPSHOT)

| Metric | Task | Run #1 | Run #2 | Unit |
|---|---|---|---|---|
| Min Throughput | top_hits_agg | 0.27 | 0.28 | ops/s |
| Mean Throughput | top_hits_agg | 0.29 | 0.29 | ops/s |
| Median Throughput | top_hits_agg | 0.29 | 0.29 | ops/s |
| Max Throughput | top_hits_agg | 0.29 | 0.3 | ops/s |
| 50th percentile latency | top_hits_agg | 97998.2 | 96045.2 | ms |
| 90th percentile latency | top_hits_agg | 150750 | 151238 | ms |
| 100th percentile latency | top_hits_agg | 164118 | 164771 | ms |
| 50th percentile service time | top_hits_agg | 3238.32 | 3251.69 | ms |
| 90th percentile service time | top_hits_agg | 3284.75 | 3449.24 | ms |
| 100th percentile service time | top_hits_agg | 3309.06 | 3542.48 | ms |
| error rate | top_hits_agg | 0 | 0 | % |

OpenSearch 2.0 (SNAPSHOT) with Lucene86Codec

| Metric | Task | Run #1 | Run #2 | Unit |
|---|---|---|---|---|
| Min Throughput | top_hits_agg | 0.33 | 0.29 | ops/s |
| Mean Throughput | top_hits_agg | 0.35 | 0.33 | ops/s |
| Median Throughput | top_hits_agg | 0.35 | 0.34 | ops/s |
| Max Throughput | top_hits_agg | 0.35 | 0.34 | ops/s |
| 50th percentile latency | top_hits_agg | 77534.8 | 79022.5 | ms |
| 90th percentile latency | top_hits_agg | 118762 | 120599 | ms |
| 100th percentile latency | top_hits_agg | 129576 | 131161 | ms |
| 50th percentile service time | top_hits_agg | 2679.91 | 2672.98 | ms |
| 90th percentile service time | top_hits_agg | 2759 | 2714.66 | ms |
| 100th percentile service time | top_hits_agg | 2869.71 | 2828.59 | ms |
| error rate | top_hits_agg | 0 | 0 | % |

I actually don't know what we should do in this case; the change seems to be related to the Lucene87Codec implementation.


andrross commented Dec 8, 2021

Thanks @reta ! I was just working on setting up a similar experiment this morning :)

I'm also not sure what we should do here. It looks like there was an intentional decision to improve compression at the cost of slower reads in some cases, and for whatever reason this case seems to be hit particularly hard by the new codec.


reta commented Dec 8, 2021

I think the one thing we could certainly do is document our findings in the official documentation. The other thing we could do is benchmark/profile Lucene87Codec to understand where this overhead is coming from and why. Also, another option to consider could be to allow tuning of the codec on the OpenSearch side, though I'm not sure how useful it would be in practice. @nknize @dblock do you guys have some thoughts on the matter? Thank you!


dblock commented Dec 9, 2021

Is there an option to make compression level configurable at node/cluster level?

Generally, documenting is always a good idea. Other ideas include a blog post that publishes the above benchmarks and explains the trade-off and why it was a good one.


reta commented Dec 9, 2021

thanks @dblock

Is there an option to make compression level configurable at node/cluster level?

Not at the moment, OpenSearch uses the latest Lucene codec: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/codec/CodecService.java#L63

Generally, documenting is always a good idea. Other ideas include a blog post that posts the above benchmarks, explains the trade-off and why it was a good one.

👍 I think we need a bit of work here on the OpenSearch side to match the Apache Lucene findings.

@andrross

I believe this Lucene issue actually does fix the regression. I didn't realize that the testing tool was setting BEST_COMPRESSION by default for the given workload, but when I run the test against an index that has been configured with index.codec=default, the regression nearly completely disappears.

Test script:

$ ./top-hits-test.sh
es7.9      "took" : 737,
os1.0      "took" : 2707,
os1.2      "took" : 961,
------
es7.9      "took" : 744,
os1.0      "took" : 2700,
os1.2      "took" : 933,
------
es7.9      "took" : 742,
os1.0      "took" : 2703,
os1.2      "took" : 915,
------

Note that BEST_COMPRESSION is in fact quite a bit slower on the newer versions compared to ES 7.9, but that's because the new Lucene codec was changed to significantly improve compression at the cost of slower reads. The performance change is minimal when comparing the versions using the default codec (BEST_SPEED).
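
For anyone reproducing this, a sketch of how the two codecs can be compared side by side: index.codec is a static index setting, so it has to be set when the index is created (or on a closed index) and the data reindexed into it. The index names and port below are made up:

# "default" = BEST_SPEED (LZ4), "best_compression" = BEST_COMPRESSION (DEFLATE).
curl -s -X PUT -H 'Content-Type: application/json' \
  'localhost:9200/nyc_taxis_best_speed' \
  -d '{"settings": {"index": {"codec": "default"}}}'

curl -s -X PUT -H 'Content-Type: application/json' \
  'localhost:9200/nyc_taxis_best_compression' \
  -d '{"settings": {"index": {"codec": "best_compression"}}}'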

@andrross

To be clear, the fact that the Lucene change mitigates the regression means that OpenSearch 1.2 already has the fix, because it uses Lucene 8.10.1, and no action is needed to address this within OpenSearch. I think we can close this issue. However, there are a few things we should consider:

  • Document the change in performance of BEST_COMPRESSION (better compression ratio, slower fetch) so anyone coming from ES 7.9 or older to OpenSearch can be aware of the tradeoffs.
  • Ensure that we have performance tests that stress the performance of fetching documents from source so that we can detect changes in performance in this area in future releases and Lucene upgrades.
  • Add profiling to the fetch phase of searches. The flame graph of the JVM profile taken using Java Flight Recorder pointed me to the fetch phase and decompression for this issue, but if the OpenSearch profile data itself had detailed stats on the fetch phase, it would have been obvious much more quickly without needing to profile the JVM itself (a sketch of capturing such a recording follows this list).
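
A sketch of how such a recording can be captured; the Flight Recorder flags are standard JDK options, not taken from this issue, and the resulting .jfr file can then be opened in JDK Mission Control to produce the flame graph:

# Start the node with Flight Recorder enabled, then drive the top_hits load at it.
# For an OpenSearch node the equivalent should be OPENSEARCH_JAVA_OPTS and ./bin/opensearch.
ES_JAVA_OPTS="-Xms8g -Xmx8g -XX:StartFlightRecording=duration=120s,filename=top_hits.jfr" \
  ./bin/elasticsearch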

I'll close this issue and create separate issues for the above items unless anyone has objections or additional comments.


reta commented Dec 16, 2021

Document the change in performance of BEST_COMPRESSION (better compression ratio, slower fetch) so anyone coming from ES 7.9 or older to OpenSearch can be aware of the tradeoffs.

👍

Ensure that we have performance tests that stress the performance of fetching documents from source so that we can detect changes in performance in this area in future releases and Lucene upgrades.

Should be part of #1276; I really like the way the Apache Lucene and Elasticsearch folks publish benchmarks every day.

Add profiling to the fetch phase of searches. The flame graph of the JVM profile taken using Java Flight Recorder pointed me to the fetch phase and decompression for this issue, but if the OpenSearch profile data itself had detailed stats on the fetch phase, then it would have been obvious much more quickly without needing to profile the JVM itself.

👍 this is interesting and would certainly help to detect issues like this.
