[BUG] Performance regression in top_hits aggregation #1647
Comments
@andrross do you mind checking the return of the …
I do indeed get different results for the query in question here, so it looks like the referenced bug may be a factor.
I'm curious about the referenced bug fixes. While they are something we should consider looking at, those fixes are related to …
@nknize I was just checking the recent changes (and bugfixes) related to aggregations. You may be right, but the difference in aggregation results returned between these two runs (as confirmed by @andrross) is exactly what was observed in elastic/elasticsearch#65624
I think I spoke too soon about the results being wrong. The results are indeed different, but in this case the query is looking for the "top hit" on the …
Ah ... thanks @andrross, we probably could just profile this particular aggregation using 7.9 and 7.10 via …
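(The specific tool mentioned in the comment above is not preserved. Purely as a hedged illustration, one readily available option is the search `profile` flag, which breaks down where query and aggregation time is spent on each version; note that it does not cover the fetch phase, where stored-field decompression happens. A minimal sketch, assuming a local cluster on port 9200, a `nyc_taxis` index, and a `pickup_datetime` date field:)

```python
import json
import urllib.request

# Hypothetical sketch: the port, index, and field names are assumptions, not from the thread.
body = {
    "size": 0,
    "profile": True,  # ask the cluster for a per-shard breakdown of query/aggregation time
    "aggs": {
        "by_day": {
            "date_histogram": {"field": "pickup_datetime", "calendar_interval": "day"},
            "aggs": {"top": {"top_hits": {"size": 1}}},
        }
    },
}

req = urllib.request.Request(
    "http://localhost:9200/nyc_taxis/_search",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# Print the aggregation profile entries for each shard.
for shard in result["profile"]["shards"]:
    for agg in shard.get("aggregations", []):
        print(agg["type"], agg["description"], agg["time_in_nanos"])
```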
Also, I've tested this "top hits" query against ES 7.11 and ES 7.15, and both appear to exhibit the regression relative to ES 7.9.
Interesting, this seems to be https://issues.apache.org/jira/browse/LUCENE-9486; ES 7.9 is using Lucene 8.6 whereas 7.10 is on 8.7, where this feature was delivered. Update: basically, it comes down to LZ4 (16kB) (the current BEST_SPEED) vs Deflate with a preset dict (the new BEST_COMPRESSION), and the benchmarks on that ticket do show a significant impact on fetch time.
@reta Is there something peculiar about the top hits query that causes the higher fetch times to manifest as significantly slower queries? Why are the other query-focused performance tests not also impacted?
@andrross very good question, but to be honest I don't know. What is interesting is that the aggregation is using global ordinals, which kick in particularly for …
Relevant discussions on mailing lists and tickets:
@andrross ES 7.15.x still uses Lucene 8.9. The issue seems to be addressed (or at least mitigated) in Lucene 8.10, so I think we could benchmark against 2.0.0-SNAPSHOT (which is using Lucene 8.10.1), wdyt?
@reta I've been benchmarking against OpenSearch 1.2, which is using Lucene 8.10.1, and it shows comparable performance to all the ES versions greater than 7.9. Here's the quick test I'm running against the 4 versions on my machine (all populated with the `nyc_taxis` dataset but otherwise idle). It consistently produces results like the following:
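(The quick test and its output did not survive above. The following is only an illustrative sketch of this kind of side-by-side timing; the ports, index, field names, and repetition count are assumptions rather than the original setup.)

```python
import json
import time
import urllib.request

# Hypothetical local ports for the clusters being compared; adjust to your own setup.
CLUSTERS = {"es-7.9": 9201, "es-7.10.2": 9202, "opensearch-1.0": 9203, "opensearch-1.2": 9204}

BODY = json.dumps({
    "size": 0,
    "aggs": {
        "by_day": {
            "date_histogram": {"field": "pickup_datetime", "calendar_interval": "day"},
            "aggs": {"top": {"top_hits": {"size": 1}}},
        }
    },
}).encode("utf-8")

def run_once(port: int) -> float:
    """Run the top_hits aggregation once and return wall-clock latency in seconds."""
    req = urllib.request.Request(
        f"http://localhost:{port}/nyc_taxis/_search",
        data=BODY,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

for name, port in CLUSTERS.items():
    timings = sorted(run_once(port) for _ in range(10))
    print(f"{name}: median {timings[len(timings) // 2] * 1000:.0f} ms")
```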
@andrross so the change came with 31c026f, where the upgrade to Lucene 8.7.0 happened and basically switched to the new compression. I experimented and rebuilt an OpenSearch 2.0 SNAPSHOT with Lucene86Codec, and the latencies went back to Elasticsearch 7.9.x levels:
Elasticsearch 7.9.x
Opensearch 2.0 (SNAPSHOT)
Opensearch 2.0 (SNAPSHOT) with Lucene86Codec
I actually don't know what we should be doing in this case; the change seems to be related to the …
Thanks @reta! I was just working on setting up a similar experiment this morning :) I'm also not sure what we should be doing here. It looks like there was an intentional decision to increase compression performance at the cost of slower reads in some cases, and for whatever reason this case seems to be particularly bad with the new codec.
I think the one thing we could certainly do is to document our findings in the official documentation. The other thing we could do is to benchmark / profile …
Is there an option to make the compression level configurable at the node/cluster level? Generally, documenting is always a good idea. Other ideas include a blog post that publishes the above benchmarks, explains the trade-off, and why it was a good one.
thanks @dblock
Not at the moment, OpenSearch uses the latest Lucene Codec: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/codec/CodecService.java#L63
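(For reference, the coarse-grained knob that does exist today is the static `index.codec` index setting, which switches an index between the speed-oriented default codec and `best_compression`; it does not expose a numeric compression level. A minimal sketch of setting it at index-creation time, with a placeholder index name and port:)

```python
import json
import urllib.request

# index.codec is a static setting, so it has to be set when the index is created
# (or on a closed index). The index name and port here are placeholders.
settings = {"settings": {"index": {"codec": "best_compression"}}}

req = urllib.request.Request(
    "http://localhost:9200/codec-test",
    data=json.dumps(settings).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```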
👍 I think we need a bit of work here on the OpenSearch side to match the Apache Lucene findings.
I believe this Lucene issue actually does fix the regression. I didn't realize that the testing tool was setting …
Note that …
To be clear, the fact that the Lucene change mitigates the regression means that OpenSearch 1.2 has the fix because it uses Lucene 8.10.1 and no action is needed to address this within OpenSearch. I think we can close this issue. However, there are a few things we should consider:
I'll close this issue and create separate issues for the above items unless anyone has objections or additional comments.
👍
Should be part of #1276. I really like the way the Apache Lucene and Elasticsearch folks publish their benchmarks every day.
👍 this is interesting and would certainly help to detect issues like that.
I believe I have isolated a regression in the performance of the `top_hits` aggregation between ES 7.9 and ES 7.10.2 (which means the regression is present in OpenSearch 1.0 as well).

**To Reproduce**
I added the following operation to the default `nyc_taxis` track in es rally:
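(The operation added to the track is not reproduced above. The sketch below is only an assumed illustration of what a `top_hits` search operation of this kind could look like; the operation name, field, and interval are guesses, written here as a Python dict.)

```python
import json

# Hypothetical rally search operation, not the one from the original report.
top_hits_operation = {
    "name": "top_hits_agg",           # assumed operation name
    "operation-type": "search",
    "index": "nyc_taxis",
    "body": {
        "size": 0,
        "aggs": {
            "by_day": {
                "date_histogram": {"field": "pickup_datetime", "calendar_interval": "day"},
                "aggs": {"top": {"top_hits": {"size": 1}}},
            }
        },
    },
}

print(json.dumps(top_hits_operation, indent=2))
```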
After populating both a 7.9 and a 7.10.2 cluster with the `nyc_taxis` dataset, I then ran a race with the following commands:

**Expected behavior**
The expected behavior is that there will be very similar performance for the `top_hits` aggregation between the two versions of the software. The following data compares ES 7.9 as the baseline and ES 7.10.2 as the contender. As expected, the performance of the `date_histogram_agg` operation is more or less at parity between the two. However, the `top_hits` aggregation performance is consistently 20-30% worse in 7.10.2 than in 7.9:

**Plugins**
None
**Host/Environment (please complete the following information):**

**Additional context**
The elasticsearch distributions are just the stock versions downloaded as tar files, extracted and started with the following command:
The only non-default setting in the yml configuration file is the port number. Both instances are running on the same machine, though load is only driven to one at a time (not concurrently). The service is bound to localhost and es rally is running on the same machine.