[BUG] OpenSearch sort query performance regression #5534
Comments
It should be reiterated that the tests above were conducted on Amazon's managed service. Carrying out tests on standalone clusters will provide more definitive information.
@gkamat Are you planning to run the benchmarks on open-source OpenSearch to confirm that this is in OpenSearch OSS?
The benchmarks don't say anything about the latency percentiles; do you mind adding p50/p90 latencies and the corresponding output? Also, the numbers below, assuming they are average latencies, seem too far off.
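For reference, p50/p90 can be computed from the raw per-iteration latency samples with a nearest-rank percentile; a minimal sketch (the sample values below are hypothetical, not from these benchmark runs):

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # nearest-rank: ceil(pct/100 * n), converted to a 0-based index
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[int(rank) - 1]

# Hypothetical tail-heavy samples: the mean would be ~41 ms, which is
# why averages alone can be misleading for latency comparisons.
latencies = [11, 12, 12, 13, 14, 15, 15, 16, 90, 210]
p50 = percentile(latencies, 50)  # 14
p90 = percentile(latencies, 90)  # 90
```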
Can you also report what led to this observation?
Will add percentile numbers once I have them. The comment regarding memory footprint is a surmise, based on the facts that memory usage has typically increased with newer releases, that the t3 instance type is memory constrained, and that there was a substantial latency improvement when moving to the r5 instance type. Evidently, more testing is required to confirm this.
Ruling out search back-pressure (#5039) and resource tracking framework changes (#3982) as causes of this regression. I benchmarked a standalone OpenSearch cluster with the following configuration:
Ran opensearch-benchmark using the
Lucene gives us a built-in ability to optimise sorting on certain sort field types where the point type matches; i.e., for fields with data type Date or Long, we are able to use this optimisation. I tried adding back the code in https://github.com/opensearch-project/OpenSearch/pull/1974/files with `sortField.setOptimizeSortWithPoints(true);` for the Date and Long data types, and with that we were able to achieve performance similar to before the removal in OpenSearch 2.3. Below are the results from OS 2.3 (run on managed AOS).
@nknize @Rishikesh1159 should we enable this optimisation again for the below 4 types for now?
For the above 4 types, we have the same sort type and point type, so there should be no harm in doing that.
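For context on why the sort-type/point-type match matters: when they agree, Lucene can consult the per-block min/max values in the point index and skip whole blocks of documents that cannot make it into the top-N. A toy Python sketch of the pruning idea (a simplified model, not Lucene's actual implementation):

```python
# Toy model: each block exposes its min/max sort value, analogous to the
# per-block metadata a point index stores for a numeric field.
def top_n_desc(blocks, n):
    """blocks: list of lists of numeric sort values.
    Returns (top-n values descending, number of blocks actually scanned)."""
    top = []      # current top-n candidates
    scanned = 0
    # Visit blocks with the largest max first, so competitive blocks come early.
    for block in sorted(blocks, key=max, reverse=True):
        worst = min(top) if len(top) == n else float("-inf")
        if max(block) <= worst:
            continue  # entire block is non-competitive: skip it unscanned
        scanned += 1
        for v in block:
            if len(top) < n:
                top.append(v)
            elif v > min(top):
                top.remove(min(top))
                top.append(v)
    return sorted(top, reverse=True), scanned

blocks = [[1, 2, 3], [100, 90, 95], [10, 11, 12], [99, 98, 97]]
top, scanned = top_n_desc(blocks, 3)  # only 2 of 4 blocks need scanning
```

Without the point-type match, every document must be visited and compared, which is consistent with the latency regression observed after the optimisation was removed.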
Nice work @gashutos.
Updating here: #6321 was merged to main and the backport to 2.x has also been merged. This solves the issue for the four common numeric data types.
Closing this since #6321 has been merged and backported to 2.3. As mentioned in the PR description, sort latencies are back to the normal levels we had earlier.
@gkamat can we close this issue ? |
Issue has been resolved. |
Summary
OpenSearch performance for sort queries appears to have degraded since version 7.10. A simple timestamp sort query was executed on a 100 GB data set in Amazon's managed service on a few different versions: AES 7.10, AOS 1.3 and AOS 2.3. The clusters used were of two types: the first configured with `t3.medium` data nodes, and the other with `r5.large` data nodes. AES 7.10 had the best sort performance with regard to latency, while AOS 2.3 was the worst of the three.

It was presumed that the memory footprint might have increased as newer OpenSearch versions were released and that this might have had an impact on performance, especially since `t3.medium` has only 4 GB RAM (and 2 vCPUs). Therefore, the clusters were upgraded to `r5.large`, which has 2 vCPUs and 16 GB RAM. Latencies did not change substantially for AES 7.10, degraded somewhat for AOS 1.3, and improved substantially for AOS 2.3 (compared to the smaller instance type). It seems likely that AOS 2.3 has a larger memory footprint and would benefit from memory-optimized instance types.

It is true that the sort query used in the tests is rather simple-minded and that sophisticated users will use a more optimized query; however, not all users have that level of sophistication. The reference query used here was:
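The query body itself did not survive in this copy of the issue. Based on the description (a simple timestamp sort over an `http_logs`-style data set), it was presumably of roughly this shape; this is a hypothetical reconstruction, and the field name `@timestamp` is an assumption, not the exact query used:

```python
# Hypothetical reconstruction of the reference query: a match_all search
# sorted on the timestamp field ("@timestamp" is assumed, not confirmed).
query = {
    "query": {"match_all": {}},
    "sort": [
        {"@timestamp": {"order": "desc"}}  # "asc" for the ascending-sort runs
    ],
}
```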
The other point to note is the difference in performance with regard to sort ordering (ascending vs. descending): ascending sort performance is worse than descending in the memory-constrained mode with the `t3.medium` instances. Notably, the performance of both is similar with the `r5.large` instances, but still worse than in AOS 1.3.

Methodology
Performance was measured using OpenSearch Benchmark (OSB). The data set was a synthetically generated one based on the `http_logs` workload packaged with OSB. Each document was relatively small, ~128 bytes. After the data was ingested and a force-merge carried out, the query above was run for 200 iterations after some warmup iterations.
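The warmup-then-measure loop can be sketched as follows; `search_fn` stands in for whatever issues the query, and the warmup count of 50 is an arbitrary placeholder, since OSB handles all of this internally. This is only an illustration of the methodology, not OSB's implementation:

```python
import time

def run_benchmark(search_fn, warmup=50, iterations=200):
    """Run search_fn a number of warmup times, then return per-iteration
    latencies (ms) for the measured iterations."""
    for _ in range(warmup):
        search_fn()  # warmup results are discarded
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        search_fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

In the actual runs, OSB reports the latency distribution over the 200 measured iterations, which is where the numbers in the tables below come from.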
The clusters each had 3 master nodes (r5.large) and 2 data nodes.
Results
All numbers are latencies in ms. There were two runs in each mode to ensure repeatability.