
Performance Test - Evaluate OpenSearch read performance #403

Closed
Tracked by #185
penghuo opened this issue Jun 28, 2024 · 0 comments
penghuo commented Jun 28, 2024

OpenSearch Reader Performance evaluation with PIT

Summary

  • PIT on shard with search_after and PIT on slice with search_after achieve similar results. However, increasing the slice count with PIT on slice with search_after leads to higher query latency.

Test

  • PIT with search_after: In the read path, a single PIT is created for the index and assigned to one Spark task, so only 1 Spark task reads data from OpenSearch (see the request sketch after this list).
  • PIT on shard with search_after: In the read path, each OpenSearch index shard is assigned to a Spark task. For example, if the OpenSearch index has 5 shards, then 5 Spark tasks run in parallel.
  • PIT on slice with search_after: In the read path, the OpenSearch index is split into slices, and each slice is assigned to a Spark task. For instance, if the OpenSearch index is split into 10 slices, then 10 Spark tasks run in parallel.
    Note: if the slice count equals the shard count, the slice query is rewritten as a shard query (MatchAllDocsQuery on each shard), so query performance is the same as PIT on shard with search_after.
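
For reference, a minimal sketch of the PIT with search_after read path, assuming the OpenSearch REST API is called directly. The keep_alive, page size, sort field, and search_after values are illustrative, not taken from the test setup; the shard and slice variants issue the same paged request per task, with shard routing or a slice section added.

```
# 1. Create a PIT for the target index (keep_alive value is illustrative)
POST /logs-181998/_search/point_in_time?keep_alive=5m

# 2. The Spark task pages through results with search_after; the search_after
#    values are the sort values of the last hit of the previous page and are
#    omitted on the first page.
GET /_search
{
  "size": 10000,
  "pit": { "id": "<pit_id from step 1>", "keep_alive": "5m" },
  "sort": [ { "_doc": "asc" } ],
  "search_after": [ 123456 ]
}
```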
| Query | PIT with search_after | PIT on shard with search_after | PIT on slice with search_after |
| --- | --- | --- | --- |
| SELECT COUNT(*) FROM dev.default.logs-181998 | 49745.0 | 18256.0 | 18667.0 |
| SELECT COUNT(*) FROM dev.default.logs-181998 WHERE status <> 0; | 50981.0 | 18247.0 | 19006.0 |
| SELECT COUNT(*), AVG(size) FROM dev.default.logs-181998; | 49494.0 | 19951.0 | 20171.0 |
| SELECT AVG(CAST(size AS BIGINT)) FROM dev.default.logs-181998; | 50906.0 | 19719.0 | 20326.0 |
| SELECT MIN(@timestamp), MAX(@timestamp) FROM dev.default.logs-181998; | 44016.0 | 20459.0 | 19915.0 |
| SELECT status, COUNT() FROM dev.default.logs-181998 WHERE status <> 0 GROUP BY status ORDER BY COUNT() DESC; | 50560.0 | 20686.0 | 20550.0 |

(chart attachment: output)

  • Increase slice count
    Ideally, increasing the slice count should improve parallelization. However, the test results show that query latency increases when the slice count is raised.
    The root cause is that the slice query is rewritten as a TermsSliceQuery, which uses _id as its field. For instance, a sliced request such as the one sketched below is rewritten as a TermsSliceQuery on shard 0 (and a MatchNoDocsQuery on the other shards). The cost of this filter is O(N*M), where N is the number of unique terms in the dictionary and M is the average number of documents per term. In our case the cost is O(N), since _id is unique and therefore M = 1.
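
The original query attachment is not reproduced here; the sketch below is an illustrative sliced PIT request (the slice id/max values and the match_all query are assumptions) of the shape that is rewritten into a TermsSliceQuery when the slice count is larger than the shard count, as in the 2 * shard count test.

```
# Illustrative sliced PIT search. With slice.max larger than the shard count,
# slice 0 executes as a TermsSliceQuery over the _id terms dictionary on
# shard 0, and as a MatchNoDocsQuery on the remaining shards.
GET /_search
{
  "pit": { "id": "<pit_id>" },
  "slice": { "id": 0, "max": 10 },
  "query": { "match_all": {} },
  "sort": [ { "_doc": "asc" } ]
}
```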
| Query | Slice Count == Shard Count | Slice Count = 2 * Shard Count |
| --- | --- | --- |
| SELECT COUNT(*) FROM dev.default.logs-181998 | 18667 | 35511 |
| SELECT COUNT(*) FROM dev.default.logs-181998 WHERE status <> 0; | 19006 | 34497 |
| SELECT COUNT(*), AVG(size) FROM dev.default.logs-181998; | 20171 | 32213 |
| SELECT AVG(CAST(size AS BIGINT)) FROM dev.default.logs-181998; | 20326 | 35226 |
| SELECT MIN(@timestamp), MAX(@timestamp) FROM dev.default.logs-181998; | 19915 | 32996 |
| SELECT status, COUNT() FROM dev.default.logs-181998 WHERE status <> 0 GROUP BY status ORDER BY COUNT() DESC; | 20550 | 37419 |

(chart attachment: output)
