[Feature Request] Make use of dynamic pruning for faster cardinality aggregations #11959

rishabhmaurya · 2024-01-21T21:24:19Z

Is your feature request related to a problem? Please describe

Dynamic pruning algorithms prune the search space by dynamically adding negating filters to the query for disjunction clauses that are non-competitive (or that are found to be no longer competitive during query execution).
One use case is cardinality aggregations. Instead of using negative filters in Lucene, if the field is a low-cardinality field (to avoid an explosion of disjunctions), the query can be rewritten as a disjunction over all the unique terms of the field. As matching documents are evaluated, once a field value has been encountered for the first time, its term can be safely removed from the disjunction. This ensures that no further documents are processed for a value that has already been counted. The query also terminates early once all disjunctive clauses are exhausted.
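A minimal sketch of the idea above, written in plain Python rather than against Lucene APIs. It assumes every document matches the query (a match-all case) and models each term's postings as a sorted list of doc IDs; the function name and the data layout are illustrative only.

```python
import heapq

def count_distinct_with_pruning(postings):
    """postings: dict mapping term -> sorted list of doc IDs containing it.

    Returns (number of distinct terms, list of doc IDs actually visited).
    Assumes every doc matches the query, so the first doc of each term counts.
    """
    # One cursor per term, ordered by the doc ID it currently points to.
    heap = [(docs[0], term) for term, docs in postings.items() if docs]
    heapq.heapify(heap)

    distinct = 0
    visited = []
    while heap:
        doc = heap[0][0]
        visited.append(doc)
        # Every term positioned on this doc has now been seen once.
        # Dropping its cursor means its remaining postings are never read.
        while heap and heap[0][0] == doc:
            heapq.heappop(heap)
            distinct += 1
    # The loop ends as soon as no clause is left: early termination.
    return distinct, visited

postings = {"a": [0, 1, 2, 3], "b": [1, 4], "c": [2, 5], "d": [0, 5]}
distinct, visited = count_distinct_with_pruning(postings)
```

In this toy segment only docs 0, 1 and 2 are visited, while an unpruned pass would touch all six documents.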

Describe the solution you'd like

This logic can easily be embedded into the Query.rewrite() method when the cardinality aggregation is the only aggregation and the field is a low-cardinality field. An upper bound on the field's cardinality for a query can be estimated from either FieldReader size() or SortedSetDocValues getValueCount().

We need to run some benchmarks to determine when to enable this optimization, as it may not help much on a smaller corpus. I propose running it against the noaa workload - https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/bdbd4bbd74fbf319398de1ca169f16744821bcde/noaa/operations/default.json#L765

Related component

Search:Aggregations

Describe alternatives you've considered

No response

Additional context

No response


rishabhmaurya · 2024-01-21T21:26:27Z

cc @getsaurabh02 @msfroh let me know your thoughts

rishabhmaurya · 2024-01-23T20:28:51Z

This is the idea from the blog post - https://www.elastic.co/blog/faster-cardinality-aggregations-dynamic-pruning

rishabhmaurya · 2024-02-14T20:04:26Z

Here is my attempt at it: rishabhmaurya#74. Thanks @msfroh for suggesting a workaround to maintain the invariant of Conjunction DISI by lazily propagating the docID on next()/advance(), and for validating this approach.

@kkmr and @msfroh, let me know your thoughts: is the algorithm below reasonable enough to proceed with?
Also, can any of these steps be optimized further?

Here is the breakdown of the algorithm -

Check all the preconditions for enabling this optimization -

1. Only enabled when Cardinality Aggregation is the only aggregation.
1. The field is a low cardinality field.
1. Field type is one of Keyword, Numeric?
1. Other?
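The precondition check could look roughly like the sketch below. It is a plain-Python illustration, not OpenSearch code: `SegmentFieldStats`, `can_use_dynamic_pruning`, the supported-type set and the threshold value are all hypothetical, and the real cutoff would have to come from the benchmarks mentioned above.

```python
from dataclasses import dataclass

# Hypothetical values: the thread leaves the supported types and the
# low-cardinality cutoff as open questions.
SUPPORTED_FIELD_TYPES = {"keyword"}
MAX_TERMS_FOR_PRUNING = 128

@dataclass
class SegmentFieldStats:
    field_type: str
    # Upper bound on distinct values, e.g. what FieldReader size() or
    # SortedSetDocValues getValueCount() would report for the segment.
    value_count: int

def can_use_dynamic_pruning(aggregation_types, stats,
                            max_terms=MAX_TERMS_FOR_PRUNING):
    """Return True only if all the listed preconditions hold."""
    only_cardinality = list(aggregation_types) == ["cardinality"]
    low_cardinality = 0 < stats.value_count <= max_terms
    supported_type = stats.field_type in SUPPORTED_FIELD_TYPES
    return only_cardinality and low_cardinality and supported_type
```

Keeping the threshold as a parameter makes it straightforward to sweep different cutoffs while benchmarking.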

Once the preconditions are met, while collectors are created and picked for a given segment, create a DynamicPruningCollectorWrapper to wrap the collector with the optimization.
DynamicPruningCollectorWrapper will enumerate all the terms for the given field and create a DisjunctionWithDynamicPruningScorer, similar to DisjunctionScorer in Lucene, in conjunction with the parent query. DisjunctionWithDynamicPruningScorer should have the following capabilities in addition to what DisjunctionScorer has -

1. #removeAllDISIsOnCurrentDoc() - removes all the DISIs of sub-scorers pointing to the current doc. This is what enables dynamic pruning for the cardinality aggregation: once a term is found, it becomes irrelevant for the rest of the search space, so this term's sub-scorer DISI can be safely removed from the list of sub-scorers to process.
1. #removeAllDISIsOnCurrentDoc() breaks the invariant of Conjunction DISI, i.e. the docIDs of all sub-scorers should be less than or equal to the docID the iterator currently points to. Removing elements from the priority queue triggers a heapify, which changes the top of the priority queue, and the top represents the current docID of the sub-scorers. To address this, we wrap the iterator with SlowDocIdPropagatorDISI, which keeps the iterator pointing to the last docID from before #removeAllDISIsOnCurrentDoc() was called and updates this docID only when next() or advance() is called.
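The two capabilities above can be modelled in a few lines of plain Python. This is only a sketch of the behaviour, not the code from the linked PR: `PruningDisjunction` and its method names are made up for illustration, and postings are simplified to sorted lists of doc IDs.

```python
import heapq
from bisect import bisect_left

NO_MORE_DOCS = 2**31 - 1  # same sentinel value Lucene uses

class PruningDisjunction:
    """Toy disjunction over term postings with pruning of seen terms."""

    def __init__(self, postings):
        self._postings = postings
        self._heap = [(docs[0], term, 0)
                      for term, docs in postings.items() if docs]
        heapq.heapify(self._heap)
        # Doc ID exposed to callers. It is updated only in advance(), so
        # pruning the heap never changes it behind the caller's back.
        self._doc = -1

    def doc_id(self):
        return self._doc

    def next_doc(self):
        return self.advance(self._doc + 1)

    def advance(self, target):
        # Move every cursor that is still behind the target.
        while self._heap and self._heap[0][0] < target:
            _, term, pos = heapq.heappop(self._heap)
            docs = self._postings[term]
            pos = bisect_left(docs, target, pos)
            if pos < len(docs):
                heapq.heappush(self._heap, (docs[pos], term, pos))
        self._doc = self._heap[0][0] if self._heap else NO_MORE_DOCS
        return self._doc

    def remove_all_on_current_doc(self):
        """Drop every term cursor positioned on the current doc."""
        removed = []
        while self._heap and self._heap[0][0] == self._doc:
            removed.append(heapq.heappop(self._heap)[1])
        # self._doc is deliberately left untouched (lazy propagation).
        return removed

disj = PruningDisjunction({"a": [0, 3], "b": [0, 5], "c": [2]})
first = disj.next_doc()
removed = disj.remove_all_on_current_doc()
still_current = disj.doc_id()
```

After the removal the heap's top has moved on to doc 2, but `doc_id()` keeps reporting doc 0 until the next `next_doc()` or `advance()` call, which is the behaviour the wrapping iterator is meant to guarantee.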

When document collection starts and DynamicPruningCollectorWrapper is in use, it collects all the documents at once by iterating over every document matched by the disjunction query created above.
Dynamic pruning step when collecting a document - when a match is found, all the terms for that document are enumerated and collected for the cardinality computation. Once done, the sub-scorer DISI corresponding to each collected term can be safely removed from the DisjunctionWithDynamicPruningScorer by calling removeAllDISIsOnCurrentDoc(). Once all docs are collected, we can immediately throw CollectionTerminatedException to terminate the query early.
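The collection flow described above can be sketched as follows, again in plain Python and with hypothetical names (`collect_with_pruning`, `leaf_collector_wrapper`, `search_segment`). The parent query is modelled as a list of matching doc IDs, and `CollectionTerminated` stands in for Lucene's CollectionTerminatedException.

```python
from bisect import bisect_left

class CollectionTerminated(Exception):
    """Stand-in for Lucene's CollectionTerminatedException."""

def collect_with_pruning(parent_docs, postings, doc_terms):
    """parent_docs: doc IDs matched by the parent query.
    postings: term -> sorted doc IDs; doc_terms: doc ID -> terms on that doc.
    Returns (collected terms, doc IDs on which terms were collected)."""
    parent = set(parent_docs)
    cursors = {term: 0 for term, docs in postings.items() if docs}
    collected, visited = set(), []
    while cursors:
        # Candidate doc: smallest doc any remaining term cursor points to.
        doc = min(postings[t][pos] for t, pos in cursors.items())
        if doc in parent:  # conjunction with the parent query
            visited.append(doc)
            for term in doc_terms[doc]:
                collected.add(term)
                cursors.pop(term, None)  # prune the term's cursor
        # Advance the surviving cursors past the current doc.
        for term in list(cursors):
            docs = postings[term]
            pos = bisect_left(docs, doc + 1, cursors[term])
            if pos < len(docs):
                cursors[term] = pos
            else:
                del cursors[term]
    return collected, visited

def leaf_collector_wrapper(parent_docs, postings, doc_terms, sink):
    # Collect everything up front, then signal that the regular
    # per-document collection for this segment can be skipped.
    collected, _ = collect_with_pruning(parent_docs, postings, doc_terms)
    sink.update(collected)
    raise CollectionTerminated()

def search_segment(parent_docs, postings, doc_terms):
    sink = set()
    try:
        leaf_collector_wrapper(parent_docs, postings, doc_terms, sink)
    except CollectionTerminated:
        pass  # early termination is the expected outcome
    return sink

postings = {"a": [0, 1, 2], "b": [1, 4], "c": [2, 3, 5], "d": [0]}
doc_terms = {0: {"a", "d"}, 1: {"a", "b"}, 2: {"a", "c"},
             3: {"c"}, 4: {"b"}, 5: {"c"}}
terms, visited = collect_with_pruning([1, 3, 5], postings, doc_terms)
```

Here the parent query matches docs 1, 3 and 5, but doc 5 is never visited because its only term was already collected on doc 3. Term "d" is correctly left out because it appears only on a doc the parent query does not match.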

The code change contains a test which covers a happy case - https://github.com/rishabhmaurya/OpenSearch/pull/74/files#diff-8c88c6062265deccbf9f504a86750ae8f6e1ae53350f91f8a226e7886d6c3e7cR101

kkmr · 2024-02-19T19:53:49Z

I will take this forward

bowenlan-amzn · 2024-05-03T15:56:07Z

Will work on this.

bowenlan-amzn · 2024-05-28T18:29:37Z

TODO

Find a suitable benchmark for the cardinality aggregation

rishabhmaurya added the enhancement and untriaged labels Jan 21, 2024

github-actions bot added the Search:Aggregations label Jan 21, 2024

rishabhmaurya removed the untriaged label Jan 21, 2024

getsaurabh02 added the v2.13.0 label Feb 5, 2024

rishabhmaurya mentioned this issue Feb 14, 2024

[DRAFT] Cardinality aggregation dynamic pruning changes (to be used only for prototype and reference purpose, not intended to merge to main) #12323

Closed

8 tasks

rishabhmaurya mentioned this issue Feb 15, 2024

Cardinality aggregation dynamic pruning changes rishabhmaurya/OpenSearch#74

Draft

8 tasks

jainankitk mentioned this issue Feb 20, 2024

[RFC] Query Planning and Rewriting #12390

Open

rishabhmaurya assigned kkmr Feb 21, 2024

rishabhmaurya mentioned this issue Mar 4, 2024

[Workload Improvement] Adding single term aggregation task in workloads opensearch-project/opensearch-benchmark-workloads#165

Open

getsaurabh02 added the v2.14.0 label and removed the v2.13.0 label Mar 13, 2024

harshavamsi added the v2.15.0 label and removed the v2.14.0 label May 2, 2024

bowenlan-amzn assigned bowenlan-amzn and unassigned kkmr May 3, 2024

bowenlan-amzn mentioned this issue May 24, 2024

Support Dynamic Pruning in Cardinality Aggregation #13821

Merged

9 tasks

mch2 closed this as completed in #13821 Jun 11, 2024

bowenlan-amzn mentioned this issue Jun 11, 2024

[Backport 13821] Support Dynamic Pruning in Cardinality Aggregation #14203

Merged

2 tasks
