
Use Collector.setWeight to improve aggregation performance (for special cases) #10954

Closed
msfroh opened this issue Oct 27, 2023 · 2 comments · Fixed by #11643
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance v2.13.0 Issues and PRs related to version 2.13.0 v3.0.0 Issues and PRs related to version 3.0.0

Comments

@msfroh
Collaborator

msfroh commented Oct 27, 2023

Lucene added a new setWeight method to the Collector interface a while back (see https://issues.apache.org/jira/browse/LUCENE-10620), specifically to give collectors access to the Weight.count() method.

Weight.count() returns -1 (meaning "I can't give you a cheap count") in most cases, but the cases where it does return a real count are pretty useful -- mostly "match all" or "match none", though for a single-term query it will return "I match exactly this many" if there are no deletions in the current segment (since it can just read the term's doc freq).
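To illustrate the hook being discussed, here is a minimal sketch of a collector capturing the weight via setWeight and later asking it for a cheap per-segment count. The Weight, LeafContext, and Collector types below are simplified stand-ins for Lucene's real interfaces, just to show the shape of the interaction; Lucene's actual Collector.setWeight is a default no-op method.

```java
// Simplified stand-in for Lucene's Weight: count() returns the exact number of
// matches in a segment, or -1 when that can't be computed cheaply.
interface Weight {
    int count(LeafContext ctx);
}

// Simplified stand-in for LeafReaderContext.
interface LeafContext {
    int maxDoc();
}

// Simplified stand-in for Lucene's Collector with the setWeight hook.
abstract class Collector {
    protected Weight weight; // captured so leaf collection can consult count()

    public void setWeight(Weight weight) {
        this.weight = weight;
    }
}

// A hypothetical aggregator that asks for a cheap count hint before
// deciding how to collect a segment.
class HintedAggregator extends Collector {
    int cheapCount(LeafContext ctx) {
        return (weight == null) ? -1 : weight.count(ctx);
    }
}
```

If the search framework never calls setWeight, the hint simply degrades to -1 and the aggregator falls back to normal per-document collection.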

I believe this can be useful to short-circuit some aggregation logic, since aggregations all extend Collector.

These are the special cases that I've been able to think of where the weight.count(leafReaderContext) could hint at smarter computation of aggregations:

  1. If the top-level query matches nothing in the current segment (i.e. weight.count(leafReaderContext) == 0), then the count for every bucket is 0. (If the min count is greater than 0, then you don't need to compute any buckets for this segment.)
  2. If the top-level query matches everything in the current segment (i.e. weight.count(leafReaderContext) == leafReaderContext.reader().maxDoc()), then the count of hits in a bucket (from the current segment) is determined entirely by the count of the bucket, which may be cheap to compute (e.g. doc freq for a terms aggregation, maybe read count from the BKD tree for a range aggregation).
  3. If the top-level query has some other positive count, but a bucket matches everything in the current segment (e.g. the documents in the current segment are all from the same day and we're computing a daily date histogram), then the bucket count is weight.count(leafReaderContext).

I didn't give it a lot of thought, so there might be some more that I'm missing.
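The per-segment decision implied by cases 1 and 2 above can be sketched as a small dispatch on the count hint. This is a hypothetical illustration, not OpenSearch code; COUNT_UNKNOWN stands in for Lucene's -1 sentinel, and case 3 (a bucket matching the whole segment) would need per-bucket logic not shown here.

```java
public class CountHintSketch {
    static final int COUNT_UNKNOWN = -1;

    enum Strategy { SKIP_SEGMENT, COUNT_FROM_BUCKETS, DEFAULT_COLLECT }

    // queryCount: result of weight.count(leafReaderContext)
    // maxDoc: leafReaderContext.reader().maxDoc()
    static Strategy choose(int queryCount, int maxDoc) {
        if (queryCount == 0) {
            // Case 1: query matches nothing in this segment; every bucket is 0.
            return Strategy.SKIP_SEGMENT;
        } else if (queryCount == maxDoc) {
            // Case 2: query matches everything; bucket counts come from the
            // buckets themselves (doc freq, BKD counts, etc.).
            return Strategy.COUNT_FROM_BUCKETS;
        }
        // Unknown (-1) or partial match: fall back to per-document collection.
        return Strategy.DEFAULT_COLLECT;
    }

    public static void main(String[] args) {
        System.out.println(choose(0, 100));            // SKIP_SEGMENT
        System.out.println(choose(100, 100));          // COUNT_FROM_BUCKETS
        System.out.println(choose(COUNT_UNKNOWN, 100)); // DEFAULT_COLLECT
    }
}
```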

@msfroh added the enhancement and Search:Performance labels and removed the untriaged label on Oct 27, 2023
@msfroh
Collaborator Author

msfroh commented Oct 27, 2023

I think I would probably start with TermsAggregator (and its subclasses) to try to leverage this for cases where the top-level query matches all documents.
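For the terms-aggregation case specifically, the short-circuit amounts to: when the query matches every document in a segment (and there are no deletions), each term bucket's count is just that term's doc freq. The following is a hypothetical sketch of that decision; termDocFreqs stands in for counts that Lucene's TermsEnum.docFreq() would supply per term.

```java
import java.util.Map;

class TermsShortCircuit {
    // queryCount: result of weight.count(leafReaderContext)
    // maxDoc: total docs in the segment (no deletions assumed)
    // termDocFreqs: per-term doc freq, as TermsEnum.docFreq() would report
    static Map<String, Integer> bucketCounts(int queryCount, int maxDoc,
                                             Map<String, Integer> termDocFreqs) {
        if (queryCount == maxDoc) {
            // Query matches every doc: bucket counts equal the doc freqs,
            // so no per-document collection is needed for this segment.
            return termDocFreqs;
        }
        // Otherwise signal the caller to fall back to normal collection.
        return null;
    }
}
```

A global-ordinals implementation would presumably do the same thing per ordinal rather than per term string, but the principle is identical.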

@getsaurabh02
Member

I like the idea of starting with TermsAggregator, given it can benefit the string-valued (global ordinals), map-based string, and numeric terms aggregation cases. In scenarios where the top-level query matches everything in the current segment, I think the constraint still holds that there must be no deletions in the current segment to be able to short-circuit (e.g. the bucket count)?

Should we try instrumenting this flow and benchmarking it against a time-series data set?
cc: @sandeshkr419
