Added aggregation precomputation for rare terms #18106

ajleong623 · 2025-04-28T16:32:39Z

Description

This change builds on the techniques from @sandeshkr419's pull request #11643 to precompute aggregations for match-all or match-none queries. When the field is indexed and the segment has no deletions, we can precompute the aggregation by reading from the termsEnum. We can check that no documents are deleted by comparing the count returned by the weight with the reader's maxDoc.
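To make the guard concrete, here is a minimal sketch of the idea using plain Java collections rather than Lucene. `PrecomputeSketch`, `canPrecompute`, `precompute`, and the `docFreqByTerm` map are hypothetical names for illustration only; the map stands in for the per-term document frequencies that the real code would read from the termsEnum, and `queryMatchCount` stands in for the count returned by the weight.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

public class PrecomputeSketch {

    // The shortcut is valid only when the query matched every document slot in the
    // segment; a deleted document would make the match count fall below maxDoc.
    // (Assumption: a negative "count unknown" value also fails this comparison.)
    static boolean canPrecompute(int queryMatchCount, int maxDoc) {
        return queryMatchCount == maxDoc;
    }

    // Returns bucket counts copied from per-term document frequencies, or an empty
    // Optional when the caller has to fall back to per-document collection.
    static Optional<Map<String, Long>> precompute(int queryMatchCount, int maxDoc, Map<String, Integer> docFreqByTerm) {
        if (canPrecompute(queryMatchCount, maxDoc) == false) {
            return Optional.empty();
        }
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().longValue());
        }
        return Optional.of(counts);
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqByTerm = Map.of("error", 3, "warn", 1);
        // All 4 documents matched and none are deleted: the shortcut applies.
        System.out.println(precompute(4, 4, docFreqByTerm).isPresent());
        // Only 3 of 4 document slots matched: fall back to normal collection.
        System.out.println(precompute(3, 4, docFreqByTerm).isPresent());
    }
}
```

The point of the sketch is that the bucket counts never touch individual documents once the guard passes, which is why the guard has to rule out both deletions and partial matches.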

Unfortunately, I was not able to use the same technique for numeric aggregators like LongRareTermsAggregator. Numeric points are not indexed by term frequency; they are stored in KD-trees, which are optimized for other kinds of operations (see https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/PointValues.java).

Please let me know if there are any comments, concerns or suggestions.

Related Issues

Resolves #13123
#13122
#10954

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2025-04-28T16:55:08Z

❌ Gradle check result for f6371a2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-28T18:31:03Z

❌ Gradle check result for 844164e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-04-28T19:16:41Z

❌ Gradle check result for 0f3bd75: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

ajleong623 · 2025-04-28T23:25:52Z

It looks like the test org.opensearch.cache.common.tier.TieredSpilloverCacheStatsIT.testClosingShard is failing. However, when I ran this test on my local computer, it passed. What could be happening?

Edit: Sorry, but it actually looks like the test did not pass on my system. I also tested it on the current codebase without any changes that I made, and it did not pass. Therefore, I do not think that my code affects the test.

github-actions · 2025-06-30T02:03:55Z

❌ Gradle check result for b5e08d8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-06-30T02:25:30Z

❌ Gradle check result for ebca7e1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-06-30T03:11:33Z

❌ Gradle check result for 9d73b57: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-06-30T07:14:32Z

✅ Gradle check result for b4a4128: SUCCESS

ajleong623 · 2025-06-30T19:51:07Z

@sandeshkr419 I believe all the comments have been addressed. Rather than creating a new class to also return the expected count of the missing aggregation, I simply added a check in the searchAndReduceCounting function. I also replaced many of the nondeterministic tests with deterministic ones, and I added extra tests for better coverage.

The other action item is adding the workloads to the opensearch-benchmark-workloads repository. Do I just add those query bodies in the big5/queries folder?

github-actions · 2025-06-30T21:51:07Z

❌ Gradle check result for b60c221: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

sandeshkr419 · 2025-06-30T21:44:04Z

.../src/main/java/org/opensearch/search/aggregations/bucket/terms/MapStringTermsAggregator.java

+        // TODO: A note is that in scripted aggregations, the way of collecting from buckets is determined from
+        // the script aggregator. For now, we will not be able to support the script aggregation.
+
+        if (subAggregators.length > 0 || includeExclude != null || fieldName == null) {


You can pull up null checks for weight and config here so that you don't have to assert it again.

Right now you check config != null twice, and you evaluate weight.count(ctx) == ctx.reader().getDocCount(fieldName) before checking weight == null.

ajleong623 · 2025-06-30

We might be able to proceed if config == null. However, if there is a script, or if there is both a missing parameter and actual missing values, we cannot use the precomputation optimization. I can move up the weight check, though.

sandeshkr419 · 2025-06-30T21:44:44Z

.../src/main/java/org/opensearch/search/aggregations/bucket/terms/MapStringTermsAggregator.java

+        // field missing, we might not be able to use the index unless there is some way we can
+        // calculate which ordinal value that missing field is (something I am not sure how to
+        // do yet).
+        if (config != null && config.missing() != null && ((weight.count(ctx) == ctx.reader().getDocCount(fieldName)) == false)) {


nit: use weight.count(ctx) != ctx.reader().getDocCount(fieldName) instead of comparing the equality check to false.

ajleong623 · 2025-07-01

Right. I looked at the formatting guidelines again, and the == false form is only needed in place of a unary negation.
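The review notes so far (null checks first, then a plain != comparison) could be combined roughly as follows. This is only a sketch under stated assumptions: `GuardOrderSketch` and `canUsePrecomputation` are hypothetical names, an `IntSupplier` stands in for `weight.count(ctx)`, a plain `Object` stands in for `config.missing()`, and `docsWithField` stands in for `ctx.reader().getDocCount(fieldName)`. It is not the code from the PR.

```java
import java.util.function.IntSupplier;

public class GuardOrderSketch {

    // weight: stand-in for the query weight's match count (null when unavailable).
    // missingValue: the configured "missing" parameter, or null when none is set.
    // docsWithField: number of documents in the segment that have a value for the field.
    static boolean canUsePrecomputation(IntSupplier weight, Object missingValue, int docsWithField) {
        // Null check first, so the count below never dereferences a null weight.
        if (weight == null) {
            return false;
        }
        // A missing parameter only blocks the shortcut when some matched documents lack the field.
        if (missingValue != null && weight.getAsInt() != docsWithField) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(canUsePrecomputation(null, null, 5));     // no weight: fall back
        System.out.println(canUsePrecomputation(() -> 5, "N/A", 5)); // every document has the field
        System.out.println(canUsePrecomputation(() -> 5, "N/A", 4)); // one document lacks the field
    }
}
```

Ordering the checks this way keeps each condition readable on its own and avoids repeating the null checks further down the method.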

sandeshkr419 · 2025-06-30T21:45:22Z

.../src/main/java/org/opensearch/search/aggregations/bucket/terms/MapStringTermsAggregator.java

+
+        // The optimization could only be used if there are no deleted documents and the top-level
+        // query matches all documents in the segment.
+        if (weight == null) {


nit: Moving this null check toward the start of the method would make this more readable.

sandeshkr419 · 2025-06-30T21:46:40Z

.../src/main/java/org/opensearch/search/aggregations/bucket/terms/MapStringTermsAggregator.java

+            if (bucketOrdinal < 0) { // already seen
+                bucketOrdinal = -1 - bucketOrdinal;
+            }
+            int amount = stringTermsEnum.docFreq();


nit: rename amount to docCount or docFreq
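For readers unfamiliar with the `-1 - bucketOrdinal` step in the diff above, here is a small stand-alone sketch of that convention, assuming (as the `// already seen` comment suggests) that adding an existing key returns a negative encoding of its ordinal. `BucketOrdsSketch` and its methods are hypothetical and use plain Java collections; they are not the OpenSearch bucket-ordinal classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BucketOrdsSketch {
    private final Map<String, Long> ordinalByTerm = new HashMap<>();
    private final List<Long> docCounts = new ArrayList<>();

    // Returns a fresh ordinal for a new term, or (-1 - ordinal) when the term was already added.
    long add(String term) {
        Long existing = ordinalByTerm.get(term);
        if (existing != null) {
            return -1 - existing;
        }
        long ordinal = ordinalByTerm.size();
        ordinalByTerm.put(term, ordinal);
        docCounts.add(0L);
        return ordinal;
    }

    // Adds a whole term's document frequency to its bucket in one step.
    void collectTerm(String term, int docCount) {
        long bucketOrdinal = add(term);
        if (bucketOrdinal < 0) { // already seen: decode back to the real ordinal
            bucketOrdinal = -1 - bucketOrdinal;
        }
        int index = (int) bucketOrdinal;
        docCounts.set(index, docCounts.get(index) + docCount);
    }

    // Returns the accumulated count for a term, or 0 when the term has no bucket.
    long docCount(String term) {
        Long ordinal = ordinalByTerm.get(term);
        return ordinal == null ? 0L : docCounts.get(ordinal.intValue());
    }

    public static void main(String[] args) {
        BucketOrdsSketch ords = new BucketOrdsSketch();
        ords.collectTerm("error", 3);
        ords.collectTerm("warn", 1);
        ords.collectTerm("error", 2); // second segment contributes to the same bucket
        System.out.println(ords.docCount("error"));
    }
}
```

The sketch names the per-term value `docCount`, as the nit above suggests, so the variable reads as a document frequency rather than a generic amount.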

sandeshkr419 · 2025-06-30T21:49:02Z

.../src/main/java/org/opensearch/search/aggregations/bucket/terms/MapStringTermsAggregator.java

+                bucketOrdinal = -1 - bucketOrdinal;
+            }
+            int amount = stringTermsEnum.docFreq();
+            if (resultStrategy instanceof SignificantTermsResults) {


nit:

if (resultStrategy instanceof SignificantTermsResults sigTermsResultStrategy) { sigTermsResultStrategy.updateSubsetSizes(0L, docCount); }

sandeshkr419 · 2025-06-30T21:51:09Z

server/src/main/java/org/opensearch/search/aggregations/bucket/missing/MissingAggregator.java

+        if (fieldName == null) {
+            // The optimization does not work when there are subaggregations or if there is a filter.
+            // The query has to be a match all, otherwise
+            //


I think the comment is misplaced here.
Can you please check the comments across the entire PR once? Also, please remove empty comment lines.

github-actions · 2025-07-01T02:17:08Z

❌ Gradle check result for 0375104: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-07-02T05:51:48Z

✅ Gradle check result for 86a23cb: SUCCESS

opensearch-trigger-bot · 2025-08-03T15:23:00Z

This PR is stalled because it has been open for 30 days with no activity.

sandeshkr419 · 2025-08-06T21:55:20Z

Since you already have a new rebased PR, I'm closing this one to reduce noise. I'll continue reviewing the new PR.

github-actions bot added Search:Aggregations Search:Performance labels Apr 28, 2025

ajleong623 marked this pull request as ready for review April 28, 2025 16:35

ajleong623 requested review from Bukhtawar, CEHENKLE, Rishikesh1159, VachaShah, anasalkouz, andrross, ashking94, cwperks, dbwiddis, gbbafna, jed326, kotwanikunal, mch2, msfroh, nknize, owaiskazi19, reta, sachinpkale, saratvemulapalli, shwetathareja and sohami as code owners April 28, 2025 16:35

ajleong623 marked this pull request as draft April 28, 2025 17:10

ajleong623 marked this pull request as ready for review April 29, 2025 00:16

opensearch-infra bot added the lucene label Jun 27, 2025

added edits to missing terms agg tests

b5e08d8

Signed-off-by: Anthony Leong <aj.leong623@gmail.com>

spotless

ebca7e1

Signed-off-by: Anthony Leong <aj.leong623@gmail.com>

test spotless check

9d73b57

Signed-off-by: Anthony Leong <aj.leong623@gmail.com>

ajleong623 added 2 commits June 29, 2025 22:49

added new expected counts tests for string rare aggregation tests and…

66171ca

… completed action items Signed-off-by: Anthony Leong <aj.leong623@gmail.com>

spotless

b4a4128

Signed-off-by: Anthony Leong <aj.leong623@gmail.com>

mode tests more deterministic and improved coverage

b60c221

Signed-off-by: Anthony Leong <aj.leong623@gmail.com>

sandeshkr419 reviewed Jun 30, 2025

checked comments, removed more nondeterminism, and reformatted

0375104

Signed-off-by: Anthony Leong <aj.leong623@gmail.com>

This was referenced Jul 1, 2025

[AUTOCUT] Gradle Check Flaky Test Report for IndexActionIT #16576

Open

[AUTOCUT] Gradle Check Flaky Test Report for ClientYamlTestSuiteIT #14319

Open

tests pass, I think

86a23cb

Signed-off-by: Anthony Leong <aj.leong623@gmail.com>

opensearch-ci-bot mentioned this pull request Jun 27, 2025

[AUTOCUT] Gradle Check Flaky Test Report for SmokeTestMultiNodeClientYamlTestSuiteIT #14408

Open

opensearch-trigger-bot bot added the stalled Issues that have stalled label Aug 3, 2025

ajleong623 mentioned this pull request Aug 5, 2025

Aggregation precomputation (rebased) #18927

Draft

3 tasks

sandeshkr419 closed this Aug 6, 2025

github-project-automation bot moved this from In Progress to Done in Performance Roadmap Aug 6, 2025

opensearch-ci-bot mentioned this pull request Aug 23, 2025

[AUTOCUT] Gradle Check Flaky Test Report for IngestFromKafkaIT #17215

Open

bowenlan-amzn removed their assignment Aug 25, 2025

ajleong623 mentioned this pull request Nov 20, 2025

Aggregation precomputation missing terms #19627

Open

3 tasks
