Add Hybrid Cardinality collector to prioritize Ordinals Collector #19524
base: main
Conversation
❌ Gradle check result for a2f5dd7: null. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 41a9e69: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for c142ac4: null. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 88989f3: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Force-pushed 88989f3 to c142ac4 (compare)
❌ Gradle check result for c142ac4: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Force-pushed c142ac4 to 06ce5c3 (compare)

❌ Gradle check result for 06ce5c3: null. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Force-pushed 06ce5c3 to fc328a2 (compare)
❌ Gradle check result for fc328a2: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Benchmark Baseline Comparison Results (Job: https://build.ci.opensearch.org/job/benchmark-compare/174/)
Benchmarks look okay; most aggregations appear slightly regressed, likely due to run-to-run variance.

@sandeshkr419 Thanks for running the benchmark. I am wondering why this change did not help cardinality aggregation queries. Can you help me set up this workload? I think we should debug this to understand.
{"run-benchmark-test": "id_3"}

The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/4954/ . Final results will be published once the job is completed.

{"run-benchmark-test": "id_3"}

@anandpatel9998 I have re-triggered the runs a few times to reduce the variance seen in the first run.

The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/4955/ . Final results will be published once the job is completed.

{"run-benchmark-test": "id_3"}

The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/4956/ . Final results will be published once the job is completed.
Benchmark Results (Job: https://build.ci.opensearch.org/job/benchmark-pull-request/4954/)

Benchmark Baseline Comparison Results (Job: https://build.ci.opensearch.org/job/benchmark-compare/176/)

Benchmark Results (Job: https://build.ci.opensearch.org/job/benchmark-pull-request/4955/)

Benchmark Baseline Comparison Results (Job: https://build.ci.opensearch.org/job/benchmark-compare/177/)

Benchmark Results (Job: https://build.ci.opensearch.org/job/benchmark-pull-request/4956/)

Benchmark Baseline Comparison Results (Job: https://build.ci.opensearch.org/job/benchmark-compare/178/)
@anandpatel9998 I do not think big5 has any query where the switch from ordinals to direct will happen, so we probably need to create such a query. Still, it is good to verify that there are no regressions on big5.
Walkthrough

This PR introduces a hybrid cardinality collector feature that automatically switches from an Ordinals-based collector to a Direct (hash-based) collector when memory usage exceeds a configured threshold. It includes new cluster settings for configuration, integration with search contexts, and test updates validating hybrid collector behavior.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant SearchContext
    participant CardinalityAggregator
    participant Collector
    participant CollectorSwitch

    Client->>SearchContext: Initialize search
    SearchContext->>SearchContext: evaluateCardinalityAggregationContext()
    SearchContext->>SearchContext: Read cluster settings (hybrid enabled, memory threshold)
    SearchContext->>CardinalityAggregator: Pass CardinalityAggregationContext
    Client->>CardinalityAggregator: Execute cardinality aggregation
    CardinalityAggregator->>CardinalityAggregator: pickCollector() - check if hybrid enabled
    alt Hybrid Enabled
        CardinalityAggregator->>Collector: Create HybridCollector(OrdinalsCollector)
        Note over Collector: OrdinalsCollector configured with memory threshold & callback
    else Hybrid Disabled
        CardinalityAggregator->>Collector: Create collector by existing heuristic
    end
    Client->>Collector: collect() during aggregation
    Collector->>Collector: Accumulate values, track memory usage
    alt Memory Threshold Exceeded
        Collector->>CollectorSwitch: Trigger onMemoryLimitReached callback
        CollectorSwitch->>CollectorSwitch: Switch activeCollector to DirectCollector
        CollectorSwitch->>Collector: Continue collection with DirectCollector
        Note over CollectorSwitch: hybridCollectorsUsed counter incremented
    end
    Collector->>Collector: postCollect() and close()
    CardinalityAggregator->>CardinalityAggregator: collectDebugInfo (report hybrid_collectors_used)
    CardinalityAggregator-->>Client: Return aggregation results
```
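The switching behavior in the walkthrough can be sketched in plain Java. This is a hedged, self-contained illustration, not the PR's actual code: the class and method names (`HybridCardinalitySketch`, `switchToDirect`), the bitset-vs-hash-set stand-ins for the Ordinals and Direct collectors, and the 8-bytes-per-entry memory estimate are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the hybrid strategy: start with an
// ordinals-style collector (a bitset over the ordinal space, fast
// but memory-hungry) and switch to a direct hash-based collector
// once the estimated memory use crosses a threshold.
class HybridCardinalitySketch {
    private final boolean[] seenOrds;     // "ordinals" phase
    private Set<Long> directSeen;         // "direct" phase; null until the switch
    private final long memoryThresholdBytes;
    private long estimatedBytes = 0;
    private long ordCount = 0;

    HybridCardinalitySketch(int maxOrd, long memoryThresholdBytes) {
        this.seenOrds = new boolean[maxOrd];
        this.memoryThresholdBytes = memoryThresholdBytes;
    }

    boolean switched() {
        return directSeen != null;
    }

    void collect(long ord) {
        if (directSeen != null) {
            directSeen.add(ord);
            return;
        }
        if (!seenOrds[(int) ord]) {
            seenOrds[(int) ord] = true;
            ordCount++;
            estimatedBytes += 8;          // crude per-entry overhead estimate
        }
        if (estimatedBytes > memoryThresholdBytes) {
            switchToDirect();
        }
    }

    // Reuse work already done: replay the ordinals seen so far into the
    // direct collector instead of rebuilding the aggregation from scratch.
    private void switchToDirect() {
        directSeen = new HashSet<>();
        for (int i = 0; i < seenOrds.length; i++) {
            if (seenOrds[i]) {
                directSeen.add((long) i);
            }
        }
    }

    long cardinality() {
        return directSeen != null ? directSeen.size() : ordCount;
    }
}
```

The key property the PR relies on is visible in `switchToDirect`: the cardinality result is identical whether or not the switch fires, because accumulated state is carried over rather than discarded.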
Estimated code review effort: 🎯 4 (Complex), ⏱️ ~45 minutes
Pre-merge checks: ❌ failed checks (1 warning); ✅ passed checks (4 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/src/main/java/org/opensearch/search/aggregations/metrics/CardinalityAggregator.java (1)
650-666: Critical: Undefined variables and incorrect callback invocation will cause compilation errors.

This code has multiple issues:
- Line 655 uses `bucketOrd`, but the method parameter is `bucket`
- Line 660 references `doc`, which is not available in `getBitArray` (it is a parameter of `collect`, not of this method)
- The callback signature `BiConsumer<Integer, Long>` expects `(doc, bucketOrd)`, but `getBitArray` doesn't have access to `doc`

The memory limit check should be performed in `collect()`, where both `doc` and `bucketOrd` are available, not in `getBitArray()`.

Consider restructuring to check memory in `collect()`:

```diff
 private BitArray getBitArray(long bucket) {
     visitedOrds = bigArrays.grow(visitedOrds, bucket + 1);
     BitArray bits = visitedOrds.get(bucket);
     if (bits == null) {
         bits = new BitArray(maxOrd, bigArrays);
-        visitedOrds.set(bucketOrd, bits);
-
-        if (onMemoryLimitReached.isPresent()) {
-            currentMemoryUsage += memoryOverhead(maxOrd);
-            if (currentMemoryUsage > memoryThreshold) {
-                onMemoryLimitReached.get().accept(doc, bucketOrd);
-                return;
-            }
-        }
+        visitedOrds.set(bucket, bits);
+        if (onMemoryLimitReached.isPresent()) {
+            currentMemoryUsage += memoryOverhead(maxOrd);
+        }
     }
     return bits;
 }
+
+public boolean isMemoryLimitExceeded() {
+    return currentMemoryUsage > memoryThreshold;
+}
```

Then check and trigger the callback in `collect()` after `getBitArray()` returns.
🧹 Nitpick comments (4)
test/framework/src/main/java/org/opensearch/search/aggregations/AggregatorTestCase.java (1)
549-551: Consider aligning test defaults with production defaults.

The mock sets `hybridCollectorEnabled=false`, but the production setting `CARDINALITY_AGGREGATION_HYBRID_COLLECTOR_ENABLED` defaults to `true`. While this is acceptable for backward compatibility of existing tests, consider documenting this discrepancy or using a consistent default to ensure tests exercise the default production code path.

```diff
+// Note: hybridCollectorEnabled=false differs from the production default (true) to maintain backward compatibility for existing tests
 when(searchContext.cardinalityAggregationContext()).thenReturn(
     new org.opensearch.search.aggregations.metrics.CardinalityAggregationContext(false, Runtime.getRuntime().maxMemory() / 100)
 );
```

server/src/test/java/org/opensearch/search/aggregations/metrics/CardinalityAggregatorTests.java (1)
703-748: Good test coverage for hybrid collector behavior.

The three new tests provide solid coverage:
- `testHybridCollectorEnabledWithKeywordField` verifies that `HybridCollector` wraps `OrdinalsCollector` when memory is under the threshold
- `testHybridCollectorDisabledWithKeywordField` verifies the fallback to a direct `OrdinalsCollector` when hybrid is disabled
- `testHybridCollectorMemoryThresholdExceeded` verifies switching to `DirectCollector` when the threshold is exceeded

One minor observation: the `testHybridCollectorMemoryThresholdExceeded` test uses `memoryThreshold=1` byte, which triggers switching immediately. Consider adding a comment explaining this intentional behavior:

```diff
 public void testHybridCollectorMemoryThresholdExceeded() throws IOException {
     MappedFieldType fieldType = new KeywordFieldMapper.KeywordFieldType("field");
     final CardinalityAggregationBuilder aggregationBuilder = new CardinalityAggregationBuilder("_name").field("field");
     testAggregationHybridCollector(aggregationBuilder, new MatchAllDocsQuery(), iw -> {
         // Add many documents to potentially exceed memory threshold
         for (int i = 0; i < 1000; i++) {
             iw.addDocument(singleton(new SortedSetDocValuesField("field", new BytesRef("value" + i))));
         }
     }, card -> {
         assertEquals(1000.0, card.getValue(), 10.0);
         assertTrue(AggregationInspectionHelper.hasValue(card));
     }, collector -> {
         assertTrue(collector instanceof HybridCollector);
         assertTrue(((HybridCollector) collector).getActiveCollector() instanceof CardinalityAggregator.DirectCollector);
-    }, fieldType, true, 1); // Very low threshold to trigger switching
+    }, fieldType, true, 1L); // 1 byte threshold forces immediate switch to DirectCollector
 }
```

server/src/main/java/org/opensearch/search/aggregations/metrics/CardinalityAggregationContext.java (1)
19-41: Clean encapsulation of cardinality settings.

The POJO design is clean and follows the pattern suggested in past review comments to reduce `SearchContext` bloat. The factory method correctly handles the `ByteSizeValue` conversion.

Consider adding input validation in the constructor to reject negative `memoryThreshold` values:

```diff
 public CardinalityAggregationContext(boolean hybridCollectorEnabled, long memoryThreshold) {
+    if (memoryThreshold < 0) {
+        throw new IllegalArgumentException("memoryThreshold must be non-negative");
+    }
     this.hybridCollectorEnabled = hybridCollectorEnabled;
     this.memoryThreshold = memoryThreshold;
 }
```

server/src/main/java/org/opensearch/search/aggregations/metrics/CardinalityAggregator.java (1)
890-898: Wrapping IOException in RuntimeException may mask errors.

The callback wraps `IOException` in an unchecked `RuntimeException`. While this is sometimes necessary for functional interfaces, consider:
- This exception will propagate up through `OrdinalsCollector.collect()` and `HybridCollector.collect()` as an unchecked exception
- The error message "Failed to switch to direct collector" doesn't preserve full context

Consider using a custom unchecked exception or `java.io.UncheckedIOException`:

```diff
 private void handleMemoryLimitReached(int doc, long bucketOrd) {
     try {
         switchToDirectCollector();
         // Collect the document that triggered the switch
         activeCollector.collect(doc, bucketOrd);
     } catch (IOException e) {
-        throw new RuntimeException("Failed to switch to direct collector", e);
+        throw new java.io.UncheckedIOException("Failed to switch to direct collector", e);
     }
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- `CHANGELOG.md` (1 hunk)
- `rest-api-spec/src/main/resources/rest-api-spec/test/search.aggregation/170_cardinality_metric.yml` (2 hunks)
- `rest-api-spec/src/main/resources/rest-api-spec/test/search.aggregation/170_cardinality_metric_unsigned.yml` (2 hunks)
- `server/src/main/java/org/opensearch/common/settings/ClusterSettings.java` (2 hunks)
- `server/src/main/java/org/opensearch/search/DefaultSearchContext.java` (6 hunks)
- `server/src/main/java/org/opensearch/search/aggregations/metrics/CardinalityAggregationContext.java` (1 hunk)
- `server/src/main/java/org/opensearch/search/aggregations/metrics/CardinalityAggregator.java` (10 hunks)
- `server/src/main/java/org/opensearch/search/internal/SearchContext.java` (2 hunks)
- `server/src/test/java/org/opensearch/search/aggregations/metrics/CardinalityAggregatorTests.java` (2 hunks)
- `test/framework/src/main/java/org/opensearch/search/aggregations/AggregatorTestCase.java` (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/main/java/org/opensearch/common/settings/ClusterSettings.java (1)
server/src/main/java/org/opensearch/search/aggregations/metrics/CardinalityAggregator.java (1)
`CardinalityAggregator` (91-926)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: gradle-check
- GitHub Check: Mend Security Check
🔇 Additional comments (12)
server/src/main/java/org/opensearch/common/settings/ClusterSettings.java (1)

165: LGTM! The new import and cluster settings for the hybrid cardinality collector are correctly added. The settings are appropriately grouped near the existing `CARDINALITY_AGGREGATION_PRUNING_THRESHOLD` setting. (Also applies to: 592-593.)

CHANGELOG.md (1)

30: LGTM! The changelog entry is correctly formatted and placed in the appropriate "Added" section for the unreleased 3.x version.

rest-api-spec/src/main/resources/rest-api-spec/test/search.aggregation/170_cardinality_metric_unsigned.yml (1)

268-291: LGTM! The test updates correctly reflect the new hybrid collector behavior:
- The skip block appropriately excludes versions before 3.4.0, when the feature wasn't available
- The assertion changes correctly verify that the hybrid collector path is used (`hybrid_collectors_used > 0`) instead of the direct ordinals collector path (`ordinals_collectors_used = 0`)

rest-api-spec/src/main/resources/rest-api-spec/test/search.aggregation/170_cardinality_metric.yml (1)

271-293: LGTM! The test updates are consistent with the unsigned variant and correctly reflect the hybrid collector behavior. The version skip has been updated to align with the 3.4.0 release, as discussed in previous review comments.
server/src/test/java/org/opensearch/search/aggregations/metrics/CardinalityAggregatorTests.java (2)
72-73: LGTM! The new imports for `HybridCollector` and `OrdinalsCollector` are correctly added to support the new hybrid collector tests.

640-701: LGTM, a well-structured test helper. The `testAggregationHybridCollector` helper method follows the established pattern from `testAggregationExecutionHint` and provides a clean way to test hybrid collector behavior with configurable `hybridCollectorEnabled` and `memoryThreshold` parameters.

server/src/main/java/org/opensearch/search/internal/SearchContext.java (1)

559-561: LGTM, a sensible default for the base class. The default implementation correctly provides a fallback context with the hybrid collector disabled and a conservative memory threshold (1% of max heap). This will be overridden by `DefaultSearchContext` when cluster settings are available.

server/src/main/java/org/opensearch/search/DefaultSearchContext.java (1)

1268-1277: LGTM, consistent with existing setting-evaluation patterns. The implementation correctly follows the established pattern for evaluating cluster settings in this class. The fallback when `clusterService` is null matches the base class default.

server/src/main/java/org/opensearch/search/aggregations/metrics/CardinalityAggregator.java (4)
185-203: LGTM, clean integration of the hybrid collector path. The logic correctly branches on `hybridCollectorEnabled`:
- When enabled: creates a `HybridCollector` with pre-computed hash values for a potential fallback
- When disabled: falls back to the existing memory-based heuristic between `OrdinalsCollector` and `HyperLogLogPlusPlus`

900-919: LGTM, resource management is correct. The resource lifecycle is handled properly:
- On switch: `ordinalsCollector` is post-collected and closed immediately, and `activeCollector` becomes the `DirectCollector`
- On close: only `activeCollector` is closed, avoiding a double-close of `ordinalsCollector`

608-629: Good backward compatibility with constructor overloading. The original constructor delegates with a `Long.MAX_VALUE` threshold and an empty `Optional`, ensuring existing callers see no behavioral change or overhead.

95-115: Verify settings registration in ClusterSettings.java. The new settings are well-designed with appropriate defaults and properties. Ensure `CARDINALITY_AGGREGATION_HYBRID_COLLECTOR_ENABLED` and `CARDINALITY_AGGREGATION_HYBRID_COLLECTOR_MEMORY_THRESHOLD` are registered in `ClusterSettings.java` so they take effect at runtime.
❌ Gradle check result for eb6724e: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Hello!
Force-pushed eb6724e to b077a63 (compare)
❕ Gradle check result for b077a63: UNSTABLE. Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Description
The current cardinality aggregator logic selects DirectCollector over OrdinalsCollector when the relative memory overhead of OrdinalsCollector (compared to DirectCollector) is high. Because of this relative memory consumption logic, DirectCollector is selected for high-cardinality aggregation queries. DirectCollector is slower than OrdinalsCollector, so this default selection leads to higher search latency even when the OpenSearch process has memory available to run the ordinals collector for faster query performance.
There is no way to determine the memory requirement for a nested aggregation up front, because buckets are created dynamically as we traverse the matching document ids. To overcome this limitation, this change introduces a hybrid collector that first uses the Ordinals Collector and switches to the Direct Collector if the Ordinals Collector's memory usage grows beyond a configured threshold. When the hybrid collector switches from the Ordinals Collector to the Direct Collector, it reuses the aggregation data already computed by the Ordinals Collector, so the aggregation result does not have to be rebuilt from scratch by the Direct Collector.
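The relative-overhead selection the first paragraph describes can be sketched as follows. This is an illustrative toy, not OpenSearch's actual heuristic: the class and method names (`CollectorChoiceSketch`, `pickCollector`), the one-bit-per-ordinal estimate, and all constants in the usage are assumptions made for the example.

```java
// Illustrative sketch of the existing selection heuristic: prefer the
// ordinals collector only when its memory overhead, relative to the
// direct collector, stays under a configured ratio. With a very large
// ordinal space (high cardinality) the ratio explodes and the slower
// direct collector is chosen, which is the latency problem this PR targets.
final class CollectorChoiceSketch {
    static String pickCollector(long maxOrd, long directOverheadBytes, double maxRelativeOverhead) {
        long ordinalsOverheadBytes = maxOrd / 8;  // roughly one bit per ordinal
        double relative = (double) ordinalsOverheadBytes / directOverheadBytes;
        return relative <= maxRelativeOverhead ? "ordinals" : "direct";
    }
}
```

The hybrid collector sidesteps this up-front guess entirely: it starts on the fast path and only pays the switch cost if the memory threshold is actually reached at collection time.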
Signed-off-by: Anand Pravinbhai Patel anapat@amazon.com
Related Issues
Resolves #19260
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Summary by CodeRabbit
New Features
- New cluster settings: `search.aggregations.cardinality.hybrid_collector.enabled` and `search.aggregations.cardinality.hybrid_collector.memory_threshold`.