Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add base class for merging and accumulating custom objects #12685

Merged
merged 1 commit into from
Mar 25, 2024

Conversation

davecromberge
Copy link
Member

@davecromberge davecromberge commented Mar 21, 2024

Adds a base class for common capabilities to allow any custom object to use an internal array buffer at query time to offset the cost of expensive merges for more memory usage. The Apache Datasketches Theta, Tuple and Cpc sketches initially make use of this functionality.

The changes in this PR introduce a different intermediate result type for the Datasketches CPC and Tuple aggregation functions. The change has already been introduced for Theta sketches in #12042. The reason for using merges with more inputs so is to amortize the cost of intermediate bookkeeping datastructures necessary to perform unions on two sketches at a time. The end user is in control of the degree to which the internal inputs accumulate before being merged.

For Tuple sketches, the "early-stop" optimisation in set operations circumvents further processing when retained items fall above the minimum theta value (Broder rule). This applies to other set operation expressions as well.

When using JMH benchmarks to simulate this scenario, the speedup achieved by accumulating sketches prior to union is often an improvement by a factor of 3.

Reference:
apache/datasketches-java#326 (comment)
https://datasketches.apache.org/docs/Theta/ThetaSize.html

Note:
This PR should be tagged as upgrade-incompat for users who are making use of the ApacheDatasketches Tuple sketch.

Release notes:

  • additional parameters are now permitted for the distinctCountCpcSketch and distinctCountTupleSketch query aggregation functions to control merge thresholds.

This allows any custom object to make use of an internal array buffer at
query time to offset the cost of expensive merges for more memory usage.
The Apache Datasketches Theta, Tuple and Cpc sketches initially make
use of this functionality.
@codecov-commenter
Copy link

codecov-commenter commented Mar 21, 2024

Codecov Report

Attention: Patch coverage is 30.20833% with 201 lines in your changes are missing coverage. Please review.

Project coverage is 61.42%. Comparing base (59551e4) to head (4035d4e).
Report is 154 commits behind head on master.

Files Patch % Lines
...ion/DistinctCountCPCSketchAggregationFunction.java 2.17% 90 Missing ⚠️
...unction/IntegerTupleSketchAggregationFunction.java 2.63% 74 Missing ⚠️
...nt/local/customobject/CustomObjectAccumulator.java 75.00% 5 Missing and 2 partials ⚠️
.../local/customobject/TupleIntSketchAccumulator.java 69.56% 6 Missing and 1 partial ⚠️
...AvgValueIntegerTupleSketchAggregationFunction.java 0.00% 6 Missing ⚠️
...gment/local/customobject/CpcSketchAccumulator.java 73.68% 4 Missing and 1 partial ⚠️
...nctCountIntegerTupleSketchAggregationFunction.java 0.00% 4 Missing ⚠️
...umValuesIntegerTupleSketchAggregationFunction.java 0.00% 4 Missing ⚠️
.../DistinctCountRawCPCSketchAggregationFunction.java 0.00% 3 Missing ⚠️
...org/apache/pinot/core/common/ObjectSerDeUtils.java 96.77% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12685      +/-   ##
============================================
- Coverage     61.75%   61.42%   -0.33%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2455      +19     
  Lines        133233   134014     +781     
  Branches      20636    20752     +116     
============================================
+ Hits          82274    82323      +49     
- Misses        44911    45526     +615     
- Partials       6048     6165     +117     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.36% <30.20%> (-0.35%) ⬇️
java-21 61.30% <30.20%> (-0.32%) ⬇️
skip-bytebuffers-false 61.39% <30.20%> (-0.36%) ⬇️
skip-bytebuffers-true 61.27% <30.20%> (+33.55%) ⬆️
temurin 61.42% <30.20%> (-0.33%) ⬇️
unittests 61.42% <30.20%> (-0.33%) ⬇️
unittests1 46.30% <11.80%> (-0.59%) ⬇️
unittests2 27.74% <18.40%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Jackie-Jiang Jackie-Jiang added enhancement backward-incompat Referenced by PRs that introduce or fix backward compat issues documentation release-notes Referenced by PRs that need attention when compiling the next release notes labels Mar 21, 2024
@Jackie-Jiang
Copy link
Contributor

Can you please describe more about the backward incompatibility, and also the enhancement brought by this PR?

@davecromberge
Copy link
Member Author

davecromberge commented Mar 22, 2024

@Jackie-Jiang I will update the description with more details as to why this change is necessary for Datasketches.

Concerning backward-incompatibility - I think it is probably more accurate to mark the PR as upgrade-incompatibility.

This concern only applies to Tuple Sketches, which I believe are unused. During the upgrade process, upgraded servers might return accumulator objects to brokers on a current(older) version which still expect sketches. Or, upgraded brokers might expect accumulator objects and servers on the current(older) version might return sketches.
This scenario played out previously for Theta sketches and the workaround is to do a runtime type check on intermediate results within the aggregation function - see #12288.

This PR does a runtime type check and conversion for CPC sketches as they are more likely to be used, and are in fact being used in our production environment. However, this is not the case for Tuple sketches because of the prior implementation.

In order to verify these cases in the compatibility verifier, we need to have more than one server as part of the test cluster (see #12296).

@Jackie-Jiang Jackie-Jiang merged commit 8404bc8 into apache:master Mar 25, 2024
19 checks passed
@Jackie-Jiang
Copy link
Contributor

Can you please help update the Pinot doc about this function?

@davecromberge
Copy link
Member Author

Can you please help update the Pinot doc about this function?

@Jackie-Jiang here is the accompanying pull request to the documentation site:
pinot-contrib/pinot-docs#314

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backward-incompat Referenced by PRs that introduce or fix backward compat issues documentation enhancement release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants