Add base class for merging and accumulating custom objects #12685

davecromberge · 2024-03-21T09:03:05Z

Adds a base class for common capabilities to allow any custom object to use an internal array buffer at query time to offset the cost of expensive merges for more memory usage. The Apache Datasketches Theta, Tuple and Cpc sketches initially make use of this functionality.

The changes in this PR introduce a different intermediate result type for the Datasketches CPC and Tuple aggregation functions. The change has already been introduced for Theta sketches in #12042. The reason for using merges with more inputs so is to amortize the cost of intermediate bookkeeping datastructures necessary to perform unions on two sketches at a time. The end user is in control of the degree to which the internal inputs accumulate before being merged.

For Tuple sketches, the "early-stop" optimisation in set operations circumvents further processing when retained items fall above the minimum theta value (Broder rule). This applies to other set operation expressions as well.

When using JMH benchmarks to simulate this scenario, the speedup achieved by accumulating sketches prior to union is often an improvement by a factor of 3.

Reference:
apache/datasketches-java#326 (comment)
https://datasketches.apache.org/docs/Theta/ThetaSize.html

Note:
This PR should be tagged as upgrade-incompat for users who are making use of the ApacheDatasketches Tuple sketch.

Release notes:

additional parameters are now permitted for the distinctCountCpcSketch and distinctCountTupleSketch query aggregation functions to control merge thresholds.

This allows any custom object to make use of an internal array buffer at query time to offset the cost of expensive merges for more memory usage. The Apache Datasketches Theta, Tuple and Cpc sketches initially make use of this functionality.

codecov-commenter · 2024-03-21T09:39:39Z

Codecov Report

Attention: Patch coverage is 30.20833% with 201 lines in your changes are missing coverage. Please review.

Project coverage is 61.42%. Comparing base (59551e4) to head (4035d4e).
Report is 154 commits behind head on master.

Files	Patch %	Lines
...ion/DistinctCountCPCSketchAggregationFunction.java	2.17%	90 Missing ⚠️
...unction/IntegerTupleSketchAggregationFunction.java	2.63%	74 Missing ⚠️
...nt/local/customobject/CustomObjectAccumulator.java	75.00%	5 Missing and 2 partials ⚠️
.../local/customobject/TupleIntSketchAccumulator.java	69.56%	6 Missing and 1 partial ⚠️
...AvgValueIntegerTupleSketchAggregationFunction.java	0.00%	6 Missing ⚠️
...gment/local/customobject/CpcSketchAccumulator.java	73.68%	4 Missing and 1 partial ⚠️
...nctCountIntegerTupleSketchAggregationFunction.java	0.00%	4 Missing ⚠️
...umValuesIntegerTupleSketchAggregationFunction.java	0.00%	4 Missing ⚠️
.../DistinctCountRawCPCSketchAggregationFunction.java	0.00%	3 Missing ⚠️
...org/apache/pinot/core/common/ObjectSerDeUtils.java	96.77%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #12685      +/-   ##
============================================
- Coverage     61.75%   61.42%   -0.33%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2455      +19     
  Lines        133233   134014     +781     
  Branches      20636    20752     +116     
============================================
+ Hits          82274    82323      +49     
- Misses        44911    45526     +615     
- Partials       6048     6165     +117

Flag	Coverage Δ
custom-integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration	`<0.01% <0.00%> (-0.01%)`	⬇️
integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration2	`0.00% <0.00%> (ø)`
java-11	`61.36% <30.20%> (-0.35%)`	⬇️
java-21	`61.30% <30.20%> (-0.32%)`	⬇️
skip-bytebuffers-false	`61.39% <30.20%> (-0.36%)`	⬇️
skip-bytebuffers-true	`61.27% <30.20%> (+33.55%)`	⬆️
temurin	`61.42% <30.20%> (-0.33%)`	⬇️
unittests	`61.42% <30.20%> (-0.33%)`	⬇️
unittests1	`46.30% <11.80%> (-0.59%)`	⬇️
unittests2	`27.74% <18.40%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Jackie-Jiang · 2024-03-21T22:49:34Z

Can you please describe more about the backward incompatibility, and also the enhancement brought by this PR?

davecromberge · 2024-03-22T13:39:02Z

@Jackie-Jiang I will update the description with more details as to why this change is necessary for Datasketches.

Concerning backward-incompatibility - I think it is probably more accurate to mark the PR as upgrade-incompatibility.

This concern only applies to Tuple Sketches, which I believe are unused. During the upgrade process, upgraded servers might return accumulator objects to brokers on a current(older) version which still expect sketches. Or, upgraded brokers might expect accumulator objects and servers on the current(older) version might return sketches.
This scenario played out previously for Theta sketches and the workaround is to do a runtime type check on intermediate results within the aggregation function - see #12288.

This PR does a runtime type check and conversion for CPC sketches as they are more likely to be used, and are in fact being used in our production environment. However, this is not the case for Tuple sketches because of the prior implementation.

In order to verify these cases in the compatibility verifier, we need to have more than one server as part of the test cluster (see #12296).

Jackie-Jiang · 2024-03-25T23:34:04Z

Can you please help update the Pinot doc about this function?

davecromberge · 2024-04-09T13:58:24Z

Can you please help update the Pinot doc about this function?

@Jackie-Jiang here is the accompanying pull request to the documentation site:
pinot-contrib/pinot-docs#314

Add base class for custom object accumulator

4035d4e

This allows any custom object to make use of an internal array buffer at query time to offset the cost of expensive merges for more memory usage. The Apache Datasketches Theta, Tuple and Cpc sketches initially make use of this functionality.

Jackie-Jiang added enhancement backward-incompat Referenced by PRs that introduce or fix backward compat issues documentation release-notes Referenced by PRs that need attention when compiling the next release notes labels Mar 21, 2024

Jackie-Jiang approved these changes Mar 25, 2024

View reviewed changes

Jackie-Jiang merged commit 8404bc8 into apache:master Mar 25, 2024
19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add base class for merging and accumulating custom objects #12685

Add base class for merging and accumulating custom objects #12685

davecromberge commented Mar 21, 2024 •

edited

Loading

codecov-commenter commented Mar 21, 2024 •

edited

Loading

Jackie-Jiang commented Mar 21, 2024

davecromberge commented Mar 22, 2024 •

edited

Loading

Jackie-Jiang commented Mar 25, 2024

davecromberge commented Apr 9, 2024

Add base class for merging and accumulating custom objects #12685

Add base class for merging and accumulating custom objects #12685

Conversation

davecromberge commented Mar 21, 2024 • edited Loading

codecov-commenter commented Mar 21, 2024 • edited Loading

Codecov Report

Jackie-Jiang commented Mar 21, 2024

davecromberge commented Mar 22, 2024 • edited Loading

Jackie-Jiang commented Mar 25, 2024

davecromberge commented Apr 9, 2024

davecromberge commented Mar 21, 2024 •

edited

Loading

codecov-commenter commented Mar 21, 2024 •

edited

Loading

davecromberge commented Mar 22, 2024 •

edited

Loading