Add base class for merging and accumulating custom objects #12685
This allows any custom object to use an internal array buffer at query time, trading higher memory usage for fewer expensive merges. The Apache Datasketches Theta, Tuple and CPC sketches are the first to make use of this functionality.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #12685 +/- ##
============================================
- Coverage 61.75% 61.42% -0.33%
+ Complexity 207 198 -9
============================================
Files 2436 2455 +19
Lines 133233 134014 +781
Branches 20636 20752 +116
============================================
+ Hits 82274 82323 +49
- Misses 44911 45526 +615
- Partials 6048 6165 +117
Can you please describe more about the backward incompatibility, and also the enhancement brought by this PR?
@Jackie-Jiang I will update the description with more details on why this change is necessary for Datasketches. Concerning backward incompatibility: this concern applies only to Tuple sketches, which I believe are unused. During the upgrade process, upgraded servers might return accumulator objects to brokers on the current (older) version, which still expect sketches. Conversely, upgraded brokers might expect accumulator objects while servers on the current (older) version return sketches. This PR does a runtime type check and conversion for CPC sketches, as they are more likely to be used and are in fact being used in our production environment. However, this is not the case for Tuple sketches because of the prior implementation. To verify these cases in the compatibility verifier, we need more than one server in the test cluster (see #12296).
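The runtime type check and conversion mentioned in the reply above might look roughly like the following. This is only a sketch under stated assumptions: `LegacySketch`, `SketchAccumulator` and `IntermediateResultAdapter` are hypothetical stand-ins, not the classes used in this PR or in Datasketches.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a plain sketch returned by a node on the older version. */
final class LegacySketch {
    final long estimate;

    LegacySketch(long estimate) {
        this.estimate = estimate;
    }
}

/** Hypothetical stand-in for the accumulator-style intermediate result. */
final class SketchAccumulator {
    final List<LegacySketch> pending = new ArrayList<>();

    void apply(LegacySketch sketch) {
        pending.add(sketch);
    }
}

/** Hypothetical helper that accepts either intermediate type during a rolling upgrade. */
final class IntermediateResultAdapter {
    private IntermediateResultAdapter() {
    }

    static SketchAccumulator toAccumulator(Object intermediate) {
        // Upgraded peer: already an accumulator, use it as-is.
        if (intermediate instanceof SketchAccumulator) {
            return (SketchAccumulator) intermediate;
        }
        // Older peer: a bare sketch, so wrap it in a fresh accumulator.
        if (intermediate instanceof LegacySketch) {
            SketchAccumulator accumulator = new SketchAccumulator();
            accumulator.apply((LegacySketch) intermediate);
            return accumulator;
        }
        String type = intermediate == null ? "null" : intermediate.getClass().getName();
        throw new IllegalStateException("Unsupported intermediate result type: " + type);
    }
}
```

With a check like this on the receiving side, a mixed-version cluster can merge results regardless of which form the sender produced.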
Can you please help update the Pinot doc about this function?
@Jackie-Jiang here is the accompanying pull request to the documentation site:
Adds a base class with common capabilities that lets any custom object use an internal array buffer at query time, trading higher memory usage for fewer expensive merges. The Apache Datasketches Theta, Tuple and CPC sketches are the first to make use of this functionality.
The changes in this PR introduce a different intermediate result type for the Datasketches CPC and Tuple aggregation functions. The same change was already introduced for Theta sketches in #12042. The reason for merging more inputs at once is to amortize the cost of the intermediate bookkeeping data structures that would otherwise be needed to perform unions on two sketches at a time. The end user controls how many inputs accumulate internally before being merged.
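The accumulate-then-merge idea described above can be sketched as follows. This is a minimal illustration, not the base class added by this PR: the class name `ThresholdAccumulator`, its methods, and the `batchMerge` function are all assumptions made for the example, and a plain integer sum stands in for a sketch union.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical accumulator that buffers inputs and merges them in batches. */
final class ThresholdAccumulator<T> {
    private final int threshold;
    private final Function<List<T>, T> batchMerge;
    private final List<T> buffer = new ArrayList<>();
    private T merged;       // partial result; null until the first flush
    private int mergeCalls; // number of batch merges performed so far

    ThresholdAccumulator(int threshold, Function<List<T>, T> batchMerge) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
        this.batchMerge = batchMerge;
    }

    /** Buffers one input and merges only once the threshold is reached. */
    void apply(T input) {
        buffer.add(input);
        if (buffer.size() >= threshold) {
            flush();
        }
    }

    /** Absorbs another accumulator's partial result and pending inputs. */
    void merge(ThresholdAccumulator<T> other) {
        if (other.merged != null) {
            buffer.add(other.merged);
        }
        buffer.addAll(other.buffer);
        if (buffer.size() >= threshold) {
            flush();
        }
    }

    /** Merges anything still pending; returns null if nothing was ever applied. */
    T getResult() {
        flush();
        return merged;
    }

    int getMergeCalls() {
        return mergeCalls;
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Copy the batch so the merge function never sees later mutations.
        List<T> batch = new ArrayList<>(buffer);
        buffer.clear();
        if (merged != null) {
            batch.add(merged);
        }
        merged = batchMerge.apply(batch);
        mergeCalls++;
    }
}
```

A larger threshold means fewer, larger merges at the cost of holding more inputs in memory, which is the trade-off the merge threshold exposes to the end user.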
For Tuple sketches, the "early-stop" optimisation in set operations circumvents further processing when retained items fall above the minimum theta value (Broder rule). This applies to other set operation expressions as well.
In JMH benchmarks simulating this scenario, accumulating sketches prior to union often yields a speedup of about a factor of 3.
Reference:
apache/datasketches-java#326 (comment)
https://datasketches.apache.org/docs/Theta/ThetaSize.html
Note:
This PR should be tagged as upgrade-incompat for users who are making use of the Apache Datasketches Tuple sketch.
Release notes:
distinctCountCpcSketch and distinctCountTupleSketch query aggregation functions to control merge thresholds.