[SPARK-54576][SQL] Add documentation for new Datasketches-based aggregate functions #53297
Closed
dtenedor wants to merge 1 commit into apache:master from
Contributor
Author
cc @cboumalh I added some Spark documentation for the new Datasketches-based aggregate functions we have so far. We can maybe keep extending this as we add new functions later.
Contributor
Just read the doc, it looks great. Thank you for adding this. Will add more info about tuple once we have that!
cboumalh approved these changes Dec 3, 2025
allisonwang-db approved these changes Dec 3, 2025
Contributor
Author
Thanks all for the review! I did another check now and everything looks OK. Merging to master.
What changes were proposed in this pull request?
This PR adds comprehensive documentation for Spark SQL's sketch-based approximate functions powered by the Apache DataSketches library. The new documentation page (sql-ref-sketch-aggregates.md) covers:
Function Reference:
- hll_sketch_agg, hll_union_agg, hll_sketch_estimate, hll_union
- theta_sketch_agg, theta_union_agg, theta_intersection_agg, theta_sketch_estimate, theta_union, theta_intersection, theta_difference
- kll_sketch_agg_*, kll_sketch_to_string_*, kll_sketch_get_n_*, kll_sketch_merge_*, kll_sketch_get_quantile_*, kll_sketch_get_rank_*
- approx_top_k_accumulate, approx_top_k_combine, approx_top_k_estimate
Best Practices:
Common Use Cases and Examples:
The PR also adds links to this new documentation page from:
- sql-ref-functions.md (under Aggregate-like Functions)
- sql-ref.md (under Functions section)
- _data/menu-sql.yaml (navigation menu)
Why are the changes needed?
Spark SQL has added several sketch-based approximate functions using the Apache DataSketches library (HLL sketches in 3.5.0, Theta/KLL/Top-K sketches in 4.1.0), but there was no comprehensive documentation explaining:
This documentation fills that gap and helps users understand the full power of sketch-based analytics in Spark SQL.
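As a rough illustration of why these function families come in aggregate, union, and estimate variants, the following is a toy K-Minimum-Values sketch in plain Python. It is not the Apache DataSketches algorithm and not Spark's implementation; the class name `ToyKmvSketch`, the helper `_hash01`, the SHA-256 hash, and the default `k=256` are all assumptions made for this example. It only aims to show the general pattern: build a small summary per data slice, merge summaries without revisiting the raw data, then estimate a distinct count from the merged summary.

```python
import hashlib
import heapq


def _hash01(value) -> float:
    """Hypothetical helper: map a value to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ToyKmvSketch:
    """Toy sketch that keeps only the k smallest distinct hash values seen."""

    def __init__(self, k=256, hashes=()):
        self.k = k
        # Deduplicate, then retain at most k of the smallest hashes.
        self.hashes = set(heapq.nsmallest(k, set(hashes)))

    @classmethod
    def build(cls, values, k=256):
        """Aggregate step: summarize a collection of raw values."""
        return cls(k, (_hash01(v) for v in values))

    def union(self, other):
        """Merge step: combine two summaries without the raw data."""
        k = min(self.k, other.k)
        return ToyKmvSketch(k, self.hashes | other.hashes)

    def estimate(self) -> float:
        """Estimate step: approximate the number of distinct values."""
        if len(self.hashes) < self.k:
            # Fewer than k distinct hashes seen, so the count is exact
            # (ignoring the negligible chance of a hash collision).
            return float(len(self.hashes))
        # Standard KMV estimator: (k - 1) / (k-th smallest hash).
        return (self.k - 1) / max(self.hashes)


if __name__ == "__main__":
    day1 = ToyKmvSketch.build(range(0, 60_000))
    day2 = ToyKmvSketch.build(range(40_000, 100_000))
    merged = day1.union(day2)
    direct = ToyKmvSketch.build(range(0, 100_000))
    # Merging per-slice summaries yields the same summary as one pass over everything.
    print(merged.hashes == direct.hashes)
    print(round(merged.estimate()))
```

In this toy, merging the two per-day summaries produces exactly the same retained hashes as summarizing all 100,000 values in one pass, which is the property that makes pre-aggregated sketches reusable across partitions or time windows. The estimate itself is approximate, with error shrinking as `k` grows.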
Does this PR introduce any user-facing change?
Yes, this PR adds new documentation pages that are user-facing. No code changes are included.
How was this patch tested?
Documentation-only change. The examples were verified against the existing function implementations and test cases in the codebase.
Was this patch authored or co-authored using generative AI tooling?
Yes, code assistance with claude-4.5-opus-high in combination with manual editing by the author.