feat: implement TDigest for approx quantile
Adds a [TDigest] implementation providing approximate quantile
estimations of large inputs using a small amount of (bounded) memory.

A TDigest is most accurate near either "end" of the quantile range (that
is, 0.1, 0.9, 0.95, etc.) due to the use of a scaling function that
increases resolution at the tails. The paper claims single-digit
parts-per-million errors for q ≤ 0.001 or q ≥ 0.999 using 100 centroids,
and in practice I have found accuracy to be more than acceptable for an
approximate function across the entire quantile range.
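
To illustrate why a scaling function gives finer resolution at the tails,
here is a minimal sketch using the k1 scale function from the t-digest
paper, k1(q) = (delta / 2π) · asin(2q − 1). This is an assumption for
illustration only: the code in this commit may use a different scaling
function, and the names `k1`, `k1_inv`, `max_cluster_span` and the choice
of `delta = 100.0` are hypothetical, not taken from the implementation.

```rust
use std::f64::consts::PI;

/// Scale function k1 from the t-digest paper (assumed here for illustration).
/// Maps a quantile q in [0, 1] to a cluster index k in [-delta/4, delta/4].
fn k1(q: f64, delta: f64) -> f64 {
    delta / (2.0 * PI) * (2.0 * q - 1.0).asin()
}

/// Inverse of k1: maps a cluster index k back to a quantile.
fn k1_inv(k: f64, delta: f64) -> f64 {
    ((2.0 * PI * k / delta).sin() + 1.0) / 2.0
}

/// Widest quantile span that a single cluster starting at q may cover,
/// given the rule that one cluster spans at most one unit of k.
fn max_cluster_span(q: f64, delta: f64) -> f64 {
    // Clamp so that we never step past the top of the k range (q = 1).
    let k_end = (k1(q, delta) + 1.0).min(delta / 4.0);
    k1_inv(k_end, delta) - q
}

fn main() {
    // Hypothetical compression parameter, not a value from this commit.
    let delta = 100.0;
    for q in [0.001, 0.01, 0.5] {
        println!("q = {}: max span = {:.6}", q, max_cluster_span(q, delta));
    }
}
```

Under these assumptions, a cluster starting near q = 0.001 may cover a much
narrower slice of the quantile range than one starting at the median, so
estimates near the tails are interpolated over far fewer points.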

The implementation is a modified copy of
https://github.com/MnO2/t-digest, itself a Rust port of [Facebook's C++
implementation]. Both Facebook's implementation and MnO2's Rust port
are Apache 2.0 licensed.

[TDigest]: https://arxiv.org/abs/1902.04023
[Facebook's C++ implementation]: https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h
domodwyer committed Jan 11, 2022
1 parent b05feda commit b72d21c
Showing 2 changed files with 819 additions and 0 deletions.
1 change: 1 addition & 0 deletions datafusion/src/physical_plan/mod.rs
@@ -655,6 +655,7 @@ pub mod sort;
 pub mod sort_preserving_merge;
 pub mod stream;
 pub mod string_expressions;
+pub(crate) mod tdigest;
 pub mod type_coercion;
 pub mod udaf;
 pub mod udf;