Add `approx_percentile_cont()` aggregation function #1539

domodwyer · 2022-01-10T14:53:22Z

Which issue does this PR close?

Closes #1538.

What changes are included in this PR?

A new approx_quantile() aggregation function.

This PR also includes support for aggregate functions with multiple arguments - all existing aggregations had a signature such as fn(column), whereas this fn requires approx_quantile(column, quantile).

This involved two main changes to existing code:

Support for TypeSignature::OneOf for aggregation functions
Changing the protobuf format to allow AggregateExprNode to carry multiple LogicalExprNode (see commit messages)

I'd like to draw attention to the fact this TDigest impl is largely taken from (the Apache 2.0 licensed) work by MnO2 and modified to fit datafusion efficiently - itself a port of Facebook's C++ implementation. I've included this information in a comment in the file, but let me know if I should provide attribution in some other way too!

Are there any user-facing changes?

A new approx_quantile() aggregation is available to the user, both via SQL and the dataframe API.

I added the approx_quantile() aggregation to the prelude - let me know if that's not right.

Future

I did not spend much time trying to optimise this, and I suspect it can be polished up a bit, but I wanted to land something working first. I may also investigate the uddsketch algorithm in the future and potentially swap out the tdigest approach used here if it proves to be superior.

Should I include documenting this new fn anywhere as part of the PR?

feat: implement TDigest for approx quantile (b72d21c)

Adds a [TDigest] implementation providing approximate quantile estimations of
large inputs using a small amount of (bounded) memory.

A TDigest is most accurate near either "end" of the quantile range (that is,
0.1, 0.9, 0.95, etc) due to the use of a scaling function that increases
resolution at the tails. The paper claims single digit part per million errors
for q ≤ 0.001 or q ≥ 0.999 using 100 centroids, and in practice I have found
accuracy to be more than acceptable for an approximate function across the
entire quantile range.

The implementation is a modified copy of https://github.com/MnO2/t-digest,
itself a Rust port of [Facebook's C++ implementation]. Both Facebook's
implementation and MnO2's Rust port are Apache 2.0 licensed.

[TDigest]: https://arxiv.org/abs/1902.04023
[Facebook's C++ implementation]:
https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h
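To illustrate why a scaling function gives more resolution at the tails, here is a minimal sketch of the k1 scale function described in the t-digest paper. It is an illustration only: the code in this PR (and the implementations it derives from) may use a different scale function, and the names `k1`, `k_span` and the value of `delta` are assumptions made for this example. A centroid may cover at most one unit of k, so a quantile interval that spans more k units must be split across more centroids.

```rust
use std::f64::consts::PI;

/// k1 scale function from the t-digest paper: maps a quantile `q` in [0, 1]
/// to a position on the "k" axis. `delta` is the compression parameter.
fn k1(q: f64, delta: f64) -> f64 {
    delta / (2.0 * PI) * (2.0 * q - 1.0).asin()
}

/// Distance on the k axis covered by the quantile interval [lo, hi].
/// A larger span means the interval is split across more centroids.
fn k_span(lo: f64, hi: f64, delta: f64) -> f64 {
    k1(hi, delta) - k1(lo, delta)
}

fn main() {
    let delta = 100.0; // illustrative compression value
    // Two quantile intervals of equal width (0.01), one at the tail and one
    // around the median.
    let tail = k_span(0.001, 0.011, delta);
    let middle = k_span(0.495, 0.505, delta);
    println!("tail span: {:.3}, middle span: {:.3}", tail, middle);
}
```

The tail interval covers several times more of the k axis than the interval around the median, which is what "increases resolution at the tails" means in practice.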

feat: approx_quantile aggregation (d5fc006)

Adds the ApproxQuantile physical expression, plumbing & test cases.

The function signature is:

	approx_quantile(column, quantile)

Where column can be any numeric type (that can be cast to a float64) and 
quantile is a float64 literal between 0 and 1.

feat: approx_quantile dataframe function (8714293)

Adds the approx_quantile() dataframe function, and exports it in the prelude.

refactor: ballista approx_quantile support (b4ff5b3)

Adds ballista wire encoding for approx_quantile.

Adding support for this required modifying the AggregateExprNode proto message
to support propagating multiple LogicalExprNode aggregate arguments - all the
existing aggregations take a single argument, so this wasn't needed before.

This commit adds "repeated" to the expr field, which I believe is backwards
compatible as described here:

	https://developers.google.com/protocol-buffers/docs/proto3#updating

Specifically, adding "repeated" to an existing message field:

	"For ... message fields, optional is compatible with repeated"

No existing tests needed fixing, and a new roundtrip test is included that
covers the change to allow multiple expr.
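The compatibility claim can be sketched at the wire level. In the protobuf encoding, a message-typed field is written as a key byte, a length, and the payload, and a repeated field is simply that record written several times. The toy encoder and decoder below are assumptions for illustration only (they are not the generated prost code, the field number is hypothetical, and they only handle single-byte keys and lengths); they show that a message written with one `expr` reads back as a one-element list.

```rust
/// Encode one length-delimited field (wire type 2).
/// Simplification: field numbers below 16 and payloads under 128 bytes,
/// so the key and the length each fit in a single varint byte.
fn encode_len_delimited(field: u8, payload: &[u8]) -> Vec<u8> {
    assert!(field < 16 && payload.len() < 128);
    let mut out = vec![(field << 3) | 2, payload.len() as u8];
    out.extend_from_slice(payload);
    out
}

/// Collect every occurrence of `field`, as a reader of a `repeated` field
/// would. Assumes the buffer only contains records written by
/// `encode_len_delimited`.
fn decode_repeated(field: u8, mut buf: &[u8]) -> Vec<Vec<u8>> {
    let mut values = Vec::new();
    while buf.len() >= 2 {
        let (key, len) = (buf[0], buf[1] as usize);
        let payload = &buf[2..2 + len];
        if key == (field << 3) | 2 {
            values.push(payload.to_vec());
        }
        buf = &buf[2 + len..];
    }
    values
}

fn main() {
    const EXPR_FIELD: u8 = 2; // hypothetical field number

    // An "old" writer emits the field once.
    let old_message = encode_len_delimited(EXPR_FIELD, b"column");

    // A "new" writer emits the field once per aggregate argument.
    let mut new_message = encode_len_delimited(EXPR_FIELD, b"column");
    new_message.extend(encode_len_delimited(EXPR_FIELD, b"quantile"));

    println!("old: {} expr", decode_repeated(EXPR_FIELD, &old_message).len());
    println!("new: {} expr", decode_repeated(EXPR_FIELD, &new_message).len());
}
```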

refactor: use input type as return type (d72ca10)

Casts the calculated quantile value to the same type as the input data.

hntd187 · 2022-01-10T21:09:32Z

datafusion/src/physical_plan/mod.rs

@@ -640,6 +640,7 @@ pub mod sort;
 pub mod sort_preserving_merge;
 pub mod stream;
 pub mod string_expressions;
+pub(crate) mod tdigest;


Is there a reason this is pub(crate) while everything else is open?

I modelled approx_quantile() after approx_distinct() which uses HyperLogLog internally - the hyperloglog mod is also pub(crate) so I did the same - I can change tdigest to pub if that makes more sense?

Should hyperloglog also be pub?

I think pub(crate) is a good place to start, and we can make it pub later on if there is some usecase

ballista/rust/core/src/serde/logical_plan/mod.rs

datafusion/src/physical_plan/coercion_rule/aggregate_rule.rs

datafusion/src/physical_plan/aggregates.rs

Dandandan · 2022-01-11T21:14:29Z

datafusion/src/physical_plan/tdigest/mod.rs

+                    j += 1;
+                }
+                Ordering::Equal => {
+                    result.push(centroids[i].clone());


The branches are the same for less and equal so I think the whole match could be simplified to

	if centroids[i] <= centroids[j] {
		result.push(centroids[i].clone());
		i += 1;
	} else {
		result.push(centroids[j].clone());
		j += 1;
	}
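For context, the suggested simplification can be sketched as a complete merge routine. This is a standalone illustration and not the code in tdigest/mod.rs: the `Centroid` struct, the comparison on `mean`, and the use of two separate input slices are assumptions made so the example compiles on its own.

```rust
/// Hypothetical stand-in for the centroid type used by the digest.
#[derive(Clone, Debug, PartialEq)]
struct Centroid {
    mean: f64,
    weight: f64,
}

/// Merge two slices that are each sorted by mean into one sorted vector.
/// Ties take the element from `a` first, matching the `<=` in the suggestion.
fn merge_sorted(a: &[Centroid], b: &[Centroid]) -> Vec<Centroid> {
    let mut result = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i].mean <= b[j].mean {
            result.push(a[i].clone());
            i += 1;
        } else {
            result.push(b[j].clone());
            j += 1;
        }
    }
    // At most one of these still has elements left.
    result.extend_from_slice(&a[i..]);
    result.extend_from_slice(&b[j..]);
    result
}

fn main() {
    let a = vec![
        Centroid { mean: 1.0, weight: 1.0 },
        Centroid { mean: 3.0, weight: 1.0 },
    ];
    let b = vec![
        Centroid { mean: 2.0, weight: 1.0 },
        Centroid { mean: 3.0, weight: 2.0 },
    ];
    for c in merge_sorted(&a, &b) {
        println!("{:?}", c);
    }
}
```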

realno · 2022-01-12T07:17:59Z

This is a very useful feature, thank you 👍

DataFusion usually tries to match Postgres' function definitions. For quantiles, Postgres uses the term percentile and implements it in two ways, percentile_disc and percentile_cont, as specified here: https://www.postgresql.org/docs/9.4/functions-aggregate.html

It would be nice if we followed the same convention.

domodwyer · 2022-01-16T17:49:16Z

The T-Digest algorithm returns an interpolated result, so I think percentile_cont makes most sense - however T-digest is an approximation of the quantile. Should we include approx in the name to signify this? If so, would approx_percentile_cont be the desired name, keeping in line with the existing approx_distinct aggregate?
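To make the distinction concrete, here is a small sketch of the exact (non-approximate) versions of the two Postgres aggregates, as described in the Postgres documentation: percentile_cont interpolates between adjacent values, while percentile_disc returns an actual input value. The function names and signatures below are illustrative only and are not part of this PR; the input is assumed to be sorted already.

```rust
/// Exact continuous percentile: linearly interpolates between the two
/// neighbouring values. `sorted` must be in ascending order.
fn percentile_cont(sorted: &[f64], p: f64) -> Option<f64> {
    if sorted.is_empty() || !(0.0..=1.0).contains(&p) {
        return None;
    }
    let rank = p * (sorted.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    let frac = rank - lo as f64;
    Some(sorted[lo] + frac * (sorted[hi] - sorted[lo]))
}

/// Exact discrete percentile: the first input value whose position in the
/// ordering equals or exceeds the fraction `p`.
fn percentile_disc(sorted: &[f64], p: f64) -> Option<f64> {
    if sorted.is_empty() || !(0.0..=1.0).contains(&p) {
        return None;
    }
    let n = sorted.len();
    let idx = ((p * n as f64).ceil() as usize).max(1) - 1;
    Some(sorted[idx.min(n - 1)])
}

fn main() {
    let values = [10.0, 20.0, 30.0, 40.0];
    println!("cont: {:?}", percentile_cont(&values, 0.5));
    println!("disc: {:?}", percentile_disc(&values, 0.5));
}
```

For the median of an even number of values, the continuous form returns a value between the two middle inputs, whereas the discrete form returns one of them. An interpolating estimator such as a t-digest behaves like the former.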

realno · 2022-01-16T18:44:48Z

> The T-Digest algorithm returns an interpolated result, so I think percentile_cont makes most sense - however T-digest is an approximation of the quantile. Should we include approx in the name to signify this? If so, would approx_percentile_cont be the desired name, keeping in line with the existing approx_distinct aggregate?

This sounds reasonable. I suggest also checking whether anyone from the community is more familiar with the Postgres implementation. I remember reading somewhere that its quantile is more efficient than the median function, so I assume it uses something like KLL. If that's the case, it may not be a big deal whether we use approx or not.

datafusion/src/physical_plan/expressions/approx_quantile.rs

realno · 2022-01-19T04:12:43Z

datafusion/tests/sql/aggregates.rs

+// Column `c12` is omitted due to a large relative error (~10%) due to the small
+// float values.
+#[tokio::test]
+async fn csv_query_approx_quantile() -> Result<()> {


@alamb is this test sufficient for testing merge functions?

Merge appears to always be called, so it is definitely exercised - I don't know if this is sufficient coverage though!

Looked at it again, it should be fine since the compute logic is tested in the algo.

datafusion/src/physical_plan/aggregates.rs

datafusion/src/physical_plan/expressions/approx_quantile.rs

realno · 2022-01-19T05:18:26Z

datafusion/src/physical_plan/expressions/approx_quantile.rs

+            v => unreachable!("unexpected return type {:?}", v),
+        })
+    }
+}


I think we can use more unit tests here if necessary.

domodwyer · 2022-01-24T21:08:18Z

Thanks for the reviews - I shall make some time to rebase this and address the comments soon 👍

domodwyer · 2022-01-27T20:39:19Z

I have rebased onto master and made the requested changes - thanks for the reviews! 👍

I still need to rename approx_quantile() to approx_percentile_cont() - I just wanted to confirm this is the desired name before making the changes.

If no one objects, I shall do the rename in a couple of days!

Ensures the quantile value is between 0 and 1, emitting a plan error if not.
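The kind of check described can be sketched as follows. This is a minimal illustration, not the code added by the commit: the function name is hypothetical, a plain `String` stands in for DataFusion's plan error type, and treating both bounds as inclusive is an assumption.

```rust
/// Accept a quantile only if it lies in [0, 1]; NaN is rejected as well
/// because it is not contained in any range.
fn validate_quantile(q: f64) -> Result<f64, String> {
    if (0.0..=1.0).contains(&q) {
        Ok(q)
    } else {
        Err(format!(
            "quantile must be between 0.0 and 1.0 inclusive, got {}",
            q
        ))
    }
}

fn main() {
    println!("{:?}", validate_quantile(0.5));
    println!("{:?}", validate_quantile(1.5));
}
```

Rejecting the value while the plan is being built means an invalid literal is reported before any data is read.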

domodwyer · 2022-01-29T11:56:42Z

Done!

+-------------------------------------------+
| APPROXPERCENTILECONT(test.b,Float64(0.5)) |
+-------------------------------------------+
| 10                                        |
+-------------------------------------------+

This should be good for another review now - thanks all 👍

realno

LGTM, nice work, thanks! @domodwyer

alamb

👍 nice work @domodwyer . Thank you for the review @realno

I didn't check the math, but I reviewed the structure of this PR and tests.

I also took it for a test drive

cargo run --bin datafusion-cli

❯ create table t1 as select * from (values (11, 'a'), (22, 'b'), (33, 'c'), (44, 'd'), (77, 'e')) as sq;
0 rows in set. Query took 0.016 seconds.
❯ select * from t1;
+---------+---------+
| column1 | column2 |
+---------+---------+
| 11      | a       |
| 22      | b       |
| 33      | c       |
| 44      | d       |
| 77      | e       |
+---------+---------+
5 rows in set. Query took 0.008 seconds.
❯ select approx_percentile_cont(column1, 0.2) from t1;
+-----------------------------------------------+
| APPROXPERCENTILECONT(t1.column1,Float64(0.2)) |
+-----------------------------------------------+
| 16                                            |
+-----------------------------------------------+
1 row in set. Query took 0.012 seconds.
❯ select approx_percentile_cont(column1, 0.2) from t1 group by column2;
+-----------------------------------------------+
| APPROXPERCENTILECONT(t1.column1,Float64(0.2)) |
+-----------------------------------------------+
| 11                                            |
| 44                                            |
| 22                                            |
| 77                                            |
| 33                                            |
+-----------------------------------------------+
5 rows in set. Query took 0.011 seconds.

alamb · 2022-01-31T20:22:18Z

datafusion/src/physical_plan/mod.rs

@@ -640,6 +640,7 @@ pub mod sort;
 pub mod sort_preserving_merge;
 pub mod stream;
 pub mod string_expressions;
+pub(crate) mod tdigest;


I think pub(crate) is a good place to start, and we can make it pub later on if there is some usecase

github-actions bot added ballista datafusion Changes in the datafusion crate labels Jan 10, 2022

hntd187 reviewed Jan 10, 2022

liukun4515 reviewed Jan 11, 2022

ballista/rust/core/src/serde/logical_plan/mod.rs

liukun4515 reviewed Jan 11, 2022

datafusion/src/physical_plan/coercion_rule/aggregate_rule.rs

liukun4515 reviewed Jan 11, 2022

datafusion/src/physical_plan/aggregates.rs

domodwyer force-pushed the dom/tdigest branch 2 times, most recently from 066fcb8 to d72ca10 Compare January 11, 2022 19:57

Dandandan reviewed Jan 11, 2022

realno mentioned this pull request Jan 19, 2022

Add median, std, and corr functions #1486

Closed

realno reviewed Jan 19, 2022

domodwyer added 7 commits January 25, 2022 22:19

feat: approx_quantile aggregation

d9a7be2

feat: approx_quantile dataframe function

0cbacd1

refactor: use input type as return type

85af343

fixup! refactor: ballista approx_quantile support

e8f8e3f

refactor: rebase onto main

faa8094

domodwyer force-pushed the dom/tdigest branch from d72ca10 to ef546e4 Compare January 27, 2022 20:34

domodwyer added 2 commits January 29, 2022 11:38

refactor: validate quantile value

03a5eff

refactor: rename to approx_percentile_cont

c216f48

domodwyer force-pushed the dom/tdigest branch from ef546e4 to c216f48 Compare January 29, 2022 11:55

realno approved these changes Jan 29, 2022

houqp requested a review from Dandandan January 31, 2022 07:21

houqp requested a review from alamb January 31, 2022 07:21

houqp added the enhancement New feature or request label Jan 31, 2022

refactor: clippy lints

3612493

alamb approved these changes Jan 31, 2022

alamb merged commit cfb655d into apache:master Jan 31, 2022

alamb changed the title ~~approx_quantile() aggregation function~~ Add approx_quantile() aggregation function Feb 10, 2022

alamb changed the title ~~Add approx_quantile() aggregation function~~ Add approx_percentile_cont() aggregation function Aug 15, 2022

samuelcolvin mentioned this pull request Sep 19, 2024

Alias APPROX_PERCENTILE_CONT as PERCENTILE_CONT? #12533

Open
