Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed #1700

alamb · 2022-01-28T17:26:46Z

Which issue does this PR close?

This diff looks large, but a lot is whitespace -- check out whitespace blind diff https://github.com/apache/arrow-datafusion/pull/1700/files?w=1 for an easier time understanding what actually changed

Rationale for this change

There are several reasonable usecases for creating PhysicalExprs from an Expr outside the context of DefaultPhysicalPlanner
such as predicate pruning and constant evaluation.

To create a PhysicalExpr on main, requires ExecutionContextState which is non trivially large (several HashMaps, and Vecs, and newly a MemoryManager and DiskManager).

This means that the overhead of creating physical expressions is quite large and getting larger!

What changes are included in this PR?

Changes:

Pull the PhysicalExpr creation logic into a free function that requires ExecutionProps rather than an entire ExecutionContextState
Move var provider from ExecutionContextState into ExecutionProps and create the hash table on demand, but leave the interface on ExecutionContext the same
Refactor pruning and constant evaluation to avoid creating ExecutionContextState (the real reason for this PR)

I think in a follow on PR I would move the code into its own module but for this one I left it in the same one for easier diffing

Are there any user-facing changes?

alamb · 2022-01-28T17:29:56Z

datafusion/src/execution/context.rs

@@ -324,8 +323,8 @@ impl ExecutionContext {
        self.state


The pub fn register_variable( API remains the same

alamb · 2022-01-28T17:30:16Z

datafusion/src/execution/context.rs

@@ -1148,8 +1180,6 @@ pub struct ExecutionContextState {
    pub catalog_list: Arc<dyn CatalogList>,
    /// Scalar functions that are registered with the context
    pub scalar_functions: HashMap<String, Arc<ScalarUDF>>,
-    /// Variable provider that are registered with the context
-    pub var_provider: HashMap<VarType, Arc<dyn VarProvider + Send + Sync>>,


while technically this is a breaking change, the ExecutionContext::register_variable remains the same which I think means this change will be minimally disruptive to people using DataFusion

alamb · 2022-01-28T17:31:15Z

datafusion/src/optimizer/simplify_expressions.rs

@@ -238,13 +238,12 @@ pub struct ConstEvaluator {
    /// descendants) so this Expr can be evaluated
    can_evaluate: Vec<bool>,

-    ctx_state: ExecutionContextState,


Here is reason 1 for this change: We no longer have to create a whole new ctx_state and planner for each potential constant evaluation 🎉

alamb · 2022-01-28T17:32:10Z

datafusion/src/physical_optimizer/pruning.rs

-        let execution_context_state = ExecutionContextState::new();
-        let predicate_expr = DefaultPhysicalPlanner::default().create_physical_expr(
+
+        // TODO allow these properties to be passed in


Here is reason 2 for this change: again we save a whole new ctx_state and planner for each potential constant evaluation 🎉

alamb · 2022-01-28T17:33:41Z

datafusion/src/physical_plan/planner.rs

@@ -299,12 +299,11 @@ impl PhysicalPlanner for DefaultPhysicalPlanner {
        input_schema: &Schema,
        ctx_state: &ExecutionContextState,
    ) -> Result<Arc<dyn PhysicalExpr>> {
-        DefaultPhysicalPlanner::create_physical_expr(
-            self,
+        create_physical_expr(


The actual interface for PhysicalPlanner remains the same; However it is now also possible to call the free function create_physical_expr directly too

alamb · 2022-01-28T17:34:52Z

datafusion/src/physical_plan/planner.rs

-                        _ => Err(DataFusionError::Plan(
-                            "No system variable provider found".to_string(),
-                        )),
+/// Create a physical expression from a logical expression ([Expr])


again, whitespace blind diff is the best to see -- this change makes this a free function and changes the parameters from

ctx_state: &ExecutionContextState,

to

execution_props: &ExecutionProps,

alamb · 2022-01-28T18:21:40Z

cc @yjshen

yjshen

LGTM

yjshen · 2022-01-29T09:02:26Z

datafusion/src/execution/context.rs

@@ -1115,9 +1114,14 @@ impl ExecutionConfig {
 /// An instance of this struct is created each time a [`LogicalPlan`] is prepared for
 /// execution (optimized). If the same plan is optimized multiple times, a new
 /// `ExecutionProps` is created each time.
+///
+/// It is important that this structure be cheap to create as it is
+/// done so during predicate pruning and expression simplification


* feat: add join type for logical plan display (#1674) * (minor) Reduce memory manager and disk manager logs from `info!` to `debug!` (#1689) * Move `information_schema` tests out of execution/context.rs to `sql_integration` tests (#1684) * Move tests from context.rs to information_schema.rs * Fix up tests to compile * Move timestamp related tests out of context.rs and into sql integration test (#1696) * Move some tests out of context.rs and into sql * Move support test out of context.rs and into sql tests * Fixup tests and make them compile * Fix parquet projection * fix pruning casting * fix test based on debug strings * revert read_spill method by getting schema from file * Add `MemTrackingMetrics` to ease memory tracking for non-limited memory consumers (#1691) * Memory manager no longer track consumers, update aggregatedMetricsSet * Easy memory tracking with metrics * use tracking metrics in SPMS * tests * fix * doc * Update datafusion/src/physical_plan/sorts/sort.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * make tracker AtomicUsize Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Implement TableProvider for DataFrameImpl (#1699) * Add TableProvider impl for DataFrameImpl * Add physical plan in * Clean up plan construction and names construction * Remove duplicate comments * Remove unused parameter * Add test * Remove duplicate limit comment * Use cloned instead of individual clone * Reduce the amount of code to get a schema Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Add comments to test * Fix plan comparison * Compare only the results of execution * Remove println * Refer to df_impl instead of table in test Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Fix the register_table test to use the correct result set for comparison * Consolidate group/agg exprs * Format * Remove outdated comment Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * refine test in repartition.rs & coalesce_batches.rs (#1707) * Fuzz test for spillable sort (#1706) * Lazy TempDir creation in DiskManager (#1695) * Incorporate dyn scalar kernels (#1685) * Rebase * impl ToNumeric for ScalarValue * Update macro to be based on * Add floats * Cleanup * Newline * add annotation for select_to_plan (#1714) * Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed (#1700) * Change physical_expr creation API * Refactor API usage to avoid creating ExecutionContextState * Fixup ballista * clippy! * Fix can not load parquet table form spark in datafusion-cli. (#1665) * fix can not load parquet table form spark * add Invalid file in log. * fix fmt * add upper bound for pub fn (#1713) Signed-off-by: remzi <13716567376yh@gmail.com> * Create SchemaAdapter trait to map table schema to file schemas (#1709) * Create SchemaAdapter trait to map table schema to file schemas * Linting fix * Remove commented code * approx_quantile() aggregation function (#1539) * feat: implement TDigest for approx quantile Adds a [TDigest] implementation providing approximate quantile estimations of large inputs using a small amount of (bounded) memory. A TDigest is most accurate near either "end" of the quantile range (that is, 0.1, 0.9, 0.95, etc) due to the use of a scalaing function that increases resolution at the tails. The paper claims single digit part per million errors for q ≤ 0.001 or q ≥ 0.999 using 100 centroids, and in practice I have found accuracy to be more than acceptable for an apprixmate function across the entire quantile range. The implementation is a modified copy of https://github.com/MnO2/t-digest, itself a Rust port of [Facebook's C++ implementation]. Both Facebook's implementation, and Mn02's Rust port are Apache 2.0 licensed. [TDigest]: https://arxiv.org/abs/1902.04023 [Facebook's C++ implementation]: https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h * feat: approx_quantile aggregation Adds the ApproxQuantile physical expression, plumbing & test cases. The function signature is: approx_quantile(column, quantile) Where column can be any numeric type (that can be cast to a float64) and quantile is a float64 literal between 0 and 1. * feat: approx_quantile dataframe function Adds the approx_quantile() dataframe function, and exports it in the prelude. * refactor: bastilla approx_quantile support Adds bastilla wire encoding for approx_quantile. Adding support for this required modifying the AggregateExprNode proto message to support propigating multiple LogicalExprNode aggregate arguments - all the existing aggregations take a single argument, so this wasn't needed before. This commit adds "repeated" to the expr field, which I believe is backwards compatible as described here: https://developers.google.com/protocol-buffers/docs/proto3#updating Specifically, adding "repeated" to an existing message field: "For ... message fields, optional is compatible with repeated" No existing tests needed fixing, and a new roundtrip test is included that covers the change to allow multiple expr. * refactor: use input type as return type Casts the calculated quantile value to the same type as the input data. * fixup! refactor: bastilla approx_quantile support * refactor: rebase onto main * refactor: validate quantile value Ensures the quantile values is between 0 and 1, emitting a plan error if not. * refactor: rename to approx_percentile_cont * refactor: clippy lints * suppport bitwise and as an example (#1653) * suppport bitwise and as an example * Use $OP in macro rather than `&` * fix: change signature to &dyn Array * fmt Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * fix: substr - correct behaivour with negative start pos (#1660) * minor: fix cargo run --release error (#1723) * Convert boolean case expressions to boolean logic (#1719) * Convert boolean case expressions to boolean logic * Review feedback * substitute `parking_lot::Mutex` for `std::sync::Mutex` (#1720) * Substitute parking_lot::Mutex for std::sync::Mutex * enable parking_lot feature in tokio * Add Expression Simplification API (#1717) * Add Expression Simplification API * fmt * use from_slice(&[T]) instead of from_slice(Vec<T>) to prevent future merge conflicts * fix decimal add because arrow2 doesn't include decimal add in arithmetics::add * fix decimal scale for cast test * fix parquet file format adapted projection by providing the proper schema to the RecordBatch Co-authored-by: xudong.w <wxd963996380@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com> Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com> Co-authored-by: Matthew Turner <matthew.m.turner@outlook.com> Co-authored-by: Yang <37145547+Ted-Jiang@users.noreply.github.com> Co-authored-by: Remzi Yang <59198230+HaoYang670@users.noreply.github.com> Co-authored-by: Dan Harris <1327726+thinkharderdev@users.noreply.github.com> Co-authored-by: Dom <dom@itsallbroken.com> Co-authored-by: Kun Liu <liukun@apache.org> Co-authored-by: Dmitry Patsura <talk@dmtry.me> Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>

* feat: add join type for logical plan display (#1674) * (minor) Reduce memory manager and disk manager logs from `info!` to `debug!` (#1689) * Move `information_schema` tests out of execution/context.rs to `sql_integration` tests (#1684) * Move tests from context.rs to information_schema.rs * Fix up tests to compile * Move timestamp related tests out of context.rs and into sql integration test (#1696) * Move some tests out of context.rs and into sql * Move support test out of context.rs and into sql tests * Fixup tests and make them compile * Add `MemTrackingMetrics` to ease memory tracking for non-limited memory consumers (#1691) * Memory manager no longer track consumers, update aggregatedMetricsSet * Easy memory tracking with metrics * use tracking metrics in SPMS * tests * fix * doc * Update datafusion/src/physical_plan/sorts/sort.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * make tracker AtomicUsize Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Implement TableProvider for DataFrameImpl (#1699) * Add TableProvider impl for DataFrameImpl * Add physical plan in * Clean up plan construction and names construction * Remove duplicate comments * Remove unused parameter * Add test * Remove duplicate limit comment * Use cloned instead of individual clone * Reduce the amount of code to get a schema Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Add comments to test * Fix plan comparison * Compare only the results of execution * Remove println * Refer to df_impl instead of table in test Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Fix the register_table test to use the correct result set for comparison * Consolidate group/agg exprs * Format * Remove outdated comment Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * refine test in repartition.rs & coalesce_batches.rs (#1707) * Fuzz test for spillable sort (#1706) * Lazy TempDir creation in DiskManager (#1695) * Incorporate dyn scalar kernels (#1685) * Rebase * impl ToNumeric for ScalarValue * Update macro to be based on * Add floats * Cleanup * Newline * add annotation for select_to_plan (#1714) * Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed (#1700) * Change physical_expr creation API * Refactor API usage to avoid creating ExecutionContextState * Fixup ballista * clippy! * Fix can not load parquet table form spark in datafusion-cli. (#1665) * fix can not load parquet table form spark * add Invalid file in log. * fix fmt * add upper bound for pub fn (#1713) Signed-off-by: remzi <13716567376yh@gmail.com> * Create SchemaAdapter trait to map table schema to file schemas (#1709) * Create SchemaAdapter trait to map table schema to file schemas * Linting fix * Remove commented code * approx_quantile() aggregation function (#1539) * feat: implement TDigest for approx quantile Adds a [TDigest] implementation providing approximate quantile estimations of large inputs using a small amount of (bounded) memory. A TDigest is most accurate near either "end" of the quantile range (that is, 0.1, 0.9, 0.95, etc) due to the use of a scalaing function that increases resolution at the tails. The paper claims single digit part per million errors for q ≤ 0.001 or q ≥ 0.999 using 100 centroids, and in practice I have found accuracy to be more than acceptable for an apprixmate function across the entire quantile range. The implementation is a modified copy of https://github.com/MnO2/t-digest, itself a Rust port of [Facebook's C++ implementation]. Both Facebook's implementation, and Mn02's Rust port are Apache 2.0 licensed. [TDigest]: https://arxiv.org/abs/1902.04023 [Facebook's C++ implementation]: https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h * feat: approx_quantile aggregation Adds the ApproxQuantile physical expression, plumbing & test cases. The function signature is: approx_quantile(column, quantile) Where column can be any numeric type (that can be cast to a float64) and quantile is a float64 literal between 0 and 1. * feat: approx_quantile dataframe function Adds the approx_quantile() dataframe function, and exports it in the prelude. * refactor: bastilla approx_quantile support Adds bastilla wire encoding for approx_quantile. Adding support for this required modifying the AggregateExprNode proto message to support propigating multiple LogicalExprNode aggregate arguments - all the existing aggregations take a single argument, so this wasn't needed before. This commit adds "repeated" to the expr field, which I believe is backwards compatible as described here: https://developers.google.com/protocol-buffers/docs/proto3#updating Specifically, adding "repeated" to an existing message field: "For ... message fields, optional is compatible with repeated" No existing tests needed fixing, and a new roundtrip test is included that covers the change to allow multiple expr. * refactor: use input type as return type Casts the calculated quantile value to the same type as the input data. * fixup! refactor: bastilla approx_quantile support * refactor: rebase onto main * refactor: validate quantile value Ensures the quantile values is between 0 and 1, emitting a plan error if not. * refactor: rename to approx_percentile_cont * refactor: clippy lints * suppport bitwise and as an example (#1653) * suppport bitwise and as an example * Use $OP in macro rather than `&` * fix: change signature to &dyn Array * fmt Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * fix: substr - correct behaivour with negative start pos (#1660) * minor: fix cargo run --release error (#1723) * Convert boolean case expressions to boolean logic (#1719) * Convert boolean case expressions to boolean logic * Review feedback * substitute `parking_lot::Mutex` for `std::sync::Mutex` (#1720) * Substitute parking_lot::Mutex for std::sync::Mutex * enable parking_lot feature in tokio * Add Expression Simplification API (#1717) * Add Expression Simplification API * fmt * Add tests and CI for optional pyarrow module (#1711) * Implement other side of conversion * Add test workflow * Add (failing) tests * Get unit tests passing * Use python -m pip * Debug LD_LIBRARY_PATH * Set LIBRARY_PATH * Update help with better info * Update parking_lot requirement from 0.11 to 0.12 (#1735) Updates the requirements on [parking_lot](https://github.com/Amanieu/parking_lot) to permit the latest version. - [Release notes](https://github.com/Amanieu/parking_lot/releases) - [Changelog](https://github.com/Amanieu/parking_lot/blob/master/CHANGELOG.md) - [Commits](Amanieu/parking_lot@0.11.0...0.12.0) --- updated-dependencies: - dependency-name: parking_lot dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Prevent repartitioning of certain operator's direct children (#1731) (#1732) * Prevent repartitioning of certain operator's direct children (#1731) * Update ballista tests * Don't repartition children of RepartitionExec * Revert partition restriction on Repartition and Projection * Review feedback * Lint * API to get Expr's type and nullability without a `DFSchema` (#1726) * API to get Expr type and nullability without a `DFSchema` * Add test * publically export * Improve docs * Fix typos in crate documentation (#1739) * add `cargo check --release` to ci (#1737) * remote test * Update .github/workflows/rust.yml Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Move optimize test out of context.rs (#1742) * Move optimize test out of context.rs * Update * use clap 3 style args parsing for datafusion cli (#1749) * use clap 3 style args parsing for datafusion cli * upgrade cli version * Add partitioned_csv setup code to sql_integration test (#1743) * use ordered-float 2.10 (#1756) Signed-off-by: Andy Grove <agrove@apache.org> * #1768 Support TimeUnit::Second in hasher (#1769) * Support TimeUnit::Second in hasher * fix linter * format (#1745) * Create built-in scalar functions programmatically (#1734) * create build-in scalar functions programatically Signed-off-by: remzi <13716567376yh@gmail.com> * solve conflict Signed-off-by: remzi <13716567376yh@gmail.com> * fix spelling mistake Signed-off-by: remzi <13716567376yh@gmail.com> * rename to call_fn Signed-off-by: remzi <13716567376yh@gmail.com> * [split/1] split datafusion-common module (#1751) * split datafusion-common module * pyarrow * Update datafusion-common/README.md Co-authored-by: Andy Grove <agrove@apache.org> * Update datafusion/Cargo.toml * include publishing Co-authored-by: Andy Grove <agrove@apache.org> * fix: Case insensitive unquoted identifiers (#1747) * move dfschema and column (#1758) * add datafusion-expr module (#1759) * move column, dfschema, etc. to common module (#1760) * include window frames and operator into datafusion-expr (#1761) * move signature, type signature, and volatility to split module (#1763) * [split/10] split up expr for rewriting, visiting, and simplification traits (#1774) * split up expr for rewriting, visiting, and simplification * add docs * move built-in scalar functions (#1764) * split expr type and null info to be expr-schemable (#1784) * rewrite predicates before pushing to union inputs (#1781) * move accumulator and columnar value (#1765) * move accumulator and columnar value (#1762) * fix bad data type in test_try_cast_decimal_to_decimal * added projections for avro columns Co-authored-by: xudong.w <wxd963996380@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com> Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com> Co-authored-by: Matthew Turner <matthew.m.turner@outlook.com> Co-authored-by: Yang <37145547+Ted-Jiang@users.noreply.github.com> Co-authored-by: Remzi Yang <59198230+HaoYang670@users.noreply.github.com> Co-authored-by: Dan Harris <1327726+thinkharderdev@users.noreply.github.com> Co-authored-by: Dom <dom@itsallbroken.com> Co-authored-by: Kun Liu <liukun@apache.org> Co-authored-by: Dmitry Patsura <talk@dmtry.me> Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> Co-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: r.4ntix <r.4ntix@gmail.com> Co-authored-by: Jiayu Liu <Jimexist@users.noreply.github.com> Co-authored-by: Andy Grove <agrove@apache.org> Co-authored-by: Rich <jychen7@users.noreply.github.com> Co-authored-by: Marko Mikulicic <mmikulicic@gmail.com> Co-authored-by: Eduard Karacharov <13005055+korowa@users.noreply.github.com>

alamb added 2 commits January 28, 2022 12:19

Change physical_expr creation API

8ca425c

Refactor API usage to avoid creating ExecutionContextState

72eca34

github-actions bot added the datafusion Changes in the datafusion crate label Jan 28, 2022

Fixup ballista

8945948

alamb commented Jan 28, 2022

View reviewed changes

github-actions bot added the ballista label Jan 28, 2022

clippy!

d43a7cb

alamb mentioned this pull request Jan 28, 2022

Lazy TempDir creation in DiskManager #1695

Merged

yjshen approved these changes Jan 29, 2022

View reviewed changes

alamb merged commit 1caf52a into apache:master Jan 31, 2022

alamb deleted the alamb/physical_expr_api branch August 8, 2023 20:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed #1700

Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed #1700

alamb commented Jan 28, 2022 •

edited

Loading

alamb Jan 28, 2022

alamb Jan 28, 2022

alamb Jan 28, 2022

alamb Jan 28, 2022

alamb Jan 28, 2022

alamb Jan 28, 2022

alamb commented Jan 28, 2022

yjshen left a comment

yjshen Jan 29, 2022

Support create_physical_expr and ExecutionContextState or DefaultPhysicalPlanner for faster speed #1700

Support create_physical_expr and ExecutionContextState or DefaultPhysicalPlanner for faster speed #1700

Conversation

alamb commented Jan 28, 2022 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

alamb Jan 28, 2022

Choose a reason for hiding this comment

alamb Jan 28, 2022

Choose a reason for hiding this comment

alamb Jan 28, 2022

Choose a reason for hiding this comment

alamb Jan 28, 2022

Choose a reason for hiding this comment

alamb Jan 28, 2022

Choose a reason for hiding this comment

alamb Jan 28, 2022

Choose a reason for hiding this comment

alamb commented Jan 28, 2022

yjshen left a comment

Choose a reason for hiding this comment

yjshen Jan 29, 2022

Choose a reason for hiding this comment

Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed #1700

Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed #1700

alamb commented Jan 28, 2022 •

edited

Loading