-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
Is your feature request related to a problem or challenge?
As described by @ion-elgreco in #14944
Given a dataset with an Int64
column named month
, when a predicate such as the following is created
month_id = '202502'
In sql / dataframe queries, this will be simplified to the following. Note the constant is cast and the column is not cast:
month_id = cast('202502', 'Int64')
However, when using SessionContext::create_physical_expr
to create a physical expression directly, as is done in delta.rs and other systems like LanceDB, the expression looks like this (cast on the column)
cast(month_id, 'Int64') = '202502'
This is bad for two reasons:
PruningPredicate
can't handle this type of expression (and thus it can't be used to prune Parquet row groups)- Evaluating this filter is substantially slower as it has to apply a transformation to all values of
month_id
before it can evaluate the filter. And furthermore it does slow string comparison compared to faster int63 comparison
The reason this happens is that the conversion from cast(month_id, 'Int64') = '202502'
to month_id = Cast('202502', Int64)
happens in the Analyzer, specifically here:
datafusion/datafusion/optimizer/src/unwrap_cast_in_comparison.rs
Lines 39 to 77 in 2fcab2e
/// [`UnwrapCastInComparison`] attempts to remove casts from | |
/// comparisons to literals ([`ScalarValue`]s) by applying the casts | |
/// to the literals if possible. It is inspired by the optimizer rule | |
/// `UnwrapCastInBinaryComparison` of Spark. | |
/// | |
/// Removing casts often improves performance because: | |
/// 1. The cast is done once (to the literal) rather than to every value | |
/// 2. Can enable other optimizations such as predicate pushdown that | |
/// don't support casting | |
/// | |
/// The rule is applied to expressions of the following forms: | |
/// | |
/// 1. `cast(left_expr as data_type) comparison_op literal_expr` | |
/// 2. `literal_expr comparison_op cast(left_expr as data_type)` | |
/// 3. `cast(literal_expr) IN (expr1, expr2, ...)` | |
/// 4. `literal_expr IN (cast(expr1) , cast(expr2), ...)` | |
/// | |
/// If the expression matches one of the forms above, the rule will | |
/// ensure the value of `literal` is in range(min, max) of the | |
/// expr's data_type, and if the scalar is within range, the literal | |
/// will be casted to the data type of expr on the other side, and the | |
/// cast will be removed from the other side. | |
/// | |
/// # Example | |
/// | |
/// If the DataType of c1 is INT32. Given the filter | |
/// | |
/// ```text | |
/// Filter: cast(c1 as INT64) > INT64(10)` | |
/// ``` | |
/// | |
/// This rule will remove the cast and rewrite the expression to: | |
/// | |
/// ```text | |
/// Filter: c1 > INT32(10) | |
/// ``` | |
/// | |
#[derive(Default, Debug)] | |
pub struct UnwrapCastInComparison {} |
However, this pass is not run as part of SessionContext::create_physical_expr
Describe the solution you'd like
I would like the expressions crated by SessionContext::create_physical_expr
to have had their casts unwrapped as well
Describe alternatives you've considered
THe ideal solution in my mind is to remove the entire Analyzer pass and instead do the unwrap in comparisons as part of the expression simplification