Skip to content

SessionContext::create_physical_expr does not unwrap casts (and thus is not always optimal) #14987

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

As described by @ion-elgreco in #14944

Given a dataset with an Int64 column named month, when a predicate such as the following is created

month_id = '202502'

In sql / dataframe queries, this will be simplified to the following. Note the constant is cast and the column is not cast:

month_id = cast('202502', 'Int64')

However, when using SessionContext::create_physical_expr to create a physical expression directly, as is done in delta.rs and other systems like LanceDB, the expression looks like this (cast on the column)

cast(month_id, 'Int64') = '202502'

This is bad for two reasons:

  1. PruningPredicate can't handle this type of expression (and thus it can't be used to prune Parquet row groups)
  2. Evaluating this filter is substantially slower as it has to apply a transformation to all values of month_id before it can evaluate the filter. And furthermore it does slow string comparison compared to faster int63 comparison

The reason this happens is that the conversion from cast(month_id, 'Int64') = '202502' to month_id = Cast('202502', Int64) happens in the Analyzer, specifically here:

/// [`UnwrapCastInComparison`] attempts to remove casts from
/// comparisons to literals ([`ScalarValue`]s) by applying the casts
/// to the literals if possible. It is inspired by the optimizer rule
/// `UnwrapCastInBinaryComparison` of Spark.
///
/// Removing casts often improves performance because:
/// 1. The cast is done once (to the literal) rather than to every value
/// 2. Can enable other optimizations such as predicate pushdown that
/// don't support casting
///
/// The rule is applied to expressions of the following forms:
///
/// 1. `cast(left_expr as data_type) comparison_op literal_expr`
/// 2. `literal_expr comparison_op cast(left_expr as data_type)`
/// 3. `cast(literal_expr) IN (expr1, expr2, ...)`
/// 4. `literal_expr IN (cast(expr1) , cast(expr2), ...)`
///
/// If the expression matches one of the forms above, the rule will
/// ensure the value of `literal` is in range(min, max) of the
/// expr's data_type, and if the scalar is within range, the literal
/// will be casted to the data type of expr on the other side, and the
/// cast will be removed from the other side.
///
/// # Example
///
/// If the DataType of c1 is INT32. Given the filter
///
/// ```text
/// Filter: cast(c1 as INT64) > INT64(10)`
/// ```
///
/// This rule will remove the cast and rewrite the expression to:
///
/// ```text
/// Filter: c1 > INT32(10)
/// ```
///
#[derive(Default, Debug)]
pub struct UnwrapCastInComparison {}

However, this pass is not run as part of SessionContext::create_physical_expr

Describe the solution you'd like

I would like the expressions crated by SessionContext::create_physical_expr to have had their casts unwrapped as well

Describe alternatives you've considered

THe ideal solution in my mind is to remove the entire Analyzer pass and instead do the unwrap in comparisons as part of the expression simplification

https://docs.rs/datafusion/latest/datafusion/optimizer/simplify_expressions/expr_simplifier/struct.ExprSimplifier.html

Additional context

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestperformanceMake DataFusion faster

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions