-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Closed
Labels
enhancementNew feature or requestNew feature or request
Description
Is your feature request related to a problem or challenge?
- related to Should
PruningPredicate
coerce? #14944
The usecase is a predicate like this, where month_id is an integer:
month_id = '202502'
In DataFusion this results in casting the month_id
column to a Utf8
and then is compared to the string '202502'
This is non ideal for at least 3 reasons:
- Converting many rows to strings is expensive
- Comparing strings is much slower than comparing integers
- many data sources (like parquet) only handle predicates in the form of
<col> <op> <const>
and notcast(<col>) <op> <const>
so these predicates can be pushed down
Here is an example (note the predicate in the plan below is CAST(foo.month_id AS Utf8) = Utf8("2024")
rather than foo.month_id = Int32("2024")
):
DataFusion CLI v46.0.0
> create table foo(month_id int) as values (1), (2), (3);
0 row(s) fetched.
Elapsed 0.003 seconds.
> explain select * from foo where month_id = '2024';
+---------------+-------------------------------------------------------+
| plan_type | plan |
+---------------+-------------------------------------------------------+
| logical_plan | Filter: CAST(foo.month_id AS Utf8) = Utf8("2024") |
| | TableScan: foo projection=[month_id] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: CAST(month_id@0 AS Utf8) = 2024 |
| | DataSourceExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+-------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.008 seconds.
Describe the solution you'd like
I would like the filter in the plan above to be FilterExec: month_id = 2024
(no cast on month_id)
Other notes:
- I think this applies to
=
and!=
- I don't think it can be done for
<
,<=
,>=
and>
as the semantics for comparing ints and strings is different than equality.
Describe alternatives you've considered
We already have something called unwrap_in_cast
that does this transformation
@jayzhan211 has moved this code in this PR
I think once that PR merged, that would be the natural place to add this optimization
Additional context
No response
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request