Avoid casting columns when comparing ints and strings #15035

@alamb

Is your feature request related to a problem or challenge?

The use case is a predicate like this, where month_id is an integer column:

month_id = '202502'

In DataFusion, this results in the month_id column being cast to Utf8 and then compared to the string '202502'.

This is not ideal for at least three reasons:

  1. Converting many rows to strings is expensive
  2. Comparing strings is much slower than comparing integers
  3. Many data sources (like parquet) only handle predicates of the form <col> <op> <const>, not cast(<col>) <op> <const>, so these predicates cannot be pushed down
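
To illustrate the third point, here is a minimal Rust sketch of the kind of shape check a data source might apply before accepting a predicate for pushdown. The `Expr` enum and the `is_pushable` function are hypothetical names made up for this illustration; they are not DataFusion or parquet APIs, and a real source may accept more forms than this.

```rust
// Hypothetical, heavily simplified expression tree (NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Int(i64),
    Str(String),
    CastToStr(Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

// Assumed rule: accept only `<col> = <const>`.
// Anything that wraps the column (such as a cast) is rejected.
fn is_pushable(expr: &Expr) -> bool {
    match expr {
        Expr::Eq(left, right) => {
            matches!(**left, Expr::Column(_))
                && matches!(**right, Expr::Int(_) | Expr::Str(_))
        }
        _ => false,
    }
}

fn main() {
    let col = || Box::new(Expr::Column("month_id".to_string()));

    // CAST(month_id AS Utf8) = '2024'
    let with_cast = Expr::Eq(
        Box::new(Expr::CastToStr(col())),
        Box::new(Expr::Str("2024".to_string())),
    );
    // month_id = 2024
    let without_cast = Expr::Eq(col(), Box::new(Expr::Int(2024)));

    println!("with cast pushable:    {}", is_pushable(&with_cast));
    println!("without cast pushable: {}", is_pushable(&without_cast));
}
```

Under this assumed rule, only the form without the cast qualifies, which is why removing the cast matters for pushdown.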

Here is an example (note the predicate in the plan below is CAST(foo.month_id AS Utf8) = Utf8("2024") rather than foo.month_id = Int32(2024)):

DataFusion CLI v46.0.0
> create table foo(month_id int) as values (1), (2), (3);
0 row(s) fetched.
Elapsed 0.003 seconds.

> explain select * from foo where month_id = '2024';
+---------------+-------------------------------------------------------+
| plan_type     | plan                                                  |
+---------------+-------------------------------------------------------+
| logical_plan  | Filter: CAST(foo.month_id AS Utf8) = Utf8("2024")     |
|               |   TableScan: foo projection=[month_id]                |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192           |
|               |   FilterExec: CAST(month_id@0 AS Utf8) = 2024         |
|               |     DataSourceExec: partitions=1, partition_sizes=[1] |
|               |                                                       |
+---------------+-------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.008 seconds.

Describe the solution you'd like

I would like the filter in the plan above to be FilterExec: month_id = 2024 (no cast on month_id)

Other notes:

  1. I think this applies to = and !=
  2. I don't think it can be done for <, <=, >= and >, as the semantics of comparing ints and strings differ for ordering: strings compare lexicographically, so '10' < '9' as strings even though 10 > 9 as integers.
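
The difference between the two notes above can be checked with a small Rust sketch. It assumes that a cast to string produces canonical decimal text and that strings are compared lexicographically; `cmp_as_int` and `cmp_as_str` are hypothetical helper names used only for this illustration.

```rust
use std::cmp::Ordering;

// Compare the values as integers.
fn cmp_as_int(a: i64, b: i64) -> Ordering {
    a.cmp(&b)
}

// Compare the values after converting both to decimal text,
// which is what a cast-to-string predicate effectively does.
fn cmp_as_str(a: i64, b: i64) -> Ordering {
    a.to_string().cmp(&b.to_string())
}

fn main() {
    // Ordering disagrees: 10 > 9 numerically, but "10" < "9" as text.
    assert_eq!(cmp_as_int(10, 9), Ordering::Greater);
    assert_eq!(cmp_as_str(10, 9), Ordering::Less);

    // Equality agrees when both sides are canonical decimal text.
    assert_eq!(cmp_as_int(2024, 2024), Ordering::Equal);
    assert_eq!(cmp_as_str(2024, 2024), Ordering::Equal);

    println!("ordering differs, equality agrees");
}
```

Note that equality only agrees when the string literal is in canonical form: a literal such as '02024' parses to 2024 but never equals the text produced by casting the integer 2024.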

Describe alternatives you've considered

We already have an optimizer rule, unwrap_cast_in_comparison, that does this kind of transformation

@jayzhan211 has moved this code in this PR

I think once that PR is merged, that would be the natural place to add this optimization
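
As a rough sketch of what the requested rewrite could look like, the Rust snippet below decides whether `CAST(col AS Utf8) <op> '<literal>'` can become `col <op> <int>`. The `Op` enum and `unwrap_string_literal` function are hypothetical and are not part of DataFusion. The round-trip check on the literal is an extra safety assumption of this sketch, not something stated in the issue.

```rust
// Hypothetical comparison operators for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Eq,
    NotEq,
    Lt,
}

// Returns the integer literal to compare against if the cast can be removed,
// or None if the predicate should be left unchanged.
fn unwrap_string_literal(op: Op, literal: &str) -> Option<i32> {
    // Only = and != are candidates; ordering operators keep string semantics.
    if !matches!(op, Op::Eq | Op::NotEq) {
        return None;
    }
    // The literal must parse as the column's integer type (Int32 here).
    let value: i32 = literal.parse().ok()?;
    // Assumption: only rewrite when the literal is canonical decimal text,
    // so that '02024' or '+5' are not treated as equal to 2024 or 5.
    if value.to_string() == literal {
        Some(value)
    } else {
        None
    }
}

fn main() {
    println!("{:?}", unwrap_string_literal(Op::Eq, "2024"));
    println!("{:?}", unwrap_string_literal(Op::NotEq, "202502"));
    println!("{:?}", unwrap_string_literal(Op::Lt, "2024"));
    println!("{:?}", unwrap_string_literal(Op::Eq, "02024"));
    println!("{:?}", unwrap_string_literal(Op::Eq, "abc"));
}
```

Returning None simply leaves the original predicate in place, so the sketch never changes results in the cases it does not understand.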

Additional context

No response

Labels

enhancement (New feature or request)