Skip to content

Dynamic Filter Pushdown is being applied to the wrong table #17196

@nuno-faria

Description

@nuno-faria

Describe the bug

In the following example, the Dynamic Filter Pushdown is built from t2 (k, v) and correctly pushed to t1 (k), to reduce the number of scanned rows from t1:

select *
from t1 -- 1M rows, will become the right side
join t2 on t1.k = t2.k
where t2.v = 1000000; -- 1 row, will become the left side

HashJoinExec join_type=Inner build_time=78.306µs, join_time=22.402µs
    DataSourceExec t2.parquet
        predicate=v@1 = 1000000
        output_rows=1
    DataSourceExec t1.parquet
        predicate=DynamicFilterPhysicalExpr [ k@0 >= 1000000 AND k@0 <= 1000000 ] -- dynamic filter added here
        output_rows=1 -- only returns 1 row

Now if the t2 side is greater than t1, then the order of the join switches, as the smaller relation becomes the build side. However, the Dynamic Filter Pushdown is still applied to the right relation (now t2) instead of t1:

select *
from t1 -- 1M rows, will become the left side
join t2 on t1.k = t2.k
where t2.v >= 1000000; -- 9M rows, will become the right side

HashJoinExec join_type=Inner build_time=54.3462ms, join_time=2.403µs -- join is more expensive
    DataSourceExec t1.parquet
        output_rows=1000000
    DataSourceExec t2.parquet
        predicate=v@1 >= 1000000 AND DynamicFilterPhysicalExpr [ k@0 >= 3 AND k@0 <= 999998 ] -- dynamic filter added here
        output_rows=0 -- should be 1, possibly due to issue https://github.com/apache/datafusion/issues/17188

To Reproduce

copy (select i as k from generate_series(1, 1000000) t(i)) to 't1.parquet';
copy (select i as k, i as v from generate_series(1, 10000000) t(i)) to 't2.parquet';
create external table t1 stored as parquet location 't1.parquet';
create external table t2 stored as parquet location 't2.parquet';

-- works
explain analyze select *
from t1
join t2 on t1.k = t2.k
where t2.v = 1000000;

-- pushed to the wrong relation
explain analyze select *
from t1
join t2 on t1.k = t2.k
where t2.v >= 1000000;

Expected behavior

Dynamic Filter applied to the correct table.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions