Merge adjacent filter rule for optimizer #2026

jackwener · 2022-03-16T17:12:00Z

Which issue does this PR close?

Closes #2016.

explain select c1, c2 from test where c3 = true and c2 = 0.000001;

Before

+---------------+-------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: #test.c1, #test.c2                                                                                                      |
|               |   Filter: #test.c3                                                                                                                  |
|               |     Filter: #test.c2 = Float64(0.000001)                                                                                            |
|               |       TableScan: test projection=Some([0, 1, 2]), filters=[#test.c3, #test.c2 = Float64(0.000001)]                                  |
| physical_plan | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                                                                                       |
|               |   CoalesceBatchesExec: target_batch_size=4096                                                                                       |
|               |     FilterExec: c3@2                                                                                                                |
|               |       CoalesceBatchesExec: target_batch_size=4096                                                                                   |
|               |         FilterExec: c2@1 = 0.000001                                                                                                 |
|               |           RepartitionExec: partitioning=RoundRobinBatch(8)                                                                          |
|               |             CsvExec: files=[/home/jakevin/code/arrow-datafusion/datafusion/tests/aggregate_simple.csv], has_header=true, limit=None |
|               |                                                                                                                                     |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------+

After

+---------------+---------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                            |
+---------------+---------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: #test.c1, #test.c2                                                                                                  |
|               |   Filter: #test.c3 AND #test.c2 = Float64(0.000001)                                                                             |
|               |     TableScan: test projection=Some([0, 1, 2]), filters=[#test.c3, #test.c2 = Float64(0.000001)]                                |
| physical_plan | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                                                                                   |
|               |   CoalesceBatchesExec: target_batch_size=4096                                                                                   |
|               |     FilterExec: c3@2 AND c2@1 = 0.000001                                                                                        |
|               |       RepartitionExec: partitioning=RoundRobinBatch(8)                                                                          |
|               |         CsvExec: files=[/home/jakevin/code/arrow-datafusion/datafusion/tests/aggregate_simple.csv], has_header=true, limit=None |
|               |                                                                                                                                 |
+---------------+---------------------------------------------------------------------------------------------------------------------------------+

Rationale for this change

Merge adjacent filters into a single filter.

Are there any user-facing changes?

None

jackwener · 2022-03-16T17:16:14Z

@alamb @Dandandan @houqp PTAL❤😃.

xudong963 · 2022-03-17T01:15:11Z

In fact, I doubt whether it really yields any gains. Could you please run some benchmarks for this case? Thanks @jackwener

jackwener · 2022-03-17T02:22:31Z

> In fact, I doubt whether it really yields any gains. Could you please run some benchmarks for this case? Thanks @jackwener

Let me explain this rule. It is a common, basic optimizer rule: PostgreSQL, MySQL, and CockroachDB all implement it.

For example, in PostgreSQL:

explain select id, firstname from scientist where id > 3 and lastname = 's';

QUERY PLAN
--
Seq Scan on scientist (cost=0.00..17.20 rows=1 width=72)
Filter: ((id > 3) AND ((lastname)::text = 's'::text))

jackwener · 2022-03-17T02:27:02Z

In many cases it doesn't have a big impact on performance, but it does simplify the query plan.

What's more, other rules can benefit from it, such as filter reordering. Because adjacent filters can be applied in any order, once they are merged a rule only needs to look at a single filter operator instead of traversing several plan nodes.
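The rewrite under discussion can be sketched on a toy plan type. This is only an illustration: `Expr`, `Plan`, and `merge_adjacent_filters` below are hypothetical stand-ins, not DataFusion's types and not the code in this PR.

```rust
// Toy stand-ins for a logical plan; these are NOT DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    // An opaque predicate such as "c3" or "c2 = 0.000001".
    Atom(String),
    And(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Filter { predicate: Expr, input: Box<Plan> },
    Projection { columns: Vec<String>, input: Box<Plan> },
}

/// Recursively collapse `Filter(p1, Filter(p2, child))` into
/// `Filter(p1 AND p2, child)`.
fn merge_adjacent_filters(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { predicate, input } => {
            // Rewrite the child first so that a chain of any length collapses.
            match merge_adjacent_filters(*input) {
                Plan::Filter { predicate: inner, input: grandchild } => Plan::Filter {
                    predicate: Expr::And(Box::new(predicate), Box::new(inner)),
                    input: grandchild,
                },
                other => Plan::Filter { predicate, input: Box::new(other) },
            }
        }
        Plan::Projection { columns, input } => Plan::Projection {
            columns,
            input: Box::new(merge_adjacent_filters(*input)),
        },
        scan @ Plan::Scan(_) => scan,
    }
}
```

Applied to a plan shaped like the "Before" output in the description (a projection over two nested filters), this yields a projection over a single filter whose predicate is the conjunction of the two, as in the "After" output.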

jackwener · 2022-03-17T02:31:12Z

This PostgreSQL example is more direct.

explain select id, firstname from 
(select id, firstname, lastname from scientist where id > 3 ) t
 where lastname = 's';

QUERY PLAN
--
Seq Scan on scientist (cost=0.00..17.20 rows=1 width=72)
Filter: ((id > 3) AND ((lastname)::text = 's'::text))

xudong963 · 2022-03-17T02:45:30Z

Makes sense to me, thank you @jackwener

xudong963

LGTM, thanks again @jackwener. I commented on a few minor flaws :)

xudong963 · 2022-03-17T03:17:12Z

datafusion/src/optimizer/merge_adjacent_filter.rs

+        let new_plan = optimize(plan);
+
+        // Apply the optimization to all inputs of the plan
+        let inputs = &new_plan.inputs();


The & is redundant?

datafusion/src/optimizer/merge_adjacent_filter.rs

yjshen

Thanks @jackwener for optimizing this. I think we are on the right track, but we could go a step further and remove redundant conditions while combining filters.

Regarding the current tests, I think we could:

Add more tests if we want to eliminate unnecessary conditions (as suggested in the code comments), covering both conjunctive and disjunctive filters.
Add SQL tests as well.
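One possible reading of "remove redundant conditions while combining filters" is to drop structurally duplicate conjuncts when the two predicates are joined. The sketch below assumes that reading. `Expr`, `split_conjunction`, and `combine_filters` are hypothetical names on a toy expression type, not DataFusion's API.

```rust
// Toy expression type; NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Atom(String),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

/// Flatten nested ANDs into a list of conjuncts. An OR is treated as a
/// single opaque conjunct.
fn split_conjunction(expr: Expr, out: &mut Vec<Expr>) {
    match expr {
        Expr::And(left, right) => {
            split_conjunction(*left, out);
            split_conjunction(*right, out);
        }
        other => out.push(other),
    }
}

/// Combine the predicates of two adjacent filters, keeping only the first
/// occurrence of each structurally identical conjunct.
fn combine_filters(outer: Expr, inner: Expr) -> Expr {
    let mut conjuncts = Vec::new();
    split_conjunction(outer, &mut conjuncts);
    split_conjunction(inner, &mut conjuncts);

    let mut unique: Vec<Expr> = Vec::new();
    for conjunct in conjuncts {
        if !unique.contains(&conjunct) {
            unique.push(conjunct);
        }
    }

    // Rebuild a left-deep AND chain; two inputs guarantee at least one conjunct.
    let mut iter = unique.into_iter();
    let first = iter.next().expect("at least one conjunct");
    iter.fold(first, |acc, e| Expr::And(Box::new(acc), Box::new(e)))
}
```

This only removes exact duplicates. It does not apply logical identities such as absorption, so `(a OR b) AND a` is left as-is, which is one reason separate conjunctive and disjunctive test cases would be useful.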

datafusion/src/optimizer/merge_adjacent_filter.rs

doki23 · 2022-03-19T03:21:26Z

In my humble opinion, we should make the SQL planner produce a single filter when creating the logical plan instead of adding an optimizer rule. In other words, I think there may be a bug in the planner.

xudong963

@doki23's comment reminded me to test this with SQL, and I found that the problem described in the issue linked to this PR does not occur.

I found some other problems

xudong963 · 2022-03-19T10:21:49Z

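The following is the test code I used for the SQL case.

// Imports assumed for the DataFusion version in use at the time.
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::test]
async fn main() -> Result<()> {
    // create local execution context
    let mut ctx = SessionContext::new();

    // register csv file with the execution context
    ctx.register_csv("test", "tests/aggregate_simple.csv", CsvReadOptions::new())
        .await?;

    // build the logical plan for the query (the query itself is not executed)
    let plan = ctx.create_logical_plan(
        "select c1, c2 from test where c3 = true and c2 = 0.000001",
    )?;

    // print the plan for inspection
    dbg!(plan);

    Ok(())
}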

Then I got the plan

Projection: #test.c1, #test.c2
  Filter: #test.c3 = Boolean(true) AND #test.c2 = Float64(0.000001)
    TableScan: test projection=Non

I also tested using datafusion-cli:

❯ create table t as SELECT * FROM (VALUES (1,true), (2,false)) as t;
0 rows in set. Query took 0.003 seconds.
❯ select * from t;
+---------+---------+
| column1 | column2 |
+---------+---------+
| 1       | true    |
| 2       | false   |
+---------+---------+
2 rows in set. Query took 0.002 seconds.
❯ explain select * from t where column1 = 2 and column2 = true;
+---------------+-------------------------------------------------------------------+
| plan_type     | plan                                                              |
+---------------+-------------------------------------------------------------------+
| logical_plan  | Projection: #t.column1, #t.column2                                |
|               |   Filter: #t.column1 = Int64(2) AND #t.column2                    |
|               |     TableScan: t projection=Some([0, 1])                          |
| physical_plan | ProjectionExec: expr=[column1@0 as column1, column2@1 as column2] |
|               |   CoalesceBatchesExec: target_batch_size=4096                     |
|               |     FilterExec: column1@0 = 2 AND column2@1                       |
|               |       RepartitionExec: partitioning=RoundRobinBatch(12)           |
|               |         MemoryExec: partitions=1, partition_sizes=[1]             |
|               |                                                                   |
+---------------+-------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.

Two cases will result in adjacent filters in the logical plan:

Using the DataFrame API: df.xxx.filter(filter1).filter(filter2);
Building the logical plan directly with LogicalPlanBuilder: LogicalPlanBuilder::from(xx).xxx.filter().filter()...

DataFrame users can avoid adjacent filters by combining the predicates before filtering:

    let filter1 = col("b").eq(lit(10));
    let filter2 = col("a").eq(lit("a"));

    let filter = filter1.and(filter2);

    let df = df
        .select_columns(&["a", "b"])?
        .filter(filter)?;
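Why chained `.filter()` calls end up as adjacent filter nodes can be illustrated with a toy builder in which each call wraps the current plan in a new node. This is a sketch under that assumption only: `Plan`, `Builder`, and `filter_depth` are hypothetical and do not reflect how `LogicalPlanBuilder` is actually implemented.

```rust
// Toy plan and builder; NOT DataFusion's `LogicalPlan` or `LogicalPlanBuilder`.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Filter { predicate: String, input: Box<Plan> },
}

struct Builder {
    plan: Plan,
}

impl Builder {
    fn scan(table: &str) -> Self {
        Builder { plan: Plan::Scan(table.to_string()) }
    }

    /// Each call wraps the current plan in a new Filter node, so two
    /// consecutive calls produce two nested (adjacent) filters.
    fn filter(self, predicate: &str) -> Self {
        Builder {
            plan: Plan::Filter {
                predicate: predicate.to_string(),
                input: Box::new(self.plan),
            },
        }
    }

    fn build(self) -> Plan {
        self.plan
    }
}

/// Count how many Filter nodes are stacked directly on top of each other.
fn filter_depth(plan: &Plan) -> usize {
    match plan {
        Plan::Filter { input, .. } => 1 + filter_depth(input),
        Plan::Scan(_) => 0,
    }
}
```

In this model, `Builder::scan("t").filter("b = 10").filter("a = 'a'")` has a filter depth of 2, while passing one combined predicate to a single `filter` call has a depth of 1, which is the effect of the workaround above.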

jackwener · 2022-03-19T13:48:46Z

❤😃 Thanks @xudong963, @doki23

alamb · 2022-03-20T09:41:57Z

For anyone following along, the follow-on work is in #2039 and #2038.

optimizer: merge adjacent filter

33b0e91

jackwener force-pushed the merge branch 2 times, most recently from d45b908 to 33b0e91 Compare March 16, 2022 17:15

jackwener changed the title ~~merge adjacent filter rule for optimizer~~ Merge adjacent filter rule for optimizer Mar 17, 2022

xudong963 previously approved these changes Mar 17, 2022

yjshen reviewed Mar 17, 2022

github-actions bot added the datafusion Changes in the datafusion crate label Mar 18, 2022

optimizer: change record to review

a570a09

jackwener force-pushed the merge branch from f1c497a to a570a09 Compare March 18, 2022 03:32

xudong963 self-requested a review March 19, 2022 10:07

xudong963 reviewed Mar 19, 2022

jackwener closed this Mar 19, 2022

jackwener mentioned this pull request Mar 19, 2022

Filter push down rule cause the wrong plan #2038

Closed

jackwener deleted the merge branch November 24, 2022 02:46
