Skip to content

Conversation

@chenkovsky
Copy link
Contributor

@chenkovsky chenkovsky commented Mar 27, 2025

Which issue does this PR close?

Rationale for this change

for sql

with test AS (SELECT i as needle FROM generate_series(1, 10) t(i))
select count(*) from test WHERE 1 = 1;

The root cause is that after optimize_projections.
columns are dropped from logical plan, but in physical plan, there is still a value column.

What changes are included in this PR?

add project in generated series

Are these changes tested?

UT

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Mar 27, 2025
@jayzhan211
Copy link
Contributor

Looks correct

query TT
explain
with test AS (SELECT i as needle FROM generate_series(1, 10) t(i))
select count(*) from test WHERE 1 = 1;
----
logical_plan
01)Projection: count(Int64(1)) AS count(*)
02)--Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
03)----SubqueryAlias: test
04)------SubqueryAlias: t
05)--------TableScan: tmp_table projection=[]
physical_plan
01)ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
02)--AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
03)----CoalescePartitionsExec
04)------AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
05)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
06)----------LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=1, end=10, batch_size=8192]

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 28, 2025

but why don't we have "value" column in logical plan schema

[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::utils] eliminate_filter:
    Projection: count(Int64(1)) AS count(*)
      Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
        SubqueryAlias: test
          Projection: t.i AS needle
            SubqueryAlias: t
              Projection: tmp_table.value AS i
                TableScan: tmp_table
    
[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_cross_join' (pass 0)
[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_limit' (pass 0)
[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'propagate_empty_relation' (pass 0)
[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_one_union' (pass 0)
[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'filter_null_join_keys' (pass 0)
[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_outer_join' (pass 0)
[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'push_down_limit' (pass 0)
[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'push_down_filter' (pass 0)
[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'single_distinct_aggregation_to_group_by' (pass 0)
[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_group_by_constant' (pass 0)
[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'common_sub_expression_eliminate' (pass 0)
[2025-03-28T00:50:56Z DEBUG datafusion_optimizer::utils] optimize_projections:
    Projection: count(Int64(1)) AS count(*)
      Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
        SubqueryAlias: test
          SubqueryAlias: t
            TableScan: tmp_table projection=[]

I thought value should be pushed down to projection but it is dropped 🤔 ?

@chenkovsky
Copy link
Contributor Author

chenkovsky commented Mar 28, 2025

but why don't we have "value" column in logical plan schema

it's also an alternative solution. we can keep the column in logical plan. but i think that count(*) actually doesnt depend on any column on input logically. i cannot say current logicalplan is wrong. so i dont know which way is better.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 29, 2025

count(*) actually doesnt depend on any column on input logically

count(*) need to know the row number of the column, and it doesn't make sense to count all on "empty" column, so I guess keeping the column in logical plan makes more sense

@chenkovsky
Copy link
Contributor Author

count(*) actually doesnt depend on any column on input logically

count(*) need to know the row number of the column, and it doesn't make sense to count all on "empty" column, so I guess keeping the column in logical plan makes more sense

but, for

with test AS (SELECT i as needle FROM generate_series(1, 10) t(i))
select count(*), i from test WHERE 1 = 1 group by i;

there's no row number in both physical and logical plan. do you think we should also add row number for this sql?

@jayzhan211
Copy link
Contributor

count(*) actually doesnt depend on any column on input logically

count(*) need to know the row number of the column, and it doesn't make sense to count all on "empty" column, so I guess keeping the column in logical plan makes more sense

but, for

with test AS (SELECT i as needle FROM generate_series(1, 10) t(i))
select count(*), i from test WHERE 1 = 1 group by i;

there's no row number in both physical and logical plan. do you think we should also add row number for this sql?

In this case, the row number is 1. there are 10 group, each group contains single number.

@chenkovsky
Copy link
Contributor Author

I found that in the following sql.

create table t(i integer);
SELECT count(*) FROM (select count(*) from t);

for the outer aggregation,
both logical plan and physical plan's schemas are empty.

@github-actions github-actions bot added functions Changes to functions implementation and removed core Core DataFusion crate labels Apr 3, 2025
@chenkovsky
Copy link
Contributor Author

@jayzhan211 could you please review it again?

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chenkovsky

) -> Result<Arc<dyn ExecutionPlan>> {
let batch_size = state.config_options().execution.batch_size;

let schema = match projection {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great that you find the root cause instead of having a workaround config!

@jayzhan211 jayzhan211 merged commit a619e00 into apache:main Apr 3, 2025
28 checks passed
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* fix: aggregation corner case

* Update generate_series.rs

* Update physical_planner.rs

* format

* Update generate_series.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

panic when evaluating trivial WHERE with a CTE

2 participants