Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't preserve functional dependency when generating UNION logical plan #44

Merged
merged 2 commits into from
Oct 16, 2024

Conversation

Sevenannn
Copy link

@Sevenannn Sevenannn commented Oct 15, 2024

Which issue does this PR close?

Rationale for this change

This PR discards the functional dependencies when generating the UNION logical plan, thus helping avoid FDs that no longer exist being used by further operations, e.g. aggregation.

When the datafusion logical planner build the AGGREGATE plan, it adds additional columns in the group_expr based on the functional dependencies.

However, for queries that are aggregating upon table obatined through UNION operation, the functional dependency is still preserved in the schema of UNION plan, while the functional dependency no longer retains after the UNION.

Table 1:

col1 | col2
-----|-----
a    | 1
b    | 2

Table 2:

col1 | col2
-----|-----
a    | 2
b    | 4

In both Table1 and Table2, the functional dependency col1 -> col2 holds. However, when select * from table1 UNION select * from table2, the functional dependency col1 -> col2 no longer holds.

This causes trouble in further aggregation based on UNION results, consider the following query:

with t1 as (
    select i_manufact_id, count(*) as extra from item
    group by i_manufact_id
),
t2 as (
    select i_manufact_id, count(*) as extra from item
    group by i_manufact_id
)
select i_manufact_id, sum(extra)
 from  (select * from t1
        union all
        select * from t2) tmp1
 group by i_manufact_id
 order by i_manufact_id;

Due to the wrongly preserved functional dependency, the query generates wrong logical plan in the final aggregation step

Aggregate: groupBy=[[tmp1.i_manufact_id, tmp1.extra]], aggr=[[sum(tmp1. extra)]]       

In the test added, the result would contain duplicated groups without changes made in this PR:
image

What changes are included in this PR?

  • Changes to eliminate functional dependency when building UNION logical plan
  • Unit test to verify the changes

Are these changes tested?

Yes

Are there any user-facing changes?

The target columns described by FD will no longer be wrongly included in the aggregate group_by columns

@Sevenannn Sevenannn marked this pull request as ready for review October 15, 2024 22:06
@Sevenannn Sevenannn merged commit 4e34dad into spiceai-42 Oct 16, 2024
@Sevenannn Sevenannn deleted the qianqian/fix-union-fd branch October 16, 2024 19:20
Sevenannn added a commit that referenced this pull request Oct 17, 2024
…an (#44)

* Don't preserve functional dependency when generating UNION logical plan

* Remove extra lines
Sevenannn added a commit that referenced this pull request Oct 26, 2024
…an (#44) (apache#12979)

* Don't preserve functional dependency when generating UNION logical plan

* Remove extra lines
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants