Initial support for functional dependencies handling primary key and unique constraints #7040

Merged
merged 47 commits into from
Jul 27, 2023

mustafasrepo (Contributor) commented Jul 20, 2023

Which issue does this PR close?

Closes #6190.

Rationale for this change

Primary key information specifies that certain columns of a table contain unique values. This contract lets us support additional queries (especially on unbounded data) that DataFusion currently cannot run.
Consider the query below:

SELECT sn, amount
  FROM sales_global
  GROUP BY sn

When sn is a PRIMARY KEY (each sn value is unique), we know that every column of the table sales_global can be emitted after aggregation, since each group consists of a single row and therefore has exactly one value per column. Functionally, the query above and the following query

SELECT sn, amount
  FROM sales_global
  GROUP BY sn, amount

are the same: because column sn already consists of unique values, both GROUP BY sn, amount and GROUP BY sn produce groups that consist of a single row.
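The equivalence can be checked on a toy table. The following is a minimal Rust sketch, not DataFusion code: the `Row` type, the `group_sizes` helper and the sample rows are all assumptions made up for illustration, and amounts are integers to keep the key type simple.

```rust
use std::collections::BTreeMap;

/// One row of a toy `sales_global` table: (sn, amount).
/// The layout and the values are illustrative assumptions.
type Row = (u32, i64);

/// Groups `rows` by the key produced by `key` and returns the number of
/// rows in each group, ordered by key.
fn group_sizes<K: Ord>(rows: &[Row], key: impl Fn(&Row) -> K) -> Vec<usize> {
    let mut groups: BTreeMap<K, usize> = BTreeMap::new();
    for row in rows {
        *groups.entry(key(row)).or_insert(0) += 1;
    }
    groups.into_values().collect()
}

/// Sample data in which `sn` is unique, as a primary key would be.
fn sample_rows() -> Vec<Row> {
    vec![(1, 50), (2, 75), (3, 75)]
}
```

With these rows, `group_sizes(&sample_rows(), |r| r.0)` and `group_sizes(&sample_rows(), |r| (r.0, r.1))` both yield three groups of one row each, which is why adding amount to the GROUP BY does not change the result.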

SELECT sn, amount
  FROM sales_global
  GROUP BY sn

The query above runs in PostgreSQL. DataFusion, however, could only emit sn from the original table after aggregation. With this PR, DataFusion supports the query above as well.

As part of this PR, we keep track of identifier keys (of which a primary key is a special case) to accomplish this. To illustrate what an identifier key is, consider the table below:

| a | b   |
|---|-----|
| 1 | USD |
| 2 | USD |
| 3 | TRY |

In the table above, column a consists of unique values; it is the primary key. We store this information as

FunctionalDependency {
  source_indices: vec![0],      // determinant: column a
  target_indices: vec![0, 1],   // dependents: columns a and b
  nullable: false,
  mode: Dependency::Single,     // each determinant value is unique
}

This means that when we know the value of the 0th column, we know the values of the 0th and 1st columns deterministically, and each key is unique.
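To make the meaning of such a record concrete, here is a hedged Rust sketch that checks whether a small in-memory table satisfies a declared dependency. The types below only mirror the field names shown in this description; they are stand-ins, not the actual DataFusion definitions, and the `holds` and `project` helpers are hypothetical. The sketch ignores the `nullable` flag.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical stand-ins for the types in the PR description; the real
/// DataFusion definitions may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Dependency {
    Single, // each determinant value occurs at most once
    Multi,  // a determinant value may repeat
}

#[derive(Debug, Clone)]
struct FunctionalDependency {
    source_indices: Vec<usize>,
    target_indices: Vec<usize>,
    nullable: bool, // carried along, but not interpreted by this sketch
    mode: Dependency,
}

type Row = Vec<&'static str>;

/// Picks the values of `row` at the given column indices.
fn project(row: &Row, indices: &[usize]) -> Vec<&'static str> {
    indices.iter().map(|&i| row[i]).collect()
}

/// Returns true when equal source values always come with equal target
/// values and, for `Dependency::Single`, no source value repeats.
fn holds(dep: &FunctionalDependency, rows: &[Row]) -> bool {
    let mut seen: HashMap<Vec<&'static str>, Vec<&'static str>> = HashMap::new();
    for row in rows {
        let source = project(row, &dep.source_indices);
        let target = project(row, &dep.target_indices);
        match seen.entry(source) {
            Entry::Occupied(entry) => {
                if dep.mode == Dependency::Single || *entry.get() != target {
                    return false;
                }
            }
            Entry::Vacant(entry) => {
                entry.insert(target);
            }
        }
    }
    true
}
```

Under these assumptions, the three-row table above satisfies the `Single` dependency, while a table that repeats a key does not.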
As another example, consider the table below:

| a | b   |
|---|-----|
| 1 | USD |
| 1 | USD |
| 2 | USD |
| 2 | USD |
| 3 | TRY |
| 3 | TRY |
3 TRY

In the example above, column a is not a primary key; however, knowing the value of a still lets us determine the value of b. We encode this information as

FunctionalDependency {
  source_indices: vec![0],      // determinant: column a
  target_indices: vec![0, 1],   // dependents: columns a and b
  nullable: false,
  mode: Dependency::Multi,      // a determinant value may repeat
}
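The difference between the two modes can be sketched by inferring the mode from data. This is only an illustration: in the PR the mode is derived from constraints declared at the source, not from scanning rows, and the `infer_mode` helper and its two-column input are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the dependency mode named in the PR description.
#[derive(Debug, PartialEq)]
enum Dependency {
    Single,
    Multi,
}

/// Inspects (a, b) pairs and reports how `a` determines `b`:
/// - `None` if some value of `a` appears with two different values of `b`,
/// - `Some(Single)` if every value of `a` appears exactly once,
/// - `Some(Multi)` if values of `a` repeat but always with the same `b`.
fn infer_mode(rows: &[(u32, &str)]) -> Option<Dependency> {
    let mut seen: HashMap<u32, &str> = HashMap::new();
    let mut repeated = false;
    for &(a, b) in rows {
        match seen.insert(a, b) {
            Some(prev) if prev != b => return None,
            Some(_) => repeated = true,
            None => {}
        }
    }
    Some(if repeated {
        Dependency::Multi
    } else {
        Dependency::Single
    })
}
```

Applied to the two example tables, the first would be classified as `Single` and the second as `Multi`.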

All of this information is propagated from the primary key information at the source. This analysis enables us to support complex queries. As an example, while we can support the query below

SELECT r.sn, SUM(l.amount), r.amount
  FROM sales_global_with_pk AS l
  JOIN sales_global_with_pk AS r
  ON l.sn >= r.sn
  GROUP BY r.sn
  ORDER BY r.sn

we reject the following query

SELECT l.sn, r.sn, SUM(l.amount), r.amount
  FROM sales_global_with_pk AS l
  JOIN sales_global_with_pk AS r
  ON l.sn >= r.sn
  GROUP BY l.sn
  ORDER BY l.sn

because the second query uses a column (r.sn) that is not associated with l.sn after the GROUP BY, whereas the first query uses only columns associated with r.sn (r.sn, r.amount) in its projection.
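The accept/reject decision for these two queries can be sketched as a check over column indices. This is a simplified Rust illustration under stated assumptions: the joined schema is taken to be l.sn = 0, l.amount = 1, r.sn = 2, r.amount = 3, both sides are assumed to keep their dependency after the join, and the `Dep`, `allowed_columns` and `first_rejected` names are hypothetical rather than DataFusion APIs.

```rust
/// Hypothetical, simplified dependency record: the `source` columns
/// determine the `target` columns.
struct Dep {
    source: Vec<usize>,
    target: Vec<usize>,
}

/// Columns that may appear outside aggregate functions after grouping:
/// the group-by columns plus every column determined by them.
fn allowed_columns(group_by: &[usize], deps: &[Dep]) -> Vec<usize> {
    let mut allowed: Vec<usize> = group_by.to_vec();
    for dep in deps {
        if dep.source.iter().all(|s| group_by.contains(s)) {
            for &t in &dep.target {
                if !allowed.contains(&t) {
                    allowed.push(t);
                }
            }
        }
    }
    allowed.sort_unstable();
    allowed
}

/// Returns the first non-aggregate select column that is not allowed,
/// or `None` when the select list is acceptable.
fn first_rejected(select: &[usize], group_by: &[usize], deps: &[Dep]) -> Option<usize> {
    let allowed = allowed_columns(group_by, deps);
    select.iter().copied().find(|c| !allowed.contains(c))
}
```

With dependencies {0} -> {0, 1} and {2} -> {2, 3}, grouping by r.sn (index 2) allows r.sn and r.amount, so the first query passes. Grouping by l.sn (index 0) allows only l.sn and l.amount, so the second query is rejected at r.sn.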

What changes are included in this PR?

Are these changes tested?

Yes, new tests that show supported functionality are added to the groupby.slt file.

Are there any user-facing changes?

mustafasrepo and others added 30 commits June 20, 2023 16:31
Co-authored-by: Metehan Yıldırım <100111937+metesynnada@users.noreply.github.com>
alamb marked this pull request as draft July 22, 2023 16:21

alamb (Contributor) commented Jul 22, 2023

Marking as draft given @ozankabak's suggestion #7040 (comment) as feedback is incorporated

alamb marked this pull request as ready for review July 22, 2023 16:34

alamb (Contributor) commented Jul 22, 2023

Sorry -- I missed the update. I will review this shortly

ozankabak changed the title from "Utilize Primary key information, to support additional queries" to "Initial support for functional dependencies handling primary key and unique constraints" Jul 24, 2023

ozankabak (Contributor) commented Jul 24, 2023

@alamb, we decided to do a couple more refactors to support constraints involving composite keys, primary key constraints and uniqueness constraints neatly within this PR. We will only leave foreign keys as future work so that TD is minimal.

I am marking this as a draft until it is ready for you.

ozankabak marked this pull request as draft July 24, 2023 10:15

alamb (Contributor) commented Jul 24, 2023

> @alamb, we decided to do a couple more refactors to support constraints involving composite keys, primary key constraints and uniqueness constraints neatly within this PR. We will only leave foreign keys as future work so that TD is minimal.

Sounds good -- thank you @ozankabak

mustafasrepo marked this pull request as ready for review July 25, 2023 15:25

mustafasrepo (Contributor, Author)

@alamb, I have implemented your suggestions. We now support unique and primary key constraints at the source (foreign keys are not supported yet). I have extended the existing tests to cover unique and composite primary key cases. It is now ready for further review.

ozankabak (Contributor)

This is looking very good now! It provides a good foundation to build even more functional dependencies as we move forward and we already have some plans for that :)

alamb (Contributor) commented Jul 25, 2023

Awesome -- can't wait to review this. Will do so first thing tomorrow

alamb (Contributor) left a comment

Thank you @mustafasrepo -- I went over this PR carefully and it looks really nice to me

Suggestions for this PR or follow on PRs:

  1. Double-check whether the constraint calculation should be in the SQL planner.
  2. Consider adding Constraints to make the TableProvider cleaner

I left some other minor comments but otherwise I think this is good to merge

SortPreservingMergeExec: [sn@0 ASC NULLS LAST]
--SortExec: expr=[sn@0 ASC NULLS LAST]
----ProjectionExec: expr=[sn@0 as sn, amount@1 as amount, 2 * CAST(sn@0 AS Int64) as Int64(2) * s.sn]
------AggregateExec: mode=FinalPartitioned, gby=[sn@0 as sn, amount@1 as amount], aggr=[]
Contributor

If the plan is grouping on a unique column I think the AggregateExec could be avoided entirely because each group will have exactly one non null row

Maybe that would be good optimization to add in a follow on PR
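As a rough illustration of the condition such a follow-on optimization might test, here is a Rust sketch. It is not part of this PR and not a DataFusion API: the `Mode` and `Dep` types and the `aggregate_is_redundant` function are hypothetical, and the rule is deliberately conservative (no aggregate expressions, a non-nullable unique key fully contained in the group-by columns).

```rust
/// Hypothetical stand-in for the dependency mode from the PR description.
#[derive(PartialEq)]
enum Mode {
    Single,
    Multi,
}

/// Hypothetical, simplified dependency record.
struct Dep {
    source: Vec<usize>,
    nullable: bool,
    mode: Mode,
}

/// Conservative check: grouping is a no-op when there are no aggregate
/// expressions and the group-by columns contain a non-nullable unique key,
/// because every group then holds exactly one row.
fn aggregate_is_redundant(group_by: &[usize], has_aggregates: bool, deps: &[Dep]) -> bool {
    !has_aggregates
        && deps.iter().any(|d| {
            d.mode == Mode::Single
                && !d.nullable
                && d.source.iter().all(|s| group_by.contains(s))
        })
}
```

For the plan above (gby=[sn, amount], aggr=[]) with a non-nullable unique sn at index 0, this check would return true. A nullable unique column is excluded because several NULL keys could still collapse into one group.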

@@ -431,6 +432,37 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
group_by_exprs: Vec<Expr>,
aggr_exprs: Vec<Expr>,
) -> Result<(LogicalPlan, Vec<Expr>, Option<Expr>)> {
let schema = input.schema();
Contributor

Could you explain why the dependency logic needs to be in the SQL planner? It seems more general than just SQL (it should apply to dataframes as well)

Also, this function calls LogicalPlanBuilder::aggregate, which then eventually calls Aggregate::try_new_with_schema, which also recalculates the functional dependency information

Maybe this logic could be put in Aggregate::try_new_with_schema if it is not redundant

mustafasrepo (Contributor, Author) Jul 27, 2023

Here, we extend the group by exprs with the expressions at the target indices (if they appear among the select exprs). With this change, we can use dependents that are among the select exprs when their determinant is among the group by expressions.

We can do so inside Aggregate::try_new_with_schema if we pass select_expr argument to it. I thought it is a bit weird for aggregate to receive select_expr argument. After this PR merges, I can file another PR, that implements this change. Then we can decide which implementation is better. In any case, I have moved groupby update logic to its own function to not clutter this method.

Contributor

> I thought it is a bit weird for aggregate to receive select_expr argument.

Yes I agree that sounds weird. I still don't understand why the functional dependency can't be computed from the plan itself (e.g. from the annotations on DFSchema and then the structure of the LogicalPlan::Aggregate / LogicalPlan::Project, etc)

If the logic is in the sql planner, that implies the calculation relies on information that is not encoded in the plan itself (and thus may be lost in subsequent optimizer passes, for example)

> After this PR merges, I can file another PR, that implements this change. Then we can decide which implementation is better. In any case, I have moved groupby update logic to its own function to not clutter this method.

Sounds good

Contributor, Author

> Yes I agree that sounds weird. I still don't understand why the functional dependency can't be computed from the plan itself

Currently, if the query is

SELECT sn, amount
  FROM sales_global
  GROUP BY sn

and we know that sn is a primary key, we rewrite the query above as

SELECT sn, amount
  FROM sales_global
  GROUP BY sn, amount

so that the select contains only group by expressions or aggregate expressions. This update is done here.
However, if the query were

SELECT sn
  FROM sales_global
  GROUP BY sn

we wouldn't rewrite the query as

SELECT sn
  FROM sales_global
  GROUP BY sn, amount

since amount is not used, so we do not need to update the group by here.

For this reason, we need the information in the select exprs to decide how to extend the group by expressions to support a query.
However, if we were to extend the group by expressions with their functional dependents in all cases, we would not need the select exprs for this decision. For example, we would treat

SELECT sn
  FROM sales_global
  GROUP BY sn

as

SELECT sn
  FROM sales_global
  GROUP BY sn, amount

under the hood.

What do you think about extending the group by expressions in all cases, without checking the select exprs? I think this is a bit confusing.

In short, functional dependency can be calculated without external information. However, query rewrite cannot be done without resorting to the information in the select exprs in current design.
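The rewrite described in this comment can be sketched over column indices. This is a minimal Rust illustration, assuming sn = 0 and amount = 1; the `Dep` type and the `extend_group_by` function are hypothetical names and do not reproduce the actual planner code.

```rust
/// Hypothetical, simplified dependency record: the `source` columns
/// determine the `target` columns.
struct Dep {
    source: Vec<usize>,
    target: Vec<usize>,
}

/// Extends the group-by columns with dependents of grouped determinants,
/// but only with those dependents that the select list actually uses.
fn extend_group_by(group_by: &[usize], select: &[usize], deps: &[Dep]) -> Vec<usize> {
    let mut extended = group_by.to_vec();
    for dep in deps {
        // The determinant must be fully covered by the original group-by.
        if !dep.source.iter().all(|s| group_by.contains(s)) {
            continue;
        }
        for &t in &dep.target {
            if select.contains(&t) && !extended.contains(&t) {
                extended.push(t);
            }
        }
    }
    extended
}
```

With the dependency {0} -> {0, 1}, selecting sn and amount while grouping by sn extends the group by to [0, 1], whereas selecting only sn leaves it as [0]. The dependence on `select` is what ties this step to the SQL-level rewrite rather than to the plan alone.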

Contributor

> What do you think about extending the group by expressions in all cases, without checking the select exprs? I think this is a bit confusing.

I agree

> In short, functional dependency can be calculated without external information. However, query rewrite cannot be done without resorting to the information in the select exprs in current design.

I think this is the key point I was missing -- that the code in the sql planner is actually doing a SQL level rewrite. Thank you for clarifying.

What would you think about pulling the rewrite code into its own function and adding a clear explanation of what it was doing (basically copy/paste your example from above)?

Contributor, Author

Thanks @alamb. I have opened a mini PR for the update of docstring.

alamb merged commit 5f03146 into apache:main Jul 27, 2023

alamb (Contributor) commented Jul 27, 2023

Thanks @mustafasrepo and @ozankabak -- this is a really nice step forward

Labels
core (Core DataFusion crate), logical-expr (Logical plan and expressions), optimizer (Optimizer rules), sql (SQL Planner), sqllogictest (SQL Logic Tests (.slt))
Development

Successfully merging this pull request may close these issues.

Utilize PRIMARY KEY information better
3 participants