Initial support for functional dependencies handling primary key and unique constraints #7040
Conversation
Marking as draft per @ozankabak's suggestion #7040 (comment) while the feedback is incorporated.
Sorry -- I missed the update. I will review this shortly.
@alamb, we decided to do a couple more refactors to support constraints involving composite keys, primary key constraints and uniqueness constraints neatly within this PR. We will only leave foreign keys as future work so that TD is minimal. I am marking this as a draft until it is ready for you.
Sounds good -- thank you @ozankabak
@alamb, I have implemented your suggestions. Now we support unique and primary key constraints at the source (foreign keys are not supported yet). I have extended the existing tests to cover unique and composite primary key cases. It is now ready for further review.
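For reference, a constraint of this kind can be declared at the source via SQL DDL. A minimal sketch (the table and column names here are illustrative, not taken from the actual tests, and the exact DDL accepted may differ):

```sql
-- sn is declared as the primary key, so its values are guaranteed unique
CREATE TABLE sales_global (
    sn BIGINT PRIMARY KEY,
    amount DOUBLE
);
```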
This is looking very good now! It provides a good foundation to build even more functional dependencies as we move forward and we already have some plans for that :)
Awesome -- can't wait to review this. Will do so first thing tomorrow.
Thank you @mustafasrepo -- I went over this PR carefully and it looks really nice to me
Suggestions for this PR or follow-on PRs:
- Double check whether the constraint calculation should be in the SQL planner.
- Consider adding Constraints to make the TableProvider cleaner.
I left some other minor comments but otherwise I think this is good to merge
SortPreservingMergeExec: [sn@0 ASC NULLS LAST]
--SortExec: expr=[sn@0 ASC NULLS LAST]
----ProjectionExec: expr=[sn@0 as sn, amount@1 as amount, 2 * CAST(sn@0 AS Int64) as Int64(2) * s.sn]
------AggregateExec: mode=FinalPartitioned, gby=[sn@0 as sn, amount@1 as amount], aggr=[]
If the plan is grouping on a unique column, I think the AggregateExec could be avoided entirely, because each group will have exactly one non-null row.
Maybe that would be a good optimization to add in a follow-on PR.
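As an illustration of that idea, a hedged SQL sketch (assuming sn is the unique primary key of sales_global, and ignoring NULL handling):

```sql
-- Because sn is unique, every group below contains exactly one row ...
SELECT sn, amount FROM sales_global GROUP BY sn;

-- ... so the result matches a plain projection, and an optimizer rule
-- could in principle elide the AggregateExec entirely.
SELECT sn, amount FROM sales_global;
```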
datafusion/sql/src/select.rs
@@ -431,6 +432,37 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
        group_by_exprs: Vec<Expr>,
        aggr_exprs: Vec<Expr>,
    ) -> Result<(LogicalPlan, Vec<Expr>, Option<Expr>)> {
        let schema = input.schema();
Could you explain why the dependency logic needs to be in the SQL planner? It seems more general than just SQL (it should apply to dataframes as well).
Also, this function calls LogicalPlanBuilder::aggregate, which then eventually calls Aggregate::try_new_with_schema, which also re-calculates functional dependency information. Maybe this logic could be put in Aggregate::try_new_with_schema if it is not redundant.
Here, we update the group-by expressions with the expressions at the target indices (if they are among the select expressions). With this change, we can use dependent columns that appear in the select expressions when their determinant is inside the group-by expressions.
We could do this inside Aggregate::try_new_with_schema if we passed a select_expr argument to it, but I thought it is a bit weird for the aggregate to receive a select_expr argument. After this PR merges, I can file another PR that implements this change; then we can decide which implementation is better. In any case, I have moved the group-by update logic into its own function to not clutter this method.
I thought it is a bit weird for aggregate to receive select_expr argument.
Yes I agree that sounds weird. I still don't understand why the functional dependency can't be computed from the plan itself (e.g. from the annotations on DFSchema and then the structure of the LogicalPlan::Aggregate / LogicalPlan::Project, etc)
If the logic is in the sql planner, that implies the calculation relies on information that is not encoded in the plan itself (and thus may be lost in subsequent optimizer passes, for example)
After this PR merges, I can file another PR, that implements this change. Then we can decide which implementation is better. In any case, I have moved groupby update logic to its own function to not clutter this method.
Sounds good
Yes I agree that sounds weird. I still don't understand why the functional dependency can't be computed from the plan itself
Currently, if the query is

SELECT sn, amount
FROM sales_global
GROUP BY sn

and we know that sn is the primary key, we rewrite the query above as

SELECT sn, amount
FROM sales_global
GROUP BY sn, amount

so that the select list contains either group-by expressions or aggregate expressions. This update is done here.
However, if the query were

SELECT sn
FROM sales_global
GROUP BY sn

we wouldn't rewrite it as

SELECT sn
FROM sales_global
GROUP BY sn, amount

since amount is not used, and we do not need to update the group by here.
For this reason, we need the information in the select exprs to decide how to extend the group-by expressions to support a query.
However, if we were to extend the group-by expressions with their functional dependencies in all cases, we wouldn't need the select exprs for this decision. For instance, we would treat

SELECT sn
FROM sales_global
GROUP BY sn

as

SELECT sn
FROM sales_global
GROUP BY sn, amount

under the hood.
What do you think about extending the group-by expressions in all cases, without checking the select exprs? I think this is a bit confusing.
In short, the functional dependency can be calculated without external information. However, the query rewrite cannot be done without resorting to the information in the select exprs in the current design.
What do you think about extending group by expression in all cases, without checking for select exprs. I think this is a bit confusing.
I agree
In short, functional dependency can be calculated without external information. However, query rewrite cannot be done without resorting to the information in the select exprs in current design.
I think this is the key point I was missing -- that the code in the sql planner is actually doing a SQL level rewrite. Thank you for clarifying.
What would you think about pulling the rewrite code into its own function and adding a clear explanation of what it was doing (basically copy/paste your example from above)?
Thanks @mustafasrepo and @ozankabak -- this is a really nice step forward
Which issue does this PR close?
Closes #6190.
Rationale for this change
Primary key information specifies that some of the columns in the table consist of unique values. This contract enables us to support additional queries (especially on unbounded data) that currently cannot be run by DataFusion.
Consider the query below.
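A sketch of the kind of query in question (assuming the sales_global table from the review thread above, with sn as its primary key):

```sql
SELECT sn, amount
FROM sales_global
GROUP BY sn;
```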
When sn is the PRIMARY KEY (each sn value is unique), we know that all the columns of the table sales_global can be emitted after aggregation, since for each group all rows have the same value. In terms of functionality, the query above and the following query are the same.
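For concreteness, a sketch of that equivalent query (same assumed sales_global example):

```sql
SELECT sn, amount
FROM sales_global
GROUP BY sn, amount;
```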
The reason is that, since column sn already consists of unique values, both GROUP BY sn, amount and GROUP BY sn will produce groups that consist of a single row. The first query above can run in Postgres; however, DataFusion can only emit sn after aggregation from the original table. With this PR, that query is also supported by DataFusion.
As part of this PR, we keep track of identifier keys (where a primary key is a special case) to accomplish this. To illustrate what an identifier key is:
Consider a table with two columns, a and b, where column a consists of unique values: it is a primary key. We store this information as a functional dependency stating that, when we know the value of the 0th column, we know the values of the 0th and 1st columns deterministically, and that each key is unique.
As another example, consider a table where column a is not a primary key, but knowing the value of a still enables us to know the value of b deterministically. We encode this information similarly, as a functional dependency from a to b.
All of this information is propagated from the primary key information at the source. This analysis enables us to support complex queries. As an example, we can support a query of the kind sketched just below.
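A hedged sketch of the supported pattern (the self-join and its ON condition are illustrative assumptions, not copied from the test file):

```sql
-- grouping on r.sn; the projected r.amount is functionally dependent on r.sn
SELECT r.sn, r.amount, SUM(l.amount)
FROM sales_global AS l
JOIN sales_global AS r ON l.sn >= r.sn
GROUP BY r.sn;
```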
However, we reject a query of the following kind.
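A hedged sketch of the rejected pattern (same illustrative self-join):

```sql
-- grouping on l.sn; r.sn is not functionally dependent on l.sn, so this query is rejected
SELECT l.sn, r.sn, SUM(l.amount)
FROM sales_global AS l
JOIN sales_global AS r ON l.sn >= r.sn
GROUP BY l.sn;
```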
The reason is that the second query uses an unassociated column (r.sn) with l.sn after the group by, whereas the first query uses only associated columns (r.sn, r.amount) with r.sn during projection.
What changes are included in this PR?
Are these changes tested?
Yes, new tests that show the supported functionality are added to the groupby.slt file.
Are there any user-facing changes?