Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: functional dependency optimization #7438

Open
leiysky opened this issue Sep 1, 2022 · 12 comments
Open

Feature: functional dependency optimization #7438

leiysky opened this issue Sep 1, 2022 · 12 comments
Assignees
Labels
A-planner Area: planner/optimizer C-feature Category: feature

Comments

@leiysky
Copy link
Contributor

leiysky commented Sep 1, 2022

A functional dependency is a relationship between two attributes in the same database.

If an attribute determines another attribute, we will call it a determinant, and the determined attribute is functionally dependent on the determinant.

We will denote the relationship with $X \rightarrow Y$ if $X$ is a determinant of $Y$, where both $X$ and $Y$ can be a set of attributes.

For example, given a SQL:

select a, a+1 as b from t;

Here the attribute a is a determinant of b because b is derived from a. Thus we can get a functional dependency $(a) \rightarrow (b)$.

Functional dependency is helpful in query optimization, with it we can eliminate outer joins, eliminate useless DISTINCT, eliminate cross joins and etc.

Reference wiki and the thesis for more details.

The functional dependencies can be derived from relational operators, so it's possible to encapsulate them into RelationalProperty.

@leiysky leiysky added C-feature Category: feature A-planner Area: planner/optimizer labels Sep 1, 2022
@leiysky leiysky moved this to 📒Backlog in Databend Query Planner Sep 1, 2022
@AngleNet
Copy link
Contributor

AngleNet commented Sep 8, 2022

@leiysky HI. I would like to take this issue. Is there any one working on this? I swear I will finish this on my own ~~~ :(

@leiysky
Copy link
Contributor Author

leiysky commented Sep 8, 2022

@leiysky HI. I would like to take this issue. Is there any one working on this? I swear I will finish this on my own ~~~ :(

We have a plan to implement this after #6547 is done.

You are still welcome to take the issue, but I suggest you make a proposal first so we can review it and give some advice since this is an important infrastructure for the optimizer.

@AngleNet
Copy link
Contributor

AngleNet commented Sep 8, 2022

Great. I will write a proposal on this.

@BohuTANG
Copy link
Member

BohuTANG commented Sep 8, 2022

Can this function be optimized?
#7468

@AngleNet
Copy link
Contributor

AngleNet commented Sep 8, 2022

Can this function be optimized?
#7468

I think so. I will consider it inside the proposal.

@andylokandy
Copy link
Collaborator

#7468

This is about CSE optimization. Is it included in functional dependency optimization?

@AngleNet
Copy link
Contributor

#7468

This is about CSE optimization. Is it included in functional dependency optimization?

HI. I think we could do this via functional dependency. What do you mean by CSE optimization?

@andylokandy
Copy link
Collaborator

andylokandy commented Sep 13, 2022

CSE is abbr of Common Subexpression Elimination. Let's explain with an example: Given query SELECT max(a+1), (a+1)/2 FROM t1, the CSE optimizer can rewrite it into SELECT max(b), b/2 FROM (SELECT a+1 AS b FROM t1). The actual implementation usually is adding a column for the sub-expression and then hiding it in the top projection, without constructing such sub-query.

I'm not familiar with functional dependency, but from the very limited knowledge I've learned from the wiki, functional dependency is discovered from the relation between data of different columns and can be broken during updates on data.

@AngleNet
Copy link
Contributor

AngleNet commented Sep 14, 2022

@andylokandy @BohuTANG Sorry. After some investigations, I think I was wrong about doing CSE optimization (the duplicate json_parse elimination) via functional dependency analysis. It's basically another problem to solve. For CSE optimization, I think we could build a common sub-expression collector and rewrite project list or predicate conditions in a select-where-from block via discovered CSEs. The basic idea of the CSE collector is to rewrite each expression in a canonical form which uniquely identify a set of semantically equal expressions, search CSE as visiting it. If the leaf node of an expression is a column during canonical form rewriting, rewrite it to its equivalent column with smallest column id according to an equality inferrer which could be used to infer whether two columns are equal.

@leiysky
Copy link
Contributor Author

leiysky commented Sep 15, 2022

@AngleNet I've read your proposal, but would you mind putting the document into a GitHub issue so we can review it without login Tencent account?

@AngleNet
Copy link
Contributor

@AngleNet I've read your proposal, but would you mind putting the document into a GitHub issue so we can review it without login Tencent account?

Good. I will put it in a issue later.

@AngleNet
Copy link
Contributor

@AngleNet I've read your proposal, but would you mind putting the document into a GitHub issue so we can review it without login Tencent account?

Good. I will put it in a issue later.

The draft doc is #7693

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-planner Area: planner/optimizer C-feature Category: feature
Projects
Status: 📒Backlog
Development

No branches or pull requests

5 participants