Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SQL planner support for EXISTS subqueries #2344

Merged
merged 5 commits into from
Apr 27, 2022

Conversation

andygrove
Copy link
Member

Which issue does this PR close?

Closes #2219

Rationale for this change

Add SQL query planner support for EXISTS subqueries

What changes are included in this PR?

The SQL query planner has been updated so that it passes information about the outer schema to subqueries, so that correlated subqueries can resolve fields from the outer query.

Are there any user-facing changes?

Yes, SQL queries with EXISTS subqueries will now result in valid logical plan instead of an error.

@andygrove andygrove self-assigned this Apr 26, 2022
@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Apr 26, 2022
@andygrove
Copy link
Member Author

@alamb Here is the first PR to actually support subqueries in the SQL planner. I'm sure there will be follow up work to make sure the optimizer rules work correctly but this at least enables the creation of a logical plan from a SQL query containing EXISTS subqueries.

@alamb
Copy link
Contributor

alamb commented Apr 26, 2022

I plan to review this shortly

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the plans are right -- I think we need to model them as Joins (or something more sophisticated than Filters with two inputs)

@@ -443,16 +463,21 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
&self,
t: TableWithJoins,
ctes: &mut HashMap<String, LogicalPlan>,
outer_schema: Option<&DFSchema>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that outer means something already in terms of joins, I wonder if a name like outer_query_schema might potentially cause less confusion

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I renamed this.

Comment on lines 4292 to 4296
\n Filter: EXISTS (Subquery: Projection: #person.id, #person.first_name, #person.last_name, #person.age, #person.state, #person.salary, #person.birth_date, #person.😀\
\n Filter: #person.last_name = #p.last_name AND #person.state = #p.state\
\n TableScan: person projection=None)\
\n SubqueryAlias: p\
\n TableScan: person projection=None";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what this plan is doing (as it seems to have a LogicalPlan::Filter node that has multiple inputs which I didn't think was possible 🤔 )

With a correlated subquery like this I think the classic plan uses a Semi-Join like:

"Projection: #p.id\
  Join(type=Semi): #person.last_name = #p.last_name AND #person.state = #p.state\
    TableScan: person projection=None)\
    SubqueryAlias: p\
      TableScan: person projection=None";

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query is hard to parse here and I have pushed a change to improve the formatting in the test. There is one filter in the subquery and one filter in the outer query.

My plan was to implement an optimizer rule to handle re-writing to a semi-join. This rule would be applied regardless of whether the query was created by the SQL planner or via the DataFrame API. Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the change. It is much easier to see what is going on now.

@alamb alamb merged commit 5a1ee4e into apache:master Apr 27, 2022
@andygrove andygrove deleted the sql-planner-exists branch April 27, 2022 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add SQL query planning support for EXISTS subqueries
2 participants