Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce the ability to express a correlated subquery #119

Closed
jacques-n opened this issue Jan 4, 2022 · 3 comments · Fixed by #134
Closed

Introduce the ability to express a correlated subquery #119

jacques-n opened this issue Jan 4, 2022 · 3 comments · Fixed by #134

Comments

@jacques-n
Copy link
Contributor

jacques-n commented Jan 4, 2022

We should introduce a way to express a correlated subquery. We need to come up with the best way to represent this. Some prior art:

  • Calcite - A filter relation that defines a dynamic variable of it's data that is used in a new expression type that contains a subtree. The subquery subtree references the fields of this dynamic variable to connect the sub and main trees. ([dynamic variable].[field name] e.g. $cor0.P_PARTKEY)
  • Trino - A relational node type that has a special set symbols from outer tree that are then referenced via special expression assignments from the inner tree. (I think, just a quick review of the code.)
  • Spark Looks like an expression that contains a subtree, a set of outer expressions and an exprId (that I believe is used inside the inner subtree). This looks/feels a bit like Calcite although Calcite also has the agg functions & groupings.

Other examples people think that should provide inspiration?

@jacques-n
Copy link
Contributor Author

For reference, the following tpch queries require a subquery (or decorrelation): 2, 4, 11, 15, 16, 17, 18, 20, 21.

@cpcloud
Copy link
Contributor

cpcloud commented Jan 24, 2022

After looking into this a bit there are a number of things to address:

What is the expression type of a subquery (correlated or not) used in, for example, a WHERE clause?

I'm not entirely sure whether a subquery should be a distinct relation, or whether it's a special kind of Expression that has a Rel as input.

This has meaningful consequences for the proto implementation, since this would require an expression to reference a relation, which requires expressions to import relations and this isn't allowed in protobuf (protocolbuffers/protobuf#5504). We would have to introduce a level of indirection using something like a relation id that would allow a subquery Expression to refer to a relation.

What is the right way to track outer references at multiple levels?

For example, a subquery with a subquery with a subquery that references the top-level relation.

  1. It looks like in Calcite, there's a unique id for a given relation so there's no need to explicitly track something like the "level" of an outer query.
  2. In Trino, it looks like there's a function that accumulates the outer refs
  3. I'm not sure how spark manages to capture refs above the current outer query's scope

@jacques-n
Copy link
Contributor Author

My inclination is definitely towards a new type of expression. (Let's ignore the proto impl issue for the moment.)

In terms of determining the level, do you think there is a way of defining this with relative numbering? I'd really prefer that over the naming scheme that things like Calcite seem to use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants