Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: support LATERAL joins #3713

Merged
merged 8 commits into from
Jul 28, 2020
Merged

Conversation

benesch
Copy link
Member

@benesch benesch commented Jul 23, 2020

This is still pretty panicky on some of the Postgres test cases, but it mostly sorta works, so putting up in case anyone is curious.

A LATERAL join allows the right-hand side of the join to access columns
defined on the left-hand side of the join. A simple example from the
PostgreSQL docs is the query

SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss

which is equivalent to:

SELECT * FROM foo, bar WHERE bar.id = foo.bar_id;

The hope is that LATERAL joins will be useful for expressing "top-k
within a group", as in:

SELECT * FROM
    (SELECT DISTINCT cat FROM foo) grp,
    JOIN LATERAL (SELECT * FROM foo WHERE foo.cat = grp.cat ORDER BY foo.val LIMIT $k)

Fixes MaterializeInc/database-issues#1020.
Fixes MaterializeInc/database-issues#1158.


This change is Reviewable

@benesch benesch force-pushed the lateral-join branch 3 times, most recently from 038a614 to c01af79 Compare July 23, 2020 20:39
@benesch benesch requested a review from frankmcsherry July 23, 2020 20:39
@benesch
Copy link
Member Author

benesch commented Jul 23, 2020

Ok, this is ready for a look. Might make sense to sit down together @frankmcsherry so I can walk you through the approach.

We pass all the PostgreSQL lateral join tests that I managed to port, save for two that cause "arrangement alarmingly absent" panics that I am hopeful are not my fault. Getting these tests to pass required fixing various unrelated bugs in the SQL planner, which is why there are so many commits in this stack.

The performance of many of these queries is abysmal, so we'll just have to warn folks to be really careful with these joins until our optimizer gets better (if ever; lateral joins are a great way to write n^2^2^2^2 queries).

@benesch
Copy link
Member Author

benesch commented Jul 24, 2020

Some debugging info from one of the panicky queries: https://gist.github.com/benesch/288674bae2fcd6376ceaa2714547ed03

Comment on lines 196 to 208
| Get materialize.public.x (u5)
| ArrangeBy (#0)

%1 =
| Get materialize.public.x (u5)
| FlatMap generate_series(1, #0)
| | demand = (#0..#2)
| | demand = (#0, #2)

%2 =
| Join %0 %1 (= #0 #2)
| | implementation = Differential %1 %0.(#0)
| | demand = (#0, #1, #4)
| Project (#0, #1, #4)
Copy link
Contributor

@frankmcsherry frankmcsherry Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a beefy regression of a class we should sort out, somehow. I expect this falls under the general heading of "branch doesn't always need to join the left side again". We should at least decide who's job we want to make doing that, though!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for the subsequent EXPLAIN diffs; certainly going to be perf regressions there, pretty substantial to for something like json decoding (FlatMap is currently stateless; these joins will break that and have us keep the whole input collection around).

Copy link
Contributor

@frankmcsherry frankmcsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, everything looks sane here except for some serious planning regressions. We should sort out if we can live with these briefly, or figure out how to fix them first. I think they are nearly fatal for certain workloads (e.g. using TVF for decoding JSON), though I don't think the actual regressions demonstrate that case.

@benesch benesch force-pushed the lateral-join branch 2 times, most recently from 8e872dc to fb64d76 Compare July 25, 2020 04:54
Comment on lines 467 to 468
| RelationExpr::Threshold { .. }
| RelationExpr::Negate { .. }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure that these two need to be in here. Do you have a sense for if they need to be, or if they just weren't obviously safe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter! Anyway, reworked, see incoming comment.

benesch added 8 commits July 27, 2020 17:03
These are MSSQL-specific and therefore irrelevant to us.
These are MSSQL-specific and therefore irrelevant to us.
It is confusing to use TableFactor::Table for both

   FROM foo

and:

   FROM foo (a, b, c)

With this patch, the latter gets a separate TableFactor::Function
variant.
PostgreSQL permits table functions to be preceded by the optional noise
word LATERAL, as in:

    SELECT * FROM foo, LATERAL bar()

This does not change the semantics of the query, as table functions are
always implicitly planned as if LATERAL was specified. This commit
teaches the parser to accept this syntax, but not to retain the
distinction.
This allows creating a new nested "scope" in SQL for lateral joins, as
required for e.g.

    FROM a, LATERAL (...) b

without going through a scalar expression scope first. (Most forms of
correlation in SQL occur in scalar position.)
This function is very much not-generalizable, and gets quite tricky with
LATERAL joins. Just inline it in its one call site.
Using the visit_mut helper makes it difficult (impossible?) to increase
`depth` for certain RelationExprs, which will be important for LATERAL
joins. Just manually recur rather than using the visit_mut helper.

There are intended to be no semantic changes in this commit.
A LATERAL join allows the right-hand side of the join to access columns
defined on the left-hand side of the join. A simple example from the
PostgreSQL docs is the query

    SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss

which is equivalent to:

    SELECT * FROM foo, bar WHERE bar.id = foo.bar_id;

The hope is that LATERAL joins will be useful for expressing "top-k
within a group", as in:

   SELECT * FROM
       (SELECT DISTINCT cat FROM foo) grp,
       JOIN LATERAL (SELECT * FROM foo WHERE foo.cat = grp.cat ORDER BY foo.val LIMIT $k)
@benesch
Copy link
Member Author

benesch commented Jul 27, 2020

Ok, I think this is in a place where this can merge. I adjusted the new branch planning to be very, very targeted; see here: d37fe57#diff-ce359fcdca08c6161d027c3fd75d8b39R633-R690

With this new approach, there are (actually!) zero planning regressions—at least as far as our tests can tell.

@frankmcsherry
Copy link
Contributor

Looks good to me!

@benesch benesch merged commit 210caae into MaterializeInc:main Jul 28, 2020
@benesch benesch deleted the lateral-join branch July 28, 2020 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants