Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: single-use common table expression support #20359

Merged
merged 1 commit into from
Dec 28, 2017

Conversation

jordanlewis
Copy link
Member

@jordanlewis jordanlewis commented Nov 30, 2017

Implement support for simple non-recursive common table expressions.

Referring to a single table expression more than once is not yet
supported.

Fixes #7029.

@jordanlewis jordanlewis requested review from justinj, knz and a team November 30, 2017 19:54
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz
Copy link
Contributor

knz commented Nov 30, 2017

The name resolution technique is a bit too ad-hoc for my taste but I'm willing to go along with it for a first iteration.

Please however refrain from storing the logical plan in the naming scope. Store the AST query instead. then re-plan the AST every time the name is used. This is the way every other engine does it, and it gives way to all sorts of optimizations. (And it lifts the "restriction" (bug) in your current approach - that the name can only be used once.)


Reviewed 15 of 15 files at r1.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@knz
Copy link
Contributor

knz commented Nov 30, 2017

of course there's a trick in there which is that whenever you resolve a name when replanning the CTE AST, you need to look up that name in the scope wher ethe CTE was defined, not in the scope where the CTE name is used.


Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@jordanlewis
Copy link
Member Author

TFTR!

It's not correct to re-plan the AST every time the name is used, unfortunately. The contract around CTEs is that every CTE clause gets evaluated exactly once. Take for example

SELECT * FROM z WHERE a IN x

Or something like that - if you had more than one reference to x you would need to ensure it's only evaluated once.

The reason I left that restriction in is that supporting this in general requires some way of storing temporary tables.


Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@jordanlewis
Copy link
Member Author

Oops my comment didn't render correctly. I meant:

WITH x AS (INSERT INTO t VALUES(1))
SELECT * FROM z WHERE a IN x

Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@BramGruneir
Copy link
Member

BramGruneir commented Nov 30, 2017

CTEs are a fairly large feature. Do they not warrant an RFC? At least to pare down the scope.

Also does this really fully fix #7029? Unless we fully support CTEs, I don't think we should close that issue.

@jordanlewis
Copy link
Member Author

I figured I would open more issues if this gets merged - specifically to support WITH RECURSIVE and temporary tables in CTEs.

I don't think this requires an RFC, at least without WITH RECURSIVE. It's a lot more straightforward than it seems from the outside.

@knz
Copy link
Contributor

knz commented Nov 30, 2017

Read-only CTEs can be replanned every time. They are also the most common case

Write queries in CTEs require temporary tables. Your approach is not sufficient.

I kinda agree a RFC would be desirable. But we can support read-only, non-recursive CTEs using the approach I suggested.


Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@jordanlewis
Copy link
Member Author

Why do write queries in CTEs require temporary tables?

@knz
Copy link
Contributor

knz commented Nov 30, 2017

Because the amount of data can be arbitrarily large!

@jordanlewis
Copy link
Member Author

The amount of data where? Returned by an insert in an AS clause? Returned by a select in the AS clause of an insert? I don't follow - unless I'm missing something, we're either already vulnerable to a problem caused by what you're suggesting or there's really nothing wrong here.

Perhaps you could use an example to illustrate your point?

@knz
Copy link
Contributor

knz commented Nov 30, 2017

All right so the situation is as follows:

  1. the most common case is read-only CTEs
  2. as common is the case where a CTE is used multiple times in the remainder of the query.
  3. in the relatively less common case where INSERT/UPDATE are used as CTE (with RETURNING), then the most common case is when they are used multiple times in the remainder of the query.
  4. the case where both a) data-modifying CTEs are used and b) they are only used once is rare and, I posit, not worth supporting at this time.

Meanwhile:

  1. currently our [...] syntax does not allow the result of an INSERT/DELETE/UPDATE to be used multiple times, and therefore it can fit the "streaming model" and does not require accumulating the rows in memory, whereas
  2. trying to support data-modifying CTEs that can be used multiple times will require accumulating the updated/inserted rows in memory, which is not tractable in the general case (There can be an arbitrary amount of modified rows)

So my position is that I do not want to think about situation 6, and I think you're mis-prioritizing by looking at point 4, and to some lesser extent point 3.

If we want to make a difference we need a solution for the most common case - read-only CTEs used in multiple positions. IF that makes your (our) life more difficult for data-modifying CTEs, I don't care (yet).

@jordanlewis
Copy link
Member Author

Thanks for laying that out! I agree that we shouldn't think about situation 6 - temporary tables are way out of scope for this effort.

I disagree with your prioritization - do you have evidence for the frequency of use of these different query types? I think it's likely that queries that that use INSERT ... RETURNING clauses just once are more common than those that use them more than once.

In any case, for now I will modify the PR to plan each query where it's used. Banning write clauses is just as difficult as permitting their use only once, though, so I'll opt for that instead.

@knz
Copy link
Contributor

knz commented Dec 1, 2017

The argument about one time use vs multiple times.

Just recall what the acronym CTE stands for: common table expressions.

This feature was developed starting from the observation that some table expressions are used multiple times and therefore that there was a need for DRY. So yes it's easy to argue that WITH is not used unless there are multiple points of use.

@knz
Copy link
Contributor

knz commented Dec 1, 2017

All right so to recap our other discussion offline:

  • your initial argument is that the case of one-use CTE is common enough (e.g. the issue linked) to warrant that partial support
  • as we both found out, if the SQL standard indeed mandates referential transparency between all uses of a CTEs, this mandates temporary tables (materialization) even for read-only queries because in the general case a SELECT can use non-deterministic functions.

In other words, if there is a possibility to re-plan separate at every point of reuse, this can only be an optimization that kicks in after we have a proof that all possible executions will produce the same rows and that either the ordering is fixed by ORDER BY or all the points of use are order-agnostic (e.g. EXISTS, IN etc). Without this proof then using separate plans for separate uses might not only be inefficient, it is likely simply incorrect.

@knz
Copy link
Contributor

knz commented Dec 1, 2017

So you can dismiss my initial reservations, but please capture the different aspects of this discussion in the commit message. The rationale for the limited implementation must be clear in the proposed release note, and also indicated as a recommended doc scope for the person who will document/explain the partial feature. Think about e.g. Robert and how he will have to explain our partial support to a potential customer.

@BramGruneir
Copy link
Member

BramGruneir commented Dec 1, 2017 via email

@jordanlewis
Copy link
Member Author

Fair enough!

@jordanlewis
Copy link
Member Author

RFC available in #20374.

@jordanlewis jordanlewis changed the title [WIP] sql: simple common table expression support sql: single-use common table expression support Dec 21, 2017
@jordanlewis jordanlewis requested a review from a team as a code owner December 21, 2017 20:44
@jordanlewis
Copy link
Member Author

PTAL.

@knz
Copy link
Contributor

knz commented Dec 27, 2017

I had a more thorough look at the code in light of the RFC. The code does what it says on the label but I have a few minor suggestions for improvement.

A larger concern now is whether the name resolution code is good as-is. The "planDataSource" logic was introduced while I was handling joins but it really belongs to this new scoping data structure you are introducing here. I wonder how much of this code can be merged together / simplified accordingly.


Reviewed 19 of 19 files at r2.
Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


pkg/sql/with.go, line 18 at r2 (raw file):

import (
	"golang.org/x/net/context"

I think you need to revert this.

(Also, rebase and check with the linter.)


pkg/sql/with.go, line 24 at r2 (raw file):

)

type cteNameEnvironment []cteNameEnvironmentFrame

Add comments throughout to explain the various data structures and functions.

Also add a top-level comment to explain how you manage name scoping, and how the magic works with the resetter function.


pkg/sql/with.go, line 59 at r2 (raw file):

			frame[cte.Name.Alias] = cteSource{plan: ctePlan, alias: cte.Name}
		}
		return func() { p.cteNameEnvironment = p.cteNameEnvironment.pop() }, nil

If you make this func take the planner as parameter, you can define it globally, and thus make it static and avoid a heap allocation.


pkg/sql/sem/tree/with.go, line 43 at r2 (raw file):

		}
		cte.Name.Format(buf, f)
		buf.WriteString(" AS (\n")

This would be the first time we use newline characters during pretty-printing. 1) Are you comfortable doing so? Are there no places where we assume pretty-printed queries fit on a single line? 2) is this necessary?


Comments from Reviewable

Implement support for single-use non-recursive common table expressions.

Referring to a single table expression more than once is not yet
supported.

Release note (sql): add support for single-use common table expressions
@jordanlewis
Copy link
Member Author

Rebased and responded to comments. I'm not sure about changing the name resolution completely. plan data sources are also used in expressions with statement sources as i'm sure you're aware (from [...] syntax).

It's possible that the next time we'll need this kind of naming environment is in correlated subqueries, which I think is the main other example of naming scopes in SQL statements (right)? Maybe aliases too, actually.


Review status: 10 of 23 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


pkg/sql/with.go, line 18 at r2 (raw file):

Previously, knz (kena) wrote…

I think you need to revert this.

(Also, rebase and check with the linter.)

Done.


pkg/sql/with.go, line 24 at r2 (raw file):

Previously, knz (kena) wrote…

Add comments throughout to explain the various data structures and functions.

Also add a top-level comment to explain how you manage name scoping, and how the magic works with the resetter function.

Done.


pkg/sql/with.go, line 59 at r2 (raw file):

Previously, knz (kena) wrote…

If you make this func take the planner as parameter, you can define it globally, and thus make it static and avoid a heap allocation.

Done.


pkg/sql/sem/tree/with.go, line 43 at r2 (raw file):

Previously, knz (kena) wrote…

This would be the first time we use newline characters during pretty-printing. 1) Are you comfortable doing so? Are there no places where we assume pretty-printed queries fit on a single line? 2) is this necessary?

Didn't realize. Removed the newlines.


Comments from Reviewable

@petermattis
Copy link
Collaborator

It's possible that the next time we'll need this kind of naming environment is in correlated subqueries, which I think is the main other example of naming scopes in SQL statements (right)? Maybe aliases too, actually.

I'd be hesitant to change the name resolution code significantly at this time due to the upcoming planner work. See scope in opttoy for how it handles name resolution (properly?) for correlated subqueries. In particular, each scope has a parent scope and we have to walk through the list looking for the name being resolved.

@jordanlewis
Copy link
Member Author

I'd be hesitant to change the name resolution code significantly at this time due to the upcoming planner work.

SGTM, let's merge as is then 😁

@knz
Copy link
Contributor

knz commented Dec 28, 2017

Yes I agree with peter that it is too early to have that conversation - I wasn't suggesting to do anything about it in this PR.

The architecture Peter and Radu are pushing forward will provide a new context to think better about name resolution, even without considering "opttoy". Let's familiarize ourselves with this context before we resume the conversation. However I do want to point out that we need a single solution to satisfy our various naming scope requirements. If the code Peter and Radu cannot be used about this in the coming 6 months, we'll still want to do something about it here.


Reviewed 12 of 13 files at r3.
Review status: 22 of 23 files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@knz
Copy link
Contributor

knz commented Dec 28, 2017

:lgtm:


Reviewed 1 of 13 files at r3.
Review status: :shipit: all files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@jordanlewis
Copy link
Member Author

Although I'll note that the naming environments for CTEs are slightly different from those for correlated subqueries - correlated subqueries need to keep track of columns and their sources, whereas CTEs need to keep track of sources only. This might be a small enough difference that the same environment could be used, but it's worth pointing out.

@jordanlewis
Copy link
Member Author

TFTRs!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants