Use CTE #656

Closed
wants to merge 11 commits into from

@mgirlich mgirlich commented May 19, 2021

Closes #638.

I hacked together support for CTEs (common table expressions). Adding CTEs to dbplyr definitely seems doable. Together with custom join aliases, this would greatly improve the readability of the generated SQL.

There might be some limitations where CTEs cannot be used or where this implementation doesn't work. Therefore, I added a cte parameter so the user can decide whether to use CTEs. I'd propose not using CTEs by default (at least for now) and seeing whether things work fine. It might be worth using an option as the default value.
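As a rough illustration of the "option as default value" idea, the sketch below lets a cte argument fall back to a global option. The option name dbplyr.use_cte and both helper functions are hypothetical and are not part of this PR or of dbplyr.

```r
# Hypothetical option name; dbplyr does not define "dbplyr.use_cte" in this PR.
use_cte_default <- function() {
  getOption("dbplyr.use_cte", default = FALSE)
}

# Hypothetical stand-in for a render function: `cte` falls back to the option
# unless the caller sets it explicitly.
render_mode <- function(cte = use_cte_default()) {
  if (isTRUE(cte)) "cte" else "nested"
}

render_mode()            # uses the option, which is FALSE when unset
render_mode(cte = TRUE)  # an explicit argument overrides the option
```

With this shape, CTEs stay off by default, and a user could opt in globally via options(dbplyr.use_cte = TRUE) or per call.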

Some notes

  • The CTEs need to be rendered in the first (outermost) function that calls sql_render(), because join and set op queries "branch" into several subqueries.
  • In sql_render() for select_query, set_op_query, join_query, and semi_join_query, we need to register the subqueries and give them names. This is currently done by passing a list along.
  • How the four functions above learn about cte = TRUE is currently quite a hack. It would probably be nicer to pass cte explicitly, but I had the feeling that quite a lot of functions would need to know about it and pass it along. I'll check again later whether this is doable.
  • It would be nice to allow custom names for the CTEs, but this raises a couple of questions (we probably need more subqueries than the user anticipates; what if the user specifies a name but the next step could also be folded into the same SQL query; ...).
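
The registration step described in the notes above can be sketched in base R. This is only a minimal illustration of the idea, not the code in this PR: register_subquery() and render_with_ctes() are hypothetical helpers, and the table name and SQL strings are made up for the example.

```r
# Hypothetical helpers; none of these are dbplyr functions.

# Register a subquery and return the updated registry plus the name to use.
# Identical SQL text reuses the existing name, so a shared subquery is
# emitted only once.
register_subquery <- function(registry, sql) {
  hit <- match(sql, unlist(registry, use.names = FALSE))
  if (!is.na(hit)) {
    return(list(registry = registry, name = names(registry)[[hit]]))
  }
  name <- sprintf("q%02d", length(registry) + 1L)
  registry[[name]] <- sql
  list(registry = registry, name = name)
}

# Prepend the WITH clause; this happens once, in the outermost render call.
render_with_ctes <- function(registry, final_sql) {
  if (length(registry) == 0L) {
    return(final_sql)
  }
  ctes <- sprintf("`%s` AS (\n%s\n)", names(registry), unlist(registry))
  paste0("WITH ", paste(ctes, collapse = ",\n"), "\n", final_sql)
}

r1 <- register_subquery(
  list(),
  "SELECT `dest`, COUNT(*) AS `count`\nFROM `flights`\nGROUP BY `dest`"
)
filtered <- sprintf("SELECT *\nFROM `%s`\nWHERE (`count` > 20.0)", r1$name)
r2 <- register_subquery(r1$registry, filtered)
# The second branch of a union registers the same SQL and gets the same name.
r3 <- register_subquery(r2$registry, filtered)

sql <- render_with_ctes(
  r3$registry,
  sprintf("SELECT *\nFROM `%s`\nUNION\nSELECT *\nFROM `%s`", r2$name, r3$name)
)
cat(sql, "\n")
```

Threading the registry through return values keeps the sketch free of global state; the PR itself passes a list through the render calls, and the details there may differ.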
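
devtools::load_all("~/GitHub/dbplyr/")  # development version of dbplyr from this branch
library(dplyr)  # provides %>% and the verbs used below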

flights_db <- tbl_memdb(nycflights13::flights)
airports_db <- tbl_memdb(nycflights13::airports)

Example for select query

select_query <- flights_db %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(delay, count > 20, dest != "HNL")
-- show_query(select_query, cte = TRUE)
WITH `q01` AS (
SELECT `dest`, COUNT(*) AS `count`, AVG(`distance`) AS `dist`, AVG(`arr_delay`) AS `delay`
FROM `nycflights13::flights`
GROUP BY `dest`
)
SELECT *
FROM `q01`
WHERE ((`delay`) AND (`count` > 20.0) AND (`dest` != 'HNL'))

-- show_query(select_query, cte = FALSE)
SELECT *
FROM (SELECT `dest`, COUNT(*) AS `count`, AVG(`distance`) AS `dist`, AVG(`arr_delay`) AS `delay`
FROM `nycflights13::flights`
GROUP BY `dest`)
WHERE ((`delay`) AND (`count` > 20.0) AND (`dest` != 'HNL'))

Example for join query

join_query <- left_join(
  select_query,
  airports_db %>% 
    select(faa, name, tzone),
  by = c(dest = "faa")
)
-- show_query(join_query, cte = TRUE)
<SQL>
WITH `q01` AS (
SELECT `dest`, COUNT(*) AS `count`, AVG(`distance`) AS `dist`, AVG(`arr_delay`) AS `delay`
FROM `nycflights13::flights`
GROUP BY `dest`
),
`q02` AS (
SELECT *
FROM `q01`
WHERE ((`delay`) AND (`count` > 20.0) AND (`dest` != 'HNL'))
),
`q03` AS (
SELECT `faa`, `name`, `tzone`
FROM `nycflights13::airports`
)
SELECT `dest`, `count`, `dist`, `delay`, `name`, `tzone`
FROM `q02` AS `LHS`
LEFT JOIN `q03` AS `RHS`
ON (`LHS`.`dest` = `RHS`.`faa`)

-- show_query(join_query, cte = FALSE)
<SQL>
SELECT `dest`, `count`, `dist`, `delay`, `name`, `tzone`
FROM (SELECT *
FROM (SELECT `dest`, COUNT(*) AS `count`, AVG(`distance`) AS `dist`, AVG(`arr_delay`) AS `delay`
FROM `nycflights13::flights`
GROUP BY `dest`)
WHERE ((`delay`) AND (`count` > 20.0) AND (`dest` != 'HNL'))) AS `LHS`
LEFT JOIN (SELECT `faa`, `name`, `tzone`
FROM `nycflights13::airports`) AS `RHS`
ON (`LHS`.`dest` = `RHS`.`faa`)

Reuse CTE

union(
  select_query,
  select_query
) %>% 
  show_query(cte = TRUE)
WITH `q01` AS (
SELECT `dest`, COUNT(*) AS `count`, AVG(`distance`) AS `dist`, AVG(`arr_delay`) AS `delay`
FROM `nycflights13::flights`
GROUP BY `dest`
),
`q02` AS (
SELECT *
FROM `q01`
WHERE ((`delay`) AND (`count` > 20.0) AND (`dest` != 'HNL'))
)
SELECT *
FROM `q02`
UNION
SELECT *
FROM `q02`

@mgirlich mgirlich requested a review from krlmlr July 30, 2021 11:56
@mgirlich mgirlich marked this pull request as ready for review August 13, 2021 12:05
Member

@krlmlr krlmlr left a comment

I like the simplicity of this approach and the very small size of the patch. The query_list argument is a bit "magical" -- its type determines if CTEs are used.

Could sql_render() return an intermediate data structure that is identical for the CTE and non-CTE case? Perhaps a query container that combines raw SQL, subqueries, and literal values. The final composition would then either embed the subqueries or use them as CTEs, and also either embed the literals inline or use placeholders and query parameters.
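
One possible reading of this suggestion, sketched in base R under several assumptions: the container layout, the {name} and {?name} tokens, and both functions below are hypothetical and do not exist in dbplyr. The sketch handles only a single level of subqueries and assumes literals are listed in the order in which they appear in the SQL.

```r
# Hypothetical container; the field names are illustrative, not dbplyr's.
new_query_container <- function(sql, subqueries = list(), literals = list()) {
  list(sql = sql, subqueries = subqueries, literals = literals)
}

# Final composition: embed subqueries or emit them as CTEs, and inline
# literals or replace them with "?" placeholders plus a parameter list.
compose_query <- function(x, cte = FALSE, placeholders = FALSE) {
  sql <- x$sql
  for (name in names(x$subqueries)) {
    token <- paste0("{", name, "}")
    ref <- if (cte) {
      paste0("`", name, "`")
    } else {
      paste0("(", x$subqueries[[name]], ")")
    }
    sql <- gsub(token, ref, sql, fixed = TRUE)
  }
  for (name in names(x$literals)) {
    token <- paste0("{?", name, "}")
    value <- x$literals[[name]]
    inline <- if (is.character(value)) {
      # Quote strings and double any embedded single quotes.
      paste0("'", gsub("'", "''", value, fixed = TRUE), "'")
    } else {
      format(value)
    }
    sql <- gsub(token, if (placeholders) "?" else inline, sql, fixed = TRUE)
  }
  if (cte && length(x$subqueries) > 0L) {
    ctes <- sprintf("`%s` AS (\n%s\n)", names(x$subqueries), unlist(x$subqueries))
    sql <- paste0("WITH ", paste(ctes, collapse = ",\n"), "\n", sql)
  }
  list(sql = sql, params = if (placeholders) unname(x$literals) else list())
}

q <- new_query_container(
  sql = "SELECT *\nFROM {q01}\nWHERE (`count` > {?min_count})",
  subqueries = list(
    q01 = "SELECT `dest`, COUNT(*) AS `count`\nFROM `flights`\nGROUP BY `dest`"
  ),
  literals = list(min_count = 20)
)

nested <- compose_query(q)
with_cte <- compose_query(q, cte = TRUE, placeholders = TRUE)
cat(with_cte$sql, "\n")
```

The same container yields both renderings, so the choice between nested subqueries and CTEs, and between inline literals and query parameters, is deferred to the final composition step.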

@mgirlich
Collaborator Author

> I like the simplicity of this approach and the very small size of the patch. The query_list argument is a bit "magical" -- its type determines if CTEs are used.

Yes, it is a bit magical, but this was the easiest way I could come up with to keep the changes small and avoid breaking changes.

> Could sql_render() return an intermediate data structure that is identical for the CTE and non-CTE case?

Strictly speaking, this would be a breaking change, but it should not affect much code. It would also be useful for further optimising the generated SQL (e.g. using SELECT, WHERE, ... directly in a join query, see #722).

> Perhaps a query container that combines raw SQL, subqueries, and literal values. The final composition would then either embed the subqueries or use them as CTEs, and also either embed the literals inline or use placeholders and query parameters.

I'm not sure I fully understand what you mean here. Can you give an example?

@mgirlich mgirlich mentioned this pull request Nov 24, 2021
@mgirlich mgirlich mentioned this pull request Dec 3, 2021
@mgirlich
Collaborator Author

mgirlich commented Dec 6, 2021

I implemented CTEs using the new lazy render pipeline in PR #729. It looks much nicer than this approach.

@mgirlich
Collaborator Author

mgirlich commented Mar 9, 2022

Closed in favour of #790.

@mgirlich mgirlich closed this Mar 9, 2022
@mgirlich mgirlich deleted the cte branch May 24, 2022 06:19
Development

Successfully merging this pull request may close these issues.

Improve readability of subqueries
2 participants