Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generating temporary column (using window function) for filtering causes error in dbplyr 2.3.1 #1244

Closed
krprasangdas opened this issue Apr 12, 2023 · 3 comments · Fixed by #1240

Comments

@krprasangdas
Copy link

Generating temporary column (using window function) for filtering causes error in dbplyr version 2.3.1.
Dbplyr version 2.3.0 does not have this issue.
Dbplyr version 2.3.1 changelog mentions a breaking change which does not seem applicable to this, given its description, but this could be a side-effect of fix for issue #1103.

Below reprex uses dbplyr version 2.3.0 where query is constructed without issues.

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(sparklyr)
#> 
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:stats':
#> 
#>     filter

spark_disconnect_all()
#> [1] 0
sc <- spark_connect("local")

# Copy mpg dataset to spark
mpgs <- mpg %>% 
  copy_to(sc, ., "mpg", TRUE)

# Generating temporary column for filterting
mpgs %>% 
  group_by(manufacturer, model) %>% 
  arrange(desc(drv)) %>% 
  mutate(is1999 = first(year) == 1999) %>% 
  filter(is1999) %>% 
  select(-is1999) %>% 
  show_query
#> Warning: ORDER BY is ignored in subqueries without LIMIT
#> ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
#> <SQL>
#> SELECT
#>   `manufacturer`,
#>   `model`,
#>   `displ`,
#>   `year`,
#>   `cyl`,
#>   `trans`,
#>   `drv`,
#>   `cty`,
#>   `hwy`,
#>   `fl`,
#>   `class`
#> FROM (
#>   SELECT
#>     *,
#>     FIRST_VALUE(`year`) OVER (PARTITION BY `manufacturer`, `model` ORDER BY `drv` DESC) = 1999.0 AS `is1999`
#>   FROM `mpg`
#> ) `q01`
#> WHERE (`is1999`)

# Not generating temporary column for filtering
mpgs %>% 
  group_by(manufacturer, model) %>% 
  arrange(desc(drv)) %>% 
  filter(first(year) == 1999) %>% 
  show_query
#> Warning: ORDER BY is ignored in subqueries without LIMIT
#> ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
#> <SQL>
#> SELECT
#>   `manufacturer`,
#>   `model`,
#>   `displ`,
#>   `year`,
#>   `cyl`,
#>   `trans`,
#>   `drv`,
#>   `cty`,
#>   `hwy`,
#>   `fl`,
#>   `class`
#> FROM (
#>   SELECT
#>     *,
#>     FIRST_VALUE(`year`) OVER (PARTITION BY `manufacturer`, `model` ORDER BY `drv` DESC) AS `q02`
#>   FROM `mpg`
#> ) `q01`
#> WHERE (`q02` = 1999.0)

Created on 2023-04-12 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.0.2 (2020-06-22)
#>  os       Red Hat Enterprise Linux
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Chicago
#>  date     2023-04-12
#>  pandoc   2.19.2 @ /opt/revr/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  askpass       1.1     2019-01-13 [2] RSPM (R 4.0.2)
#>  assertthat    0.2.1   2019-03-21 [2] RSPM (R 4.0.2)
#>  base64enc     0.1-3   2015-07-28 [2] RSPM (R 4.0.2)
#>  blob          1.2.3   2022-04-10 [2] RSPM (R 4.0.2)
#>  cli           3.6.0   2023-01-09 [2] RSPM (R 4.0.2)
#>  colorspace    2.1-0   2023-01-23 [2] RSPM (R 4.0.2)
#>  config        0.3.1   2020-12-17 [2] RSPM (R 4.0.2)
#>  DBI           1.1.3   2022-06-18 [2] RSPM (R 4.0.2)
#>  dbplyr      * 2.3.0   2023-01-16 [2] RSPM (R 4.0.2)
#>  digest        0.6.31  2022-12-11 [2] RSPM (R 4.0.2)
#>  dplyr       * 1.1.0   2023-01-29 [2] RSPM (R 4.0.2)
#>  ellipsis      0.3.2   2021-04-29 [2] RSPM (R 4.0.2)
#>  evaluate      0.20    2023-01-17 [2] RSPM (R 4.0.2)
#>  fansi         1.0.4   2023-01-22 [2] RSPM (R 4.0.2)
#>  fastmap       1.1.0   2021-01-25 [2] RSPM (R 4.0.2)
#>  forge         0.2.0   2019-02-26 [2] RSPM (R 4.0.2)
#>  fs            1.6.1   2023-02-06 [2] RSPM (R 4.0.2)
#>  generics      0.1.3   2022-07-05 [2] RSPM (R 4.0.2)
#>  ggplot2     * 3.4.1   2023-02-10 [2] RSPM (R 4.0.2)
#>  glue          1.6.2   2022-02-24 [2] RSPM (R 4.0.2)
#>  gtable        0.3.1   2022-09-01 [2] RSPM (R 4.0.2)
#>  htmltools     0.5.4   2022-12-07 [2] RSPM (R 4.0.2)
#>  htmlwidgets   1.6.1   2023-01-07 [2] RSPM (R 4.0.2)
#>  httr          1.4.4   2022-08-17 [2] RSPM (R 4.0.2)
#>  jsonlite      1.8.4   2022-12-06 [2] RSPM (R 4.0.2)
#>  knitr         1.42    2023-01-25 [2] RSPM (R 4.0.2)
#>  lifecycle     1.0.3   2022-10-07 [2] RSPM (R 4.0.2)
#>  magrittr      2.0.3   2022-03-30 [2] RSPM (R 4.0.2)
#>  munsell       0.5.0   2018-06-12 [2] RSPM (R 4.0.2)
#>  openssl       2.0.5   2022-12-06 [2] RSPM (R 4.0.2)
#>  pillar        1.8.1   2022-08-19 [2] RSPM (R 4.0.2)
#>  pkgconfig     2.0.3   2019-09-22 [2] RSPM (R 4.0.2)
#>  purrr         1.0.1   2023-01-10 [2] RSPM (R 4.0.2)
#>  R.cache       0.16.0  2022-07-21 [2] RSPM (R 4.0.2)
#>  R.methodsS3   1.8.2   2022-06-13 [2] RSPM (R 4.0.2)
#>  R.oo          1.25.0  2022-06-12 [2] RSPM (R 4.0.2)
#>  R.utils       2.12.2  2022-11-11 [2] RSPM (R 4.0.2)
#>  r2d3          0.2.6   2022-02-28 [2] RSPM (R 4.0.2)
#>  R6            2.5.1   2021-08-19 [2] RSPM (R 4.0.2)
#>  reprex        2.0.2   2022-08-17 [2] RSPM (R 4.0.2)
#>  rlang         1.0.6   2022-09-24 [2] RSPM (R 4.0.2)
#>  rmarkdown     2.20    2023-01-19 [2] RSPM (R 4.0.2)
#>  rprojroot     2.0.3   2022-04-02 [2] RSPM (R 4.0.2)
#>  rstudioapi    0.14    2022-08-22 [2] RSPM (R 4.0.2)
#>  scales        1.2.1   2022-08-20 [2] RSPM (R 4.0.2)
#>  sessioninfo   1.2.2   2021-12-06 [2] RSPM (R 4.0.2)
#>  sparklyr    * 1.7.9   2022-12-08 [2] RSPM (R 4.0.2)
#>  styler        1.9.0   2023-01-15 [2] RSPM (R 4.0.2)
#>  tibble        3.1.8   2022-07-22 [2] RSPM (R 4.0.2)
#>  tidyr         1.3.0   2023-01-24 [2] RSPM (R 4.0.2)
#>  tidyselect    1.2.0   2022-10-10 [2] RSPM (R 4.0.2)
#>  utf8          1.2.3   2023-01-31 [2] RSPM (R 4.0.2)
#>  vctrs         0.5.2   2023-01-23 [2] RSPM (R 4.0.2)
#>  withr         2.5.0   2022-03-03 [2] RSPM (R 4.0.2)
#>  xfun          0.37    2023-01-31 [2] RSPM (R 4.0.2)
#>  yaml          2.3.7   2023-01-23 [2] RSPM (R 4.0.2)
#> 
#>  [1] /shared/nas/shape/bu_dsaid/transient/pkumard/R/x86_64-pc-linux-gnu-library/3.5
#>  [2] /opt/revr/library/4.0/2023-02-17/cran
#>  [3] /opt/revr/library/4.0/2023-02-17/bio
#>  [4] /opt/revr/library/4.0/2023-02-17/github
#>  [5] /opt/revr/library/4.0/2023-02-17/misc
#>  [6] /opt/revr/library/4.0/2023-02-17/tu
#>  [7] /opt/revr/ropen/4.0.2/lib64/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Below reprex uses dbplyr version 2.3.1 where query construction fails for the example with temporary column in it.

Below reprex also consists of an example with successful query construction which does not use the temporary column in it. For this case, the table alias and column alias are both q01. Should that be ok / is it an accepted practice? Wouldn't it interfere with understandability for someone reading the generated SQL? dbplyr version 2.3.0 uses q01 to denote table alias and q02 to denote column alias for the same example.

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(sparklyr)
#> 
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:stats':
#> 
#>     filter

spark_disconnect_all()
#> [1] 0
sc <- spark_connect("local")

# Copy mpg dataset to spark
mpgs <- mpg %>% 
  copy_to(sc, ., "mpg", TRUE)

# Generating temporary column for filterting
mpgs %>% 
  group_by(manufacturer, model) %>% 
  arrange(desc(drv)) %>% 
  mutate(is1999 = first(year) == 1999) %>% 
  filter(is1999) %>% 
  select(-is1999) %>% 
  show_query
#> Error in `purrr::map_chr()`:
#> ℹ In index: 1.
#> Caused by error in `as_string()`:
#> ! Can't convert a call to a string.
#> Backtrace:
#>      ▆
#>   1. ├─... %>% show_query
#>   2. ├─dplyr::show_query(.)
#>   3. ├─dplyr::select(., -is1999)
#>   4. ├─sparklyr:::select.tbl_spark(., -is1999)
#>   5. ├─base::NextMethod()
#>   6. └─dbplyr:::select.tbl_lazy(., -is1999)
#>   7.   └─dbplyr:::add_select(.data, new_vars)
#>   8.     └─dbplyr:::rename_order(lazy_query, vars)
#>   9.       └─purrr::map_chr(order, as_name)
#>  10.         └─purrr:::map_("character", .x, .f, ..., .progress = .progress)
#>  11.           ├─purrr:::with_indexed_errors(...)
#>  12.           │ └─base::withCallingHandlers(...)
#>  13.           ├─purrr:::call_with_cleanup(...)
#>  14.           └─rlang (local) .f(.x[[i]], ...)
#>  15.             └─rlang::as_string(x)
#>  16.               └─rlang:::abort_coercion(x, "a string")
#>  17.                 └─rlang::abort(msg, call = call)

# Not generating temporary column for filtering
mpgs %>% 
  group_by(manufacturer, model) %>% 
  arrange(desc(drv)) %>% 
  filter(first(year) == 1999) %>% 
  show_query
#> Warning: ORDER BY is ignored in subqueries without LIMIT
#> ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
#> <SQL>
#> SELECT
#>   `manufacturer`,
#>   `model`,
#>   `displ`,
#>   `year`,
#>   `cyl`,
#>   `trans`,
#>   `drv`,
#>   `cty`,
#>   `hwy`,
#>   `fl`,
#>   `class`
#> FROM (
#>   SELECT
#>     *,
#>     FIRST_VALUE(`year`) OVER (PARTITION BY `manufacturer`, `model` ORDER BY `drv` DESC) AS `q01`
#>   FROM `mpg`
#> ) `q01`
#> WHERE (`q01` = 1999.0)

Created on 2023-04-12 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.0.2 (2020-06-22)
#>  os       Red Hat Enterprise Linux
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Chicago
#>  date     2023-04-12
#>  pandoc   2.19.2 @ /opt/revr/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  askpass       1.1     2019-01-13 [2] RSPM (R 4.0.2)
#>  assertthat    0.2.1   2019-03-21 [2] RSPM (R 4.0.2)
#>  base64enc     0.1-3   2015-07-28 [2] RSPM (R 4.0.2)
#>  blob          1.2.3   2022-04-10 [2] RSPM (R 4.0.2)
#>  cli           3.6.0   2023-01-09 [2] RSPM (R 4.0.2)
#>  colorspace    2.1-0   2023-01-23 [2] RSPM (R 4.0.2)
#>  config        0.3.1   2020-12-17 [2] RSPM (R 4.0.2)
#>  DBI           1.1.3   2022-06-18 [2] RSPM (R 4.0.2)
#>  dbplyr      * 2.3.1   2023-02-24 [2] RSPM (R 4.0.2)
#>  digest        0.6.31  2022-12-11 [2] RSPM (R 4.0.2)
#>  dplyr       * 1.1.1   2023-03-22 [4] CRAN (R 4.0.2)
#>  ellipsis      0.3.2   2021-04-29 [2] RSPM (R 4.0.2)
#>  evaluate      0.20    2023-01-17 [2] RSPM (R 4.0.2)
#>  fansi         1.0.4   2023-01-22 [2] RSPM (R 4.0.2)
#>  fastmap       1.1.1   2023-02-24 [2] RSPM (R 4.0.2)
#>  forge         0.2.0   2019-02-26 [2] RSPM (R 4.0.2)
#>  fs            1.6.1   2023-02-06 [2] RSPM (R 4.0.2)
#>  generics      0.1.3   2022-07-05 [2] RSPM (R 4.0.2)
#>  ggplot2     * 3.4.1   2023-02-10 [2] RSPM (R 4.0.2)
#>  glue          1.6.2   2022-02-24 [2] RSPM (R 4.0.2)
#>  gtable        0.3.1   2022-09-01 [2] RSPM (R 4.0.2)
#>  htmltools     0.5.4   2022-12-07 [2] RSPM (R 4.0.2)
#>  htmlwidgets   1.6.1   2023-01-07 [2] RSPM (R 4.0.2)
#>  httr          1.4.5   2023-02-24 [2] RSPM (R 4.0.2)
#>  jsonlite      1.8.4   2022-12-06 [2] RSPM (R 4.0.2)
#>  knitr         1.42    2023-01-25 [2] RSPM (R 4.0.2)
#>  lifecycle     1.0.3   2022-10-07 [2] RSPM (R 4.0.2)
#>  magrittr      2.0.3   2022-03-30 [2] RSPM (R 4.0.2)
#>  munsell       0.5.0   2018-06-12 [2] RSPM (R 4.0.2)
#>  openssl       2.0.6   2023-03-09 [2] RSPM (R 4.0.2)
#>  pillar        1.8.1   2022-08-19 [2] RSPM (R 4.0.2)
#>  pkgconfig     2.0.3   2019-09-22 [2] RSPM (R 4.0.2)
#>  purrr         1.0.1   2023-01-10 [2] RSPM (R 4.0.2)
#>  R.cache       0.16.0  2022-07-21 [2] RSPM (R 4.0.2)
#>  R.methodsS3   1.8.2   2022-06-13 [2] RSPM (R 4.0.2)
#>  R.oo          1.25.0  2022-06-12 [2] RSPM (R 4.0.2)
#>  R.utils       2.12.2  2022-11-11 [2] RSPM (R 4.0.2)
#>  r2d3          0.2.6   2022-02-28 [2] RSPM (R 4.0.2)
#>  R6            2.5.1   2021-08-19 [2] RSPM (R 4.0.2)
#>  reprex        2.0.2   2022-08-17 [2] RSPM (R 4.0.2)
#>  rlang         1.1.0   2023-03-14 [2] RSPM (R 4.0.2)
#>  rmarkdown     2.20    2023-01-19 [2] RSPM (R 4.0.2)
#>  rprojroot     2.0.3   2022-04-02 [2] RSPM (R 4.0.2)
#>  rstudioapi    0.14    2022-08-22 [2] RSPM (R 4.0.2)
#>  scales        1.2.1   2022-08-20 [2] RSPM (R 4.0.2)
#>  sessioninfo   1.2.2   2021-12-06 [2] RSPM (R 4.0.2)
#>  sparklyr    * 1.8.0   2023-03-07 [2] RSPM (R 4.0.2)
#>  styler        1.9.1   2023-03-04 [2] RSPM (R 4.0.2)
#>  tibble        3.2.0   2023-03-08 [2] RSPM (R 4.0.2)
#>  tidyr         1.3.0   2023-01-24 [2] RSPM (R 4.0.2)
#>  tidyselect    1.2.0   2022-10-10 [2] RSPM (R 4.0.2)
#>  utf8          1.2.3   2023-01-31 [2] RSPM (R 4.0.2)
#>  vctrs         0.6.1   2023-03-22 [4] CRAN (R 4.0.2)
#>  withr         2.5.0   2022-03-03 [2] RSPM (R 4.0.2)
#>  xfun          0.37    2023-01-31 [2] RSPM (R 4.0.2)
#>  yaml          2.3.7   2023-01-23 [2] RSPM (R 4.0.2)
#> 
#>  [1] /opt/revr/library/4.0/2023-03-16/bio
#>  [2] /opt/revr/library/4.0/2023-03-16/cran
#>  [3] /opt/revr/library/4.0/2023-03-16/github
#>  [4] /opt/revr/library/4.0/2023-03-16/misc
#>  [5] /opt/revr/library/4.0/2023-03-16/tu
#>  [6] /opt/revr/library/4.0/2023-03-16/tu_test
#>  [7] /opt/revr/ropen/4.0.2/lib64/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@ejneer
Copy link
Contributor

ejneer commented Apr 13, 2023

This appears to be related to #1206. Your reprex is fixed in devtools::install_github("tidyverse/dbplyr#1240")

@krprasangdas
Copy link
Author

@mgirlich Has this fix been released? Or Is it yet to be released? How long before it is released? Thanks in advance.

@mgirlich
Copy link
Collaborator

Yes, this is fixed and we plan to release soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants