Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the str_flatten function in Redshift #805

Merged
merged 4 commits into from
Apr 1, 2022
Merged

Conversation

hdplsa
Copy link
Contributor

@hdplsa hdplsa commented Mar 25, 2022

Fixes #804

This pull request adds a Redshift specific translation to str_flatten. Currently the default Postgres translation (string_agg) is used, but it is not supported in Redshift. The equivalent function in Redshift is LISTAGG but it has a slightly different syntax (requires WITHIN GROUP (ORDER BY ...) syntax for ordering).

Please check below the reprex with the expected output:

library(dbplyr)
library(DBI)
library(reprex)

con <- dbConnect(RPostgres::Redshift(), 
                 host = ,
                 dbname = , 
                 port = ,
                 user = , 
                 password = ) 

example_table <- dplyr::tribble(
  ~customer, ~day, ~item,
  "A", 1, "WATER",
  "A", 3, "BREAD",
  "A", 2, "JUICE",
  "B", 1, "APPLE",
  "B", 4, "BANANA",
  "C", 1, "MILK"
)

table_db <- dplyr::copy_to(con, example_table, temporary = T)

table_db %>%
  dplyr::group_by(customer) %>%
  dplyr::summarize(flat_string = str_flatten(item, "-"))
#> # Source:   [?? x 2]
#> # Database: postgres
#> #   []
#>   customer flat_string      
#>   <chr>    <chr>            
#> 1 B        APPLE-BANANA     
#> 2 C        MILK             
#> 3 A        WATER-BREAD-JUICE

table_db %>%
  dplyr::group_by(customer) %>%
  dplyr::summarize(flat_string = str_flatten(item, "-")) %>% 
  dplyr::show_query()
#> <SQL>
#> SELECT "customer", LISTAGG("item", '-') AS "flat_string"
#> FROM "example_table"
#> GROUP BY "customer"

table_db %>%
  dplyr::group_by(customer) %>%
  dbplyr::window_order(day) %>%
  dplyr::mutate(flat_string = str_flatten(item, "-")) 
#> # Source:     [?? x 4]
#> # Database:   postgres
#> #   []
#> # Groups:     customer
#> # Ordered by: day
#>   customer   day item   flat_string      
#>   <chr>    <dbl> <chr>  <chr>            
#> 1 A            1 WATER  WATER-JUICE-BREAD
#> 2 A            2 JUICE  WATER-JUICE-BREAD
#> 3 A            3 BREAD  WATER-JUICE-BREAD
#> 4 C            1 MILK   MILK             
#> 5 B            1 APPLE  APPLE-BANANA     
#> 6 B            4 BANANA APPLE-BANANA

table_db %>%
  dplyr::group_by(customer) %>%
  dbplyr::window_order(day) %>%
  dplyr::mutate(flat_string = str_flatten(item, "-")) %>%
  dplyr::show_query()
#> <SQL
#> SELECT
#>   "customer",
#>   "day",
#>   "item",
#>   LISTAGG("item", '-') WITHIN GROUP (ORDER BY "day") OVER (PARTITION BY "customer") AS "flat_string"
#> FROM "example_table"

Created on 2022-03-25 by the reprex package (v2.0.0)

order <- win_current_order()
if(length(order) > 0){
sql <- build_sql(sql_expr(LISTAGG(!!x, !!collapse)),
" WITHIN GROUP (ORDER BY ", order, ")")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other functions in redshift that use this syntax? I'm surprised that (say) windowed mean() doesn't need the same syntax? Or is LISTAGG() the same sort of function as PERCENTILE_DISC()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From searching the documentation only the LISTAGG(), PERCENTILE_DISC(), PERCENTILE_CONT(), and ST_COLLECT() functions in redshift require the WITHIN GROUP clause for the ordering. The remaining windowed functions use the more general OVER (PARTITION BY ... ORDER BY ...) clause.

Unfortunately, I cannot explain why LISTAGG() in particular uses the WITHIN GROUP syntax. But this syntax seems to be shared by other database types (e.g. Oracle).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the investigation! Unfortunately there seem to be few satisfying explanation for why SQL things are the way they are.

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a bullet to the top of NEWS.md? It should briefly describe the change and end with (@yourname, #issuenumber).

@hdplsa
Copy link
Contributor Author

hdplsa commented Apr 1, 2022

Just added the bullet to the NEWS.md. Thanks @hadley.

@hadley hadley merged commit cdadbde into tidyverse:main Apr 1, 2022
@hadley
Copy link
Member

hadley commented Apr 1, 2022

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

str_flatten does not work in Redshift
2 participants