Fix the `str_flatten` function in Redshift #805

hdplsa · 2022-03-25T15:52:42Z

Fixes #804

This pull request adds a Redshift-specific translation for `str_flatten()`. Currently the default Postgres translation (`string_agg`) is used, which Redshift does not support. The Redshift equivalent is `LISTAGG`, which has a slightly different syntax: ordering requires a `WITHIN GROUP (ORDER BY ...)` clause.

The reprex below shows the expected output:

library(dbplyr)
library(DBI)
library(reprex)

con <- dbConnect(RPostgres::Redshift(), 
                 host = ,
                 dbname = , 
                 port = ,
                 user = , 
                 password = ) 

example_table <- dplyr::tribble(
  ~customer, ~day, ~item,
  "A", 1, "WATER",
  "A", 3, "BREAD",
  "A", 2, "JUICE",
  "B", 1, "APPLE",
  "B", 4, "BANANA",
  "C", 1, "MILK"
)

table_db <- dplyr::copy_to(con, example_table, temporary = TRUE)

table_db %>%
  dplyr::group_by(customer) %>%
  dplyr::summarize(flat_string = str_flatten(item, "-"))
#> # Source:   [?? x 2]
#> # Database: postgres
#> #   []
#>   customer flat_string      
#>   <chr>    <chr>            
#> 1 B        APPLE-BANANA     
#> 2 C        MILK             
#> 3 A        WATER-BREAD-JUICE

table_db %>%
  dplyr::group_by(customer) %>%
  dplyr::summarize(flat_string = str_flatten(item, "-")) %>% 
  dplyr::show_query()
#> <SQL>
#> SELECT "customer", LISTAGG("item", '-') AS "flat_string"
#> FROM "example_table"
#> GROUP BY "customer"

table_db %>%
  dplyr::group_by(customer) %>%
  dbplyr::window_order(day) %>%
  dplyr::mutate(flat_string = str_flatten(item, "-")) 
#> # Source:     [?? x 4]
#> # Database:   postgres
#> #   []
#> # Groups:     customer
#> # Ordered by: day
#>   customer   day item   flat_string      
#>   <chr>    <dbl> <chr>  <chr>            
#> 1 A            1 WATER  WATER-JUICE-BREAD
#> 2 A            2 JUICE  WATER-JUICE-BREAD
#> 3 A            3 BREAD  WATER-JUICE-BREAD
#> 4 C            1 MILK   MILK             
#> 5 B            1 APPLE  APPLE-BANANA     
#> 6 B            4 BANANA APPLE-BANANA

table_db %>%
  dplyr::group_by(customer) %>%
  dbplyr::window_order(day) %>%
  dplyr::mutate(flat_string = str_flatten(item, "-")) %>%
  dplyr::show_query()
#> <SQL>
#> SELECT
#>   "customer",
#>   "day",
#>   "item",
#>   LISTAGG("item", '-') WITHIN GROUP (ORDER BY "day") OVER (PARTITION BY "customer") AS "flat_string"
#> FROM "example_table"

Created on 2022-03-25 by the reprex package (v2.0.0)
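The reprex needs a live Redshift connection. As a rough sketch, the generated SQL can also be inspected offline with dbplyr's simulated backend. This assumes a dbplyr version that includes this change; the column names simply mirror the example table, and the exact quoting and layout of the rendered SQL may differ between versions.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# A lazy table backed by a simulated Redshift connection: no database is
# contacted, only SQL generation is exercised.
lf <- lazy_frame(
  customer = "A", day = 1, item = "WATER",
  con = simulate_redshift()
)

# Aggregate translation (summarise)
agg_sql <- lf %>%
  group_by(customer) %>%
  summarise(flat_string = str_flatten(item, "-")) %>%
  sql_render()

# Window translation (mutate) with an explicit ordering
win_sql <- lf %>%
  group_by(customer) %>%
  window_order(day) %>%
  mutate(flat_string = str_flatten(item, "-")) %>%
  sql_render()

print(agg_sql)
print(win_sql)
```

With the change in place, both queries should use `LISTAGG`, and the windowed one should carry the `WITHIN GROUP (ORDER BY ...)` clause.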

hadley · 2022-03-31T19:56:31Z

R/backend-redshift.R

+        order <- win_current_order()
+        if(length(order) > 0){
+          sql <- build_sql(sql_expr(LISTAGG(!!x, !!collapse)),
+                           " WITHIN GROUP (ORDER BY ", order, ")")


Are there other functions in Redshift that use this syntax? I'm surprised that (say) a windowed mean() doesn't need the same syntax. Or is LISTAGG() the same sort of function as PERCENTILE_DISC()?

hdplsa · 2022-04-01

From searching the documentation, only the LISTAGG(), PERCENTILE_DISC(), PERCENTILE_CONT(), and ST_COLLECT() functions in Redshift require the WITHIN GROUP clause for ordering. The remaining window functions use the more general OVER (PARTITION BY ... ORDER BY ...) clause.

Unfortunately, I cannot explain why LISTAGG() in particular uses the WITHIN GROUP syntax, but the same syntax appears in other database systems (e.g. Oracle).
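To make the difference between the two clause shapes concrete, here is a minimal base-R sketch that assembles both forms as strings. `listagg_sql()` and `window_sql()` are hypothetical helpers written for illustration only; they are not dbplyr's implementation, and they assume their inputs are already-quoted SQL fragments.

```r
# Hypothetical helper, for illustration only; not dbplyr's implementation.
# `column`, `order_by`, and `partition_by` are assumed to be already-quoted
# SQL fragments; `collapse` is not escaped here.
listagg_sql <- function(column, collapse, order_by = NULL, partition_by = NULL) {
  sql <- sprintf("LISTAGG(%s, '%s')", column, collapse)
  # LISTAGG takes its ordering inside WITHIN GROUP, not inside OVER
  if (!is.null(order_by)) {
    sql <- paste0(sql, " WITHIN GROUP (ORDER BY ", order_by, ")")
  }
  # The window clause then carries only the partitioning
  if (!is.null(partition_by)) {
    sql <- paste0(sql, " OVER (PARTITION BY ", partition_by, ")")
  }
  sql
}

# A general window function, by contrast, orders inside OVER
window_sql <- function(call, partition_by, order_by) {
  sprintf("%s OVER (PARTITION BY %s ORDER BY %s)", call, partition_by, order_by)
}

listagg_example <- listagg_sql('"item"', "-", order_by = '"day"', partition_by = '"customer"')
window_example <- window_sql('SUM("day")', '"customer"', '"day"')

print(listagg_example)
print(window_example)
```

The first string has the same shape as the windowed query in the reprex; the second shows where the ordering goes for an ordinary window function.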

hadley · 2022-04-01

Thanks for the investigation! Unfortunately there seem to be few satisfying explanations for why SQL things are the way they are.

hadley

Can you please add a bullet to the top of NEWS.md? It should briefly describe the change and end with (@yourname, #issuenumber).

hdplsa · 2022-04-01T12:28:01Z

Just added the bullet to the NEWS.md. Thanks @hadley.

hadley · 2022-04-01T13:34:21Z

Thanks!

Fix the str_flatten function in Redshift

2ee1e93

hadley reviewed Mar 31, 2022

hadley approved these changes Mar 31, 2022

Update the NEWS file

8a00e9f

hdplsa and others added 2 commits April 1, 2022 14:29

Merge branch 'main' into master

c7c56e8

Add LISTAGG to the global variables

9ec9984

hadley merged commit cdadbde into tidyverse:main Apr 1, 2022
