rowwise mutate across with c_across computes incorrectly for some but not all the columns (worked in dplyr 1.0.3 but not in 1.0.4) #5770

ahcyip · 2021-02-18T10:41:54Z

I was trying to compute proportions using rowwise, mutate, across, and c_across, but I'm only getting expected results in the female column and not in the male or NA columns.

#install.packages("palmerpenguins")
library(palmerpenguins); library(tidyverse)
(tbl1 <- penguins %>%
  group_by(species, island, sex) %>%
  summarize(n = n()) %>%
  pivot_wider(names_from = sex, values_from = n))
#> `summarise()` has grouped output by 'species', 'island'. You can override using the `.groups` argument.
#> # A tibble: 5 x 5
#> # Groups:   species, island [5]
#>   species   island    female  male  `NA`
#>   <fct>     <fct>      <int> <int> <int>
#> 1 Adelie    Biscoe        22    22    NA
#> 2 Adelie    Dream         27    28     1
#> 3 Adelie    Torgersen     24    23     5
#> 4 Chinstrap Dream         34    34    NA
#> 5 Gentoo    Biscoe        58    61     5
(tbl2 <- tbl1 %>%
  ungroup() %>%
  rowwise(species, island) %>%
  mutate(across(everything(), ~ . / sum(c_across(everything()), na.rm = T))))
#> # A tibble: 5 x 5
#> # Rowwise:  species, island
#>   species   island    female  male   `NA`
#>   <fct>     <fct>      <dbl> <dbl>  <dbl>
#> 1 Adelie    Biscoe     0.5   0.978 NA    
#> 2 Adelie    Dream      0.482 0.950  0.411
#> 3 Adelie    Torgersen  0.462 0.808  0.797
#> 4 Chinstrap Dream      0.5   0.986 NA    
#> 5 Gentoo    Biscoe     0.468 0.918  0.783
(tbl3 <- tbl1 %>%
  ungroup() %>%
  rowwise(species, island) %>%
  mutate(across(female:male, ~ . / sum(c_across(female:male), na.rm = T))))
#> # A tibble: 5 x 5
#> # Rowwise:  species, island
#>   species   island    female  male  `NA`
#>   <fct>     <fct>      <dbl> <dbl> <int>
#> 1 Adelie    Biscoe     0.5   0.978    NA
#> 2 Adelie    Dream      0.491 0.983     1
#> 3 Adelie    Torgersen  0.511 0.978     5
#> 4 Chinstrap Dream      0.5   0.986    NA
#> 5 Gentoo    Biscoe     0.487 0.992     5

^{Created on 2021-02-18 by the reprex package (v1.0.0)}

Session info

sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] forcats_0.5.1        stringr_1.4.0        dplyr_1.0.4         
#>  [4] purrr_0.3.4          readr_1.4.0          tidyr_1.1.2         
#>  [7] tibble_3.0.6         ggplot2_3.3.3        tidyverse_1.3.0     
#> [10] palmerpenguins_0.1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.6        cellranger_1.1.0  compiler_4.0.4    pillar_1.4.7     
#>  [5] dbplyr_2.1.0      highr_0.8         tools_4.0.4       digest_0.6.27    
#>  [9] lubridate_1.7.9.2 jsonlite_1.7.2    evaluate_0.14     lifecycle_1.0.0  
#> [13] gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.10      reprex_1.0.0     
#> [17] cli_2.3.0         DBI_1.1.1         rstudioapi_0.13   yaml_2.2.1       
#> [21] haven_2.3.1       xfun_0.20         withr_2.4.1       xml2_1.3.2       
#> [25] httr_1.4.2        knitr_1.31        hms_1.0.0         generics_0.1.0   
#> [29] fs_1.5.0          vctrs_0.3.6       grid_4.0.4        tidyselect_1.1.0 
#> [33] glue_1.4.2        R6_2.5.0          fansi_0.4.2       readxl_1.3.1     
#> [37] rmarkdown_2.6     modelr_0.1.8      magrittr_2.0.1    scales_1.1.1     
#> [41] backports_1.2.1   ps_1.5.0          htmltools_0.5.1.1 ellipsis_0.3.1   
#> [45] rvest_0.3.6       assertthat_0.2.1  colorspace_2.0-0  utf8_1.1.4       
#> [49] stringi_1.5.3     munsell_0.5.0     broom_0.7.4       crayon_1.4.1

The text was updated successfully, but these errors were encountered:

ahcyip · 2021-02-18T10:45:52Z

Expected result for tbl2:
0.5 0.5 NA
0.482 0.5, 0.018
etc.

ahcyip · 2021-02-18T10:56:54Z

Worked as expected in dplyr 1.0.2

ahcyip · 2021-02-18T11:03:26Z

I just tried dplyr 1.0.4.9000 and it still appears to be broken. And I tried dplyr 1.0.3 and it worked fine there too.

hadley · 2021-02-18T12:26:33Z

I suspect the problem is this: for across(everything(), ~ . / sum(c_across(everything()) to work, you are relying critically on the c_across() to be evaluated once before each of the individual divisions take place. That seems like a dangerous assumption so I’d suggest computing the sum first.

romainfrancois · 2021-02-18T13:20:23Z

🤔 this looks like a bug in the instrumentation of across() with the internal top_across(), if we disable it with e.g. + we get expected result:

library(palmerpenguins); library(tidyverse)
(tbl1 <- penguins %>%
    group_by(species, island, sex) %>%
    summarize(n = n()) %>%
    pivot_wider(names_from = sex, values_from = n))
#> `summarise()` has grouped output by 'species', 'island'. You can override using the `.groups` argument.
#> # A tibble: 5 x 5
#> # Groups:   species, island [5]
#>   species   island    female  male  `NA`
#>   <fct>     <fct>      <int> <int> <int>
#> 1 Adelie    Biscoe        22    22    NA
#> 2 Adelie    Dream         27    28     1
#> 3 Adelie    Torgersen     24    23     5
#> 4 Chinstrap Dream         34    34    NA
#> 5 Gentoo    Biscoe        58    61     5

tbl1 %>%
  ungroup() %>%
  rowwise(species, island) %>%
  mutate(across(everything(), ~ . / sum(c_across(everything()), na.rm = T)))
#> # A tibble: 5 x 5
#> # Rowwise:  species, island
#>   species   island    female  male   `NA`
#>   <fct>     <fct>      <dbl> <dbl>  <dbl>
#> 1 Adelie    Biscoe     0.5   0.978 NA    
#> 2 Adelie    Dream      0.482 0.950  0.411
#> 3 Adelie    Torgersen  0.462 0.808  0.797
#> 4 Chinstrap Dream      0.5   0.986 NA    
#> 5 Gentoo    Biscoe     0.468 0.918  0.783

tbl1 %>%
  ungroup() %>%
  rowwise(species, island) %>%
  mutate(+across(everything(), ~ . / sum(c_across(everything()), na.rm = T)))
#> # A tibble: 5 x 5
#> # Rowwise:  species, island
#>   species   island    female  male    `NA`
#>   <fct>     <fct>      <dbl> <dbl>   <dbl>
#> 1 Adelie    Biscoe     0.5   0.5   NA     
#> 2 Adelie    Dream      0.482 0.5    0.0179
#> 3 Adelie    Torgersen  0.462 0.442  0.0962
#> 4 Chinstrap Dream      0.5   0.5   NA     
#> 5 Gentoo    Biscoe     0.468 0.492  0.0403

^{Created on 2021-02-18 by the reprex package (v0.3.0)}

I'll have a look, another approach would be to instead take the result from across() and scale it, i.e.

library(palmerpenguins); library(tidyverse)
(tbl1 <- penguins %>%
    group_by(species, island, sex) %>%
    summarize(n = n()) %>%
    pivot_wider(names_from = sex, values_from = n))
#> `summarise()` has grouped output by 'species', 'island'. You can override using the `.groups` argument.
#> # A tibble: 5 x 5
#> # Groups:   species, island [5]
#>   species   island    female  male  `NA`
#>   <fct>     <fct>      <int> <int> <int>
#> 1 Adelie    Biscoe        22    22    NA
#> 2 Adelie    Dream         27    28     1
#> 3 Adelie    Torgersen     24    23     5
#> 4 Chinstrap Dream         34    34    NA
#> 5 Gentoo    Biscoe        58    61     5

scale <- function(x) {
  x / sum(x, na.rm = TRUE)
}

tbl1 %>%
  ungroup() %>%
  rowwise(species, island) %>%
  mutate(
    scale(across(everything()))
  )
#> # A tibble: 5 x 5
#> # Rowwise:  species, island
#>   species   island    female  male    `NA`
#>   <fct>     <fct>      <dbl> <dbl>   <dbl>
#> 1 Adelie    Biscoe     0.5   0.5   NA     
#> 2 Adelie    Dream      0.482 0.5    0.0179
#> 3 Adelie    Torgersen  0.462 0.442  0.0962
#> 4 Chinstrap Dream      0.5   0.5   NA     
#> 5 Gentoo    Biscoe     0.468 0.492  0.0403

^{Created on 2021-02-18 by the reprex package (v0.3.0)}

romainfrancois · 2021-02-18T14:29:49Z

Actually, there seems to be a confusion between across() (which would trigger expanding the expressions) and +across() which would not:

library(dplyr, warn.conflicts = FALSE)
df <- tibble(x = 2, y = 4, z = 8)

df %>% mutate_all(~ .x / y)
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   0.5     1     8
df %>% mutate(across(everything(), ~ .x / y))
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   0.5     1     8
# these do:
df %>% mutate(x = x / y, y = y / y, z = z / y)
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   0.5     1     8

df %>% mutate(+across(everything(), ~ .x / y))
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   0.5     1     2

# but this does:
df %>% mutate(data.frame(x = x / y, y = y / y, z = z / y))
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   0.5     1     2

^{Created on 2021-02-18 by the reprex package (v0.3.0)}

I believe that the last results are correct, and that the instrumented across() needs fixing, as well as some tests apparently 😱.

closes #5770

ahcyip · 2021-02-18T22:31:56Z

Thanks for addressing and fixing this!
Since I'm pivoting anyway, I should probably stick to regular group_by for now
group_by(a,b,c) %>% summarize(x/sum(x))
But looking forward to a future with a mature rowwise / group_by_row :)

ahcyip changed the title ~~rowwise mutate across with c_across seems to be doing something weird to some but not all the columns~~ rowwise mutate across with c_across seems to compute incorrectly for some but not all the columns Feb 18, 2021

ahcyip changed the title ~~rowwise mutate across with c_across seems to compute incorrectly for some but not all the columns~~ rowwise mutate across with c_across computes incorrectly for some but not all the columns (worked in dplyr 1.0.2 but not in 1.0.4) Feb 18, 2021

romainfrancois self-assigned this Feb 18, 2021

romainfrancois added the bug an unexpected problem or unintended behavior label Feb 18, 2021

romainfrancois added this to the 1.0.5 milestone Feb 18, 2021

romainfrancois added a commit that referenced this issue Feb 18, 2021

roll back to dev to deal with #5770

642186b

romainfrancois added a commit that referenced this issue Feb 18, 2021

Fix how across() works non-sequentially in mutate() and summarise()

a5fd510

closes #5770

romainfrancois mentioned this issue Feb 18, 2021

across() not operating sequentially #5773

Merged

romainfrancois closed this as completed in a30c9e4 Feb 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rowwise mutate across with c_across computes incorrectly for some but not all the columns (worked in dplyr 1.0.3 but not in 1.0.4) #5770

rowwise mutate across with c_across computes incorrectly for some but not all the columns (worked in dplyr 1.0.3 but not in 1.0.4) #5770

ahcyip commented Feb 18, 2021 •

edited

Loading

ahcyip commented Feb 18, 2021

ahcyip commented Feb 18, 2021

ahcyip commented Feb 18, 2021 •

edited

Loading

hadley commented Feb 18, 2021 •

edited

Loading

romainfrancois commented Feb 18, 2021

romainfrancois commented Feb 18, 2021

ahcyip commented Feb 18, 2021

rowwise mutate across with c_across computes incorrectly for some but not all the columns (worked in dplyr 1.0.3 but not in 1.0.4) #5770

rowwise mutate across with c_across computes incorrectly for some but not all the columns (worked in dplyr 1.0.3 but not in 1.0.4) #5770

Comments

ahcyip commented Feb 18, 2021 • edited Loading

ahcyip commented Feb 18, 2021

ahcyip commented Feb 18, 2021

ahcyip commented Feb 18, 2021 • edited Loading

hadley commented Feb 18, 2021 • edited Loading

romainfrancois commented Feb 18, 2021

romainfrancois commented Feb 18, 2021

ahcyip commented Feb 18, 2021

ahcyip commented Feb 18, 2021 •

edited

Loading

ahcyip commented Feb 18, 2021 •

edited

Loading

hadley commented Feb 18, 2021 •

edited

Loading