Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rowwise mutate across with c_across computes incorrectly for some but not all the columns (worked in dplyr 1.0.3 but not in 1.0.4) #5770

Closed
ahcyip opened this issue Feb 18, 2021 · 7 comments
Assignees
Labels
bug an unexpected problem or unintended behavior
Milestone

Comments

@ahcyip
Copy link
Contributor

ahcyip commented Feb 18, 2021

I was trying to compute proportions using rowwise, mutate, across, and c_across, but I'm only getting expected results in the female column and not in the male or NA columns.

#install.packages("palmerpenguins")
library(palmerpenguins); library(tidyverse)
(tbl1 <- penguins %>%
  group_by(species, island, sex) %>%
  summarize(n = n()) %>%
  pivot_wider(names_from = sex, values_from = n))
#> `summarise()` has grouped output by 'species', 'island'. You can override using the `.groups` argument.
#> # A tibble: 5 x 5
#> # Groups:   species, island [5]
#>   species   island    female  male  `NA`
#>   <fct>     <fct>      <int> <int> <int>
#> 1 Adelie    Biscoe        22    22    NA
#> 2 Adelie    Dream         27    28     1
#> 3 Adelie    Torgersen     24    23     5
#> 4 Chinstrap Dream         34    34    NA
#> 5 Gentoo    Biscoe        58    61     5
(tbl2 <- tbl1 %>%
  ungroup() %>%
  rowwise(species, island) %>%
  mutate(across(everything(), ~ . / sum(c_across(everything()), na.rm = T))))
#> # A tibble: 5 x 5
#> # Rowwise:  species, island
#>   species   island    female  male   `NA`
#>   <fct>     <fct>      <dbl> <dbl>  <dbl>
#> 1 Adelie    Biscoe     0.5   0.978 NA    
#> 2 Adelie    Dream      0.482 0.950  0.411
#> 3 Adelie    Torgersen  0.462 0.808  0.797
#> 4 Chinstrap Dream      0.5   0.986 NA    
#> 5 Gentoo    Biscoe     0.468 0.918  0.783
(tbl3 <- tbl1 %>%
  ungroup() %>%
  rowwise(species, island) %>%
  mutate(across(female:male, ~ . / sum(c_across(female:male), na.rm = T))))
#> # A tibble: 5 x 5
#> # Rowwise:  species, island
#>   species   island    female  male  `NA`
#>   <fct>     <fct>      <dbl> <dbl> <int>
#> 1 Adelie    Biscoe     0.5   0.978    NA
#> 2 Adelie    Dream      0.491 0.983     1
#> 3 Adelie    Torgersen  0.511 0.978     5
#> 4 Chinstrap Dream      0.5   0.986    NA
#> 5 Gentoo    Biscoe     0.487 0.992     5

Created on 2021-02-18 by the reprex package (v1.0.0)

Session info
sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] forcats_0.5.1        stringr_1.4.0        dplyr_1.0.4         
#>  [4] purrr_0.3.4          readr_1.4.0          tidyr_1.1.2         
#>  [7] tibble_3.0.6         ggplot2_3.3.3        tidyverse_1.3.0     
#> [10] palmerpenguins_0.1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.6        cellranger_1.1.0  compiler_4.0.4    pillar_1.4.7     
#>  [5] dbplyr_2.1.0      highr_0.8         tools_4.0.4       digest_0.6.27    
#>  [9] lubridate_1.7.9.2 jsonlite_1.7.2    evaluate_0.14     lifecycle_1.0.0  
#> [13] gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.10      reprex_1.0.0     
#> [17] cli_2.3.0         DBI_1.1.1         rstudioapi_0.13   yaml_2.2.1       
#> [21] haven_2.3.1       xfun_0.20         withr_2.4.1       xml2_1.3.2       
#> [25] httr_1.4.2        knitr_1.31        hms_1.0.0         generics_0.1.0   
#> [29] fs_1.5.0          vctrs_0.3.6       grid_4.0.4        tidyselect_1.1.0 
#> [33] glue_1.4.2        R6_2.5.0          fansi_0.4.2       readxl_1.3.1     
#> [37] rmarkdown_2.6     modelr_0.1.8      magrittr_2.0.1    scales_1.1.1     
#> [41] backports_1.2.1   ps_1.5.0          htmltools_0.5.1.1 ellipsis_0.3.1   
#> [45] rvest_0.3.6       assertthat_0.2.1  colorspace_2.0-0  utf8_1.1.4       
#> [49] stringi_1.5.3     munsell_0.5.0     broom_0.7.4       crayon_1.4.1
@ahcyip ahcyip changed the title rowwise mutate across with c_across seems to be doing something weird to some but not all the columns rowwise mutate across with c_across seems to compute incorrectly for some but not all the columns Feb 18, 2021
@ahcyip
Copy link
Contributor Author

ahcyip commented Feb 18, 2021

Expected result for tbl2:
0.5 0.5 NA
0.482 0.5, 0.018
etc.

@ahcyip
Copy link
Contributor Author

ahcyip commented Feb 18, 2021

Worked as expected in dplyr 1.0.2

image

@ahcyip ahcyip changed the title rowwise mutate across with c_across seems to compute incorrectly for some but not all the columns rowwise mutate across with c_across computes incorrectly for some but not all the columns (worked in dplyr 1.0.2 but not in 1.0.4) Feb 18, 2021
@ahcyip
Copy link
Contributor Author

ahcyip commented Feb 18, 2021

I just tried dplyr 1.0.4.9000 and it still appears to be broken. And I tried dplyr 1.0.3 and it worked fine there too.

@ahcyip ahcyip changed the title rowwise mutate across with c_across computes incorrectly for some but not all the columns (worked in dplyr 1.0.2 but not in 1.0.4) rowwise mutate across with c_across computes incorrectly for some but not all the columns (worked in dplyr 1.0.3 but not in 1.0.4) Feb 18, 2021
@hadley
Copy link
Member

hadley commented Feb 18, 2021

I suspect the problem is this: for across(everything(), ~ . / sum(c_across(everything()) to work, you are relying critically on the c_across() to be evaluated once before each of the individual divisions take place. That seems like a dangerous assumption so I’d suggest computing the sum first.

@romainfrancois
Copy link
Member

🤔 this looks like a bug in the instrumentation of across() with the internal top_across(), if we disable it with e.g. + we get expected result:

library(palmerpenguins); library(tidyverse)
(tbl1 <- penguins %>%
    group_by(species, island, sex) %>%
    summarize(n = n()) %>%
    pivot_wider(names_from = sex, values_from = n))
#> `summarise()` has grouped output by 'species', 'island'. You can override using the `.groups` argument.
#> # A tibble: 5 x 5
#> # Groups:   species, island [5]
#>   species   island    female  male  `NA`
#>   <fct>     <fct>      <int> <int> <int>
#> 1 Adelie    Biscoe        22    22    NA
#> 2 Adelie    Dream         27    28     1
#> 3 Adelie    Torgersen     24    23     5
#> 4 Chinstrap Dream         34    34    NA
#> 5 Gentoo    Biscoe        58    61     5

tbl1 %>%
  ungroup() %>%
  rowwise(species, island) %>%
  mutate(across(everything(), ~ . / sum(c_across(everything()), na.rm = T)))
#> # A tibble: 5 x 5
#> # Rowwise:  species, island
#>   species   island    female  male   `NA`
#>   <fct>     <fct>      <dbl> <dbl>  <dbl>
#> 1 Adelie    Biscoe     0.5   0.978 NA    
#> 2 Adelie    Dream      0.482 0.950  0.411
#> 3 Adelie    Torgersen  0.462 0.808  0.797
#> 4 Chinstrap Dream      0.5   0.986 NA    
#> 5 Gentoo    Biscoe     0.468 0.918  0.783

tbl1 %>%
  ungroup() %>%
  rowwise(species, island) %>%
  mutate(+across(everything(), ~ . / sum(c_across(everything()), na.rm = T)))
#> # A tibble: 5 x 5
#> # Rowwise:  species, island
#>   species   island    female  male    `NA`
#>   <fct>     <fct>      <dbl> <dbl>   <dbl>
#> 1 Adelie    Biscoe     0.5   0.5   NA     
#> 2 Adelie    Dream      0.482 0.5    0.0179
#> 3 Adelie    Torgersen  0.462 0.442  0.0962
#> 4 Chinstrap Dream      0.5   0.5   NA     
#> 5 Gentoo    Biscoe     0.468 0.492  0.0403

Created on 2021-02-18 by the reprex package (v0.3.0)

I'll have a look, another approach would be to instead take the result from across() and scale it, i.e.

library(palmerpenguins); library(tidyverse)
(tbl1 <- penguins %>%
    group_by(species, island, sex) %>%
    summarize(n = n()) %>%
    pivot_wider(names_from = sex, values_from = n))
#> `summarise()` has grouped output by 'species', 'island'. You can override using the `.groups` argument.
#> # A tibble: 5 x 5
#> # Groups:   species, island [5]
#>   species   island    female  male  `NA`
#>   <fct>     <fct>      <int> <int> <int>
#> 1 Adelie    Biscoe        22    22    NA
#> 2 Adelie    Dream         27    28     1
#> 3 Adelie    Torgersen     24    23     5
#> 4 Chinstrap Dream         34    34    NA
#> 5 Gentoo    Biscoe        58    61     5

scale <- function(x) {
  x / sum(x, na.rm = TRUE)
}

tbl1 %>%
  ungroup() %>%
  rowwise(species, island) %>%
  mutate(
    scale(across(everything()))
  )
#> # A tibble: 5 x 5
#> # Rowwise:  species, island
#>   species   island    female  male    `NA`
#>   <fct>     <fct>      <dbl> <dbl>   <dbl>
#> 1 Adelie    Biscoe     0.5   0.5   NA     
#> 2 Adelie    Dream      0.482 0.5    0.0179
#> 3 Adelie    Torgersen  0.462 0.442  0.0962
#> 4 Chinstrap Dream      0.5   0.5   NA     
#> 5 Gentoo    Biscoe     0.468 0.492  0.0403

Created on 2021-02-18 by the reprex package (v0.3.0)

@romainfrancois romainfrancois self-assigned this Feb 18, 2021
@romainfrancois romainfrancois added the bug an unexpected problem or unintended behavior label Feb 18, 2021
@romainfrancois romainfrancois added this to the 1.0.5 milestone Feb 18, 2021
@romainfrancois
Copy link
Member

Actually, there seems to be a confusion between across() (which would trigger expanding the expressions) and +across() which would not:

library(dplyr, warn.conflicts = FALSE)
df <- tibble(x = 2, y = 4, z = 8)

df %>% mutate_all(~ .x / y)
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   0.5     1     8
df %>% mutate(across(everything(), ~ .x / y))
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   0.5     1     8
# these do:
df %>% mutate(x = x / y, y = y / y, z = z / y)
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   0.5     1     8

df %>% mutate(+across(everything(), ~ .x / y))
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   0.5     1     2

# but this does:
df %>% mutate(data.frame(x = x / y, y = y / y, z = z / y))
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   0.5     1     2

Created on 2021-02-18 by the reprex package (v0.3.0)

I believe that the last results are correct, and that the instrumented across() needs fixing, as well as some tests apparently 😱.

@ahcyip
Copy link
Contributor Author

ahcyip commented Feb 18, 2021

Thanks for addressing and fixing this!
Since I'm pivoting anyway, I should probably stick to regular group_by for now
group_by(a,b,c) %>% summarize(x/sum(x))
But looking forward to a future with a mature rowwise / group_by_row :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants