median and mean within pmap do not respect na.rm = TRUE when first element is NA #790

Closed
dhslone opened this issue Aug 23, 2020 · 3 comments

dhslone commented Aug 23, 2020

I searched the GitHub issues, and this may be related to #751.

When passing na.rm = TRUE to mean or median within pmap, the result is NA if the first element is NA. Using apply shows the expected behavior. sum, max, and min all agree between pmap and apply.

library(tidyverse)

df <- tribble(
  ~a, ~b, ~c,
  1, 2, 3,
  4, 5, 6,
  NA, 2, NA,
  2, NA, NA,
  NA, NA, NA
)
df %>% mutate(pm_med = pmap_dbl(list(a, b, c), median, na.rm = TRUE),
              ap_med = apply(select(df,c(a, b, c)), 1, median, na.rm = TRUE),
              pm_sum = pmap_dbl(list(a, b, c), sum, na.rm = TRUE),
              ap_sum = apply(select(df,c(a, b, c)), 1, sum, na.rm = TRUE))
#> # A tibble: 5 x 7
#>       a     b     c pm_med ap_med pm_sum ap_sum
#>   <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1     1     2     3      1      2      6      6
#> 2     4     5     6      4      5     15     15
#> 3    NA     2    NA     NA      2      2      2
#> 4     2    NA    NA      2      2      2      2
#> 5    NA    NA    NA     NA     NA      0      0

shringi commented Mar 6, 2021

@dhslone
I noticed that the median calculated inside pmap_dbl is wrong even when a row contains no NA.
For example, in the first row of df, pm_med should be 2 rather than 1, since median(c(1,2,3)) is 2. The ap_med column computes this correctly. pmap_dbl is essentially returning the first column as the result.
Following is the same reprex, restricted to the rows without NA:

df %>%
  slice(1:2) %>%
  mutate(pm_med = pmap_dbl(list(a, b, c), median, na.rm = TRUE),
         pm_mean = pmap_dbl(list(a, b, c), mean, na.rm = TRUE),
         pm_sum = pmap_dbl(list(a, b, c), sum, na.rm = TRUE))
#> # A tibble: 2 x 6
#>       a     b     c pm_med pm_mean pm_sum
#>   <dbl> <dbl> <dbl>  <dbl>   <dbl>  <dbl>
#> 1     1     2     3      1       1      6
#> 2     4     5     6      4       4     15

You can see above that the pm_med and pm_mean columns simply return the first column unchanged.
This is the same as the difference between median(1, 2, 3) and median(c(1, 2, 3)).
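
To make that comparison concrete, here is a minimal base R sketch (not part of the original reprex). It relies on median being defined as median(x, na.rm = FALSE, ...), so only the first positional argument is treated as data. The variable names are just for illustration.

```r
# Only the first positional argument becomes `x`; the others are matched
# to `na.rm` and `...` and never reach the data.
scalar_call <- median(1, 2, 3)       # x = 1
vector_call <- median(c(1, 2, 3))    # x = c(1, 2, 3)

# With a leading NA, na.rm = TRUE removes the only value in `x`,
# which mirrors row 3 of the pm_med column.
leading_na <- median(NA_real_, 2, NA_real_, na.rm = TRUE)

print(scalar_call)  # 1
print(vector_call)  # 2
print(leading_na)   # NA
```

This is the call shape pmap_dbl(list(a, b, c), median, na.rm = TRUE) produces for each row, which would explain why the first column comes back unchanged.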

I think users need to be careful about where they put na.rm = TRUE inside the pmap call.
Below are separate examples for median, mean, and sum that compare how each handles NA values. I placed na.rm = TRUE in various positions to contrast the expected and unexpected results.

# Median
df %>% mutate(ap_med = apply(select(df,c(a, b, c)), 1, median, na.rm = TRUE),
              pm_med = pmap_dbl(list(a, b, c), median, na.rm = TRUE),
              pm_med.1 = pmap_dbl(list(a, b, c), median, na.rm = FALSE),
              pm_med.2 = pmap_dbl(list(a, b, c), ~median(c(...))),
              pm_med.3 = pmap_dbl(list(a, b, c), ~median(c(...)), na.rm = TRUE),
              pm_med.4 = pmap_dbl(list(a, b, c), ~median(c(...), na.rm = TRUE)),
              pm_med.5 = pmap_dbl(list(a, b, c), ~median(c(...), na.rm = TRUE), na.rm = TRUE))
#> # A tibble: 5 x 10
#>       a     b     c ap_med pm_med pm_med.1 pm_med.2 pm_med.3 pm_med.4 pm_med.5
#>   <dbl> <dbl> <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1     1     2     3      2      1        1        2      1.5        2      1.5
#> 2     4     5     6      5      4        4        5      4.5        5      4.5
#> 3    NA     2    NA      2     NA       NA       NA     NA          2      1.5
#> 4     2    NA    NA      2      2        2       NA     NA          2      1.5
#> 5    NA    NA    NA     NA     NA       NA       NA     NA         NA      1

# Mean
df %>% mutate(ap_mean = apply(select(df,c(a, b, c)), 1, mean, na.rm = TRUE),
              # pm_mean = pmap_dbl(list(a, b, c), mean, na.rm = TRUE),    # Doesn't work, unlike median
              # pm_mean.1 = pmap_dbl(list(a, b, c), mean, na.rm = FALSE), # Doesn't work, unlike median
              pm_mean.2 = pmap_dbl(list(a, b, c), ~mean(c(...))),
              pm_mean.3 = pmap_dbl(list(a, b, c), ~mean(c(...)), na.rm = TRUE),
              pm_mean.4 = pmap_dbl(list(a, b, c), ~mean(c(...), na.rm = TRUE)),
              pm_mean.5 = pmap_dbl(list(a, b, c), ~mean(c(...), na.rm = TRUE), na.rm = TRUE))
#> # A tibble: 5 x 8
#>       a     b     c ap_mean pm_mean.2 pm_mean.3 pm_mean.4 pm_mean.5
#>   <dbl> <dbl> <dbl>   <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#> 1     1     2     3       2         2      1.75         2      1.75
#> 2     4     5     6       5         5      4            5      4   
#> 3    NA     2    NA       2        NA     NA            2      1.5 
#> 4     2    NA    NA       2        NA     NA            2      1.5 
#> 5    NA    NA    NA     NaN        NA     NA          NaN      1
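
A likely reason the bare mean calls fail on the full df, sketched in base R: mean.default is defined as mean.default(x, trim = 0, na.rm = FALSE, ...), so the second positional argument is taken as trim rather than as data. As far as I can tell, an NA in that position is what raises the error. The variable names here are illustrative only.

```r
# The second positional argument is matched to `trim`, not to the data.
# A trim of 0.5 or more makes mean() fall back to the median of `x`,
# so this returns the first value, as in the two-row example.
trimmed <- mean(1, 2, 3)

# Row 4 of df has the call shape mean(2, NA, NA, na.rm = TRUE): `trim` is NA.
# Capture the condition instead of letting it stop the script.
row4 <- tryCatch(
  mean(2, NA_real_, NA_real_, na.rm = TRUE),
  error = function(e) e
)

print(trimmed)                 # 1
print(inherits(row4, "error"))
```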

# Sum
df %>% mutate(ap_sum = apply(select(df,c(a, b, c)), 1, sum, na.rm = TRUE),
              pm_sum = pmap_dbl(list(a, b, c), sum, na.rm = TRUE),
              pm_sum.1 = pmap_dbl(list(a, b, c), sum, na.rm = FALSE),
              pm_sum.2 = pmap_dbl(list(a, b, c), ~sum(c(...))),
              pm_sum.3 = pmap_dbl(list(a, b, c), ~sum(c(...)), na.rm = TRUE),
              pm_sum.4 = pmap_dbl(list(a, b, c), ~sum(c(...), na.rm = TRUE)),
              pm_sum.5 = pmap_dbl(list(a, b, c), ~sum(c(...), na.rm = TRUE), na.rm = TRUE))
#> # A tibble: 5 x 10
#>       a     b     c ap_sum pm_sum pm_sum.1 pm_sum.2 pm_sum.3 pm_sum.4 pm_sum.5
#>   <dbl> <dbl> <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1     1     2     3      6      6        6        6        7        6        7
#> 2     4     5     6     15     15       15       15       16       15       16
#> 3    NA     2    NA      2      2       NA       NA       NA        2        3
#> 4     2    NA    NA      2      2       NA       NA       NA        2        3
#> 5    NA    NA    NA      0      0       NA       NA       NA        0        1

I feel this needs to be properly highlighted in the documentation with examples where it can go wrong.
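
For anyone puzzled by the .3 and .5 columns above (1.5, 1.75, 7): extra arguments given to pmap are forwarded to the mapped function, so na.rm = TRUE ends up inside the lambda's ... and c(...) treats it as one more data value. Here is a minimal sketch using the first two rows of df. It uses function(...) instead of the formula shorthand purely for clarity, and the names rows, captured, and inside are mine.

```r
library(purrr)

# Columns a, b, c for the first two rows of df.
rows <- list(c(1, 4), c(2, 5), c(3, 6))

# na.rm = TRUE given to pmap() is forwarded to the function, lands in `...`,
# and is combined with the row values by c(...) (TRUE is coerced to 1).
captured <- pmap(rows, function(...) c(...), na.rm = TRUE)

# Passing na.rm to the summary function itself keeps it out of the data.
inside <- pmap_dbl(rows, function(...) mean(c(...), na.rm = TRUE))

print(captured[[1]])        # values 1, 2, 3 plus an extra 1 named na.rm
print(mean(captured[[1]]))  # 1.75, matching pm_mean.3
print(inside)               # 2 5
```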

Created on 2021-03-06 by the reprex package (v1.0.0)

dhslone (Author) commented Mar 7, 2021

@shringi

I was so focused on the NA behavior that I did not notice the other problems! I have been using the more verbose ~ formulation as much as possible, and you are reinforcing that. I prefer things to break rather than silently give an unexpected result.
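
One way to keep the verbose form out of every call is a small wrapper that always hands the row to the summary function as a single vector. This is only a sketch: row_summary is a hypothetical helper, not something purrr provides, and it assumes the summary function returns a single number for each row.

```r
library(purrr)

# Hypothetical helper (not part of purrr): apply `fn` to each row formed from
# parallel vectors, always passing the row values as one vector.
row_summary <- function(cols, fn, ...) {
  extra <- list(...)  # arguments meant for `fn`, such as na.rm = TRUE
  pmap_dbl(unname(cols), function(...) {
    # c(...) here holds only the row values; `extra` is appended separately,
    # so na.rm can never be mistaken for data.
    do.call(fn, c(list(c(...)), extra))
  })
}

cols <- list(
  a = c(1, 4, NA, 2, NA),
  b = c(2, 5, 2, NA, NA),
  c = c(3, 6, NA, NA, NA)
)

row_medians <- row_summary(cols, median, na.rm = TRUE)
print(row_medians)  # 2 5 2 2 NA
```

Inside mutate this would be called as row_summary(list(a, b, c), median, na.rm = TRUE), which should match the ap_med column.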

hadley (Member) commented Aug 24, 2022

Yes, this is an unfortunate problem with median:

median(1, 2, 3, 4)
#> [1] 1

Created on 2022-08-24 by the reprex package (v2.0.1)

hadley closed this as completed Aug 24, 2022