median and mean within pmap do not respect na.rm = TRUE when first element is NA #790

dhslone · 2020-08-23T17:34:07Z

I searched the GitHub issues, and this may be related to #751.

When na.rm = TRUE is passed to mean or median within pmap and the first element is NA, the result is NA. Using apply shows the expected behavior. sum, max, and min all agree between pmap and apply.

library(tidyverse)

df <- tribble(
  ~a, ~b, ~c,
  1, 2, 3,
  4, 5, 6,
  NA, 2, NA,
  2, NA, NA,
  NA, NA, NA
)
df %>% mutate(pm_med = pmap_dbl(list(a, b, c), median, na.rm = TRUE),
              ap_med = apply(select(df,c(a, b, c)), 1, median, na.rm = TRUE),
              pm_sum = pmap_dbl(list(a, b, c), sum, na.rm = TRUE),
              ap_sum = apply(select(df,c(a, b, c)), 1, sum, na.rm = TRUE))
#> # A tibble: 5 x 7
#>       a     b     c pm_med ap_med pm_sum ap_sum
#>   <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1     1     2     3      1      2      6      6
#> 2     4     5     6      4      5     15     15
#> 3    NA     2    NA     NA      2      2      2
#> 4     2    NA    NA      2      2      2      2
#> 5    NA    NA    NA     NA     NA      0      0

shringi · 2021-03-06T16:19:19Z

@dhslone
I noticed that the median calculated inside pmap_dbl is wrong even when a row contains no NA.
For example, in the first row of df, pm_med should be 2 rather than 1, since median(c(1,2,3)) is 2. The ap_med column calculates this correctly. pm_med is essentially returning the first column.
Here is the same reprex restricted to the rows without NA:

df %>%
  slice(1:2) %>%
  mutate(pm_med = pmap_dbl(list(a, b, c), median, na.rm = TRUE),
         pm_mean = pmap_dbl(list(a, b, c), mean, na.rm = TRUE),
         pm_sum = pmap_dbl(list(a, b, c), sum, na.rm = TRUE))
#> # A tibble: 2 x 6
#>       a     b     c pm_med pm_mean pm_sum
#>   <dbl> <dbl> <dbl>  <dbl>   <dbl>  <dbl>
#> 1     1     2     3      1       1      6
#> 2     4     5     6      4       4     15

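As shown above, the pm_med and pm_mean columns simply return the first column unchanged.
This is the same as the difference between median(1, 2, 3) and median(c(1, 2, 3)): pmap passes each column's value as a separate argument, and median treats only its first argument as data.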
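The difference can be reproduced in base R without pmap. This is a minimal sketch that assumes the standard signatures median(x, na.rm = FALSE, ...) and mean.default(x, trim = 0, na.rm = FALSE, ...); the variable names are only for illustration.

```r
# median() treats only its first positional argument as data;
# anything after it falls into `...` and is ignored.
formal_names <- names(formals(median))   # "x" "na.rm" "..."

med_args <- median(1, 2, 3)      # x = 1, so the result is 1
med_vec  <- median(c(1, 2, 3))   # x = c(1, 2, 3), so the result is 2

# With NA first, na.rm = TRUE removes the only element of x,
# and the median of an empty vector is NA.
med_na <- median(NA_real_, 2, NA_real_, na.rm = TRUE)

# For mean(), the second positional argument is matched to `trim`,
# so the data is again just the first value.
mean_args <- mean(1, 2, 3)       # x = 1, trim = 2
mean_vec  <- mean(c(1, 2, 3))    # 2

print(c(med_args = med_args, med_vec = med_vec,
        mean_args = mean_args, mean_vec = mean_vec))
```

sum, max, and min take their data through `...`, which is likely why they agree with apply in the original report.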

I think users need to be careful about where they put na.rm = TRUE in the pmap call.
Below are separate examples for median, mean, and sum for comparison. I placed na.rm = TRUE in various positions to highlight where the results are expected and where they are not.

# Median
df %>% mutate(ap_med = apply(select(df,c(a, b, c)), 1, median, na.rm = TRUE),
              pm_med = pmap_dbl(list(a, b, c), median, na.rm = TRUE),
              pm_med.1 = pmap_dbl(list(a, b, c), median, na.rm = FALSE),
              pm_med.2 = pmap_dbl(list(a, b, c), ~median(c(...))),
              pm_med.3 = pmap_dbl(list(a, b, c), ~median(c(...)), na.rm = TRUE),
              pm_med.4 = pmap_dbl(list(a, b, c), ~median(c(...), na.rm = TRUE)),
              pm_med.5 = pmap_dbl(list(a, b, c), ~median(c(...), na.rm = TRUE), na.rm = TRUE))
#> # A tibble: 5 x 10
#>       a     b     c ap_med pm_med pm_med.1 pm_med.2 pm_med.3 pm_med.4 pm_med.5
#>   <dbl> <dbl> <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1     1     2     3      2      1        1        2      1.5        2      1.5
#> 2     4     5     6      5      4        4        5      4.5        5      4.5
#> 3    NA     2    NA      2     NA       NA       NA     NA          2      1.5
#> 4     2    NA    NA      2      2        2       NA     NA          2      1.5
#> 5    NA    NA    NA     NA     NA       NA       NA     NA         NA      1

# Mean
df %>% mutate(ap_mean = apply(select(df,c(a, b, c)), 1, mean, na.rm = TRUE),
 #pm_mean = pmap_dbl(list(a, b, c), mean, na.rm = TRUE),    # Doesn't work unlike median
 #pm_mean.1 = pmap_dbl(list(a, b, c), mean, na.rm = FALSE), # Doesn't work unlike median
  pm_mean.2 = pmap_dbl(list(a, b, c), ~mean(c(...))),
  pm_mean.3 = pmap_dbl(list(a, b, c), ~mean(c(...)), na.rm = TRUE),
  pm_mean.4 = pmap_dbl(list(a, b, c), ~mean(c(...), na.rm = TRUE)),
  pm_mean.5 = pmap_dbl(list(a, b, c), ~mean(c(...), na.rm = TRUE), na.rm = TRUE))
#> # A tibble: 5 x 8
#>       a     b     c ap_mean pm_mean.2 pm_mean.3 pm_mean.4 pm_mean.5
#>   <dbl> <dbl> <dbl>   <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#> 1     1     2     3       2         2      1.75         2      1.75
#> 2     4     5     6       5         5      4            5      4   
#> 3    NA     2    NA       2        NA     NA            2      1.5 
#> 4     2    NA    NA       2        NA     NA            2      1.5 
#> 5    NA    NA    NA     NaN        NA     NA          NaN      1

# Sum
df %>% mutate(ap_sum = apply(select(df,c(a, b, c)), 1, sum, na.rm = TRUE),
              pm_sum = pmap_dbl(list(a, b, c), sum, na.rm = TRUE),
              pm_sum.1 = pmap_dbl(list(a, b, c), sum, na.rm = FALSE),
              pm_sum.2 = pmap_dbl(list(a, b, c), ~sum(c(...))),
              pm_sum.3 = pmap_dbl(list(a, b, c), ~sum(c(...)), na.rm = TRUE),
              pm_sum.4 = pmap_dbl(list(a, b, c), ~sum(c(...), na.rm = TRUE)),
              pm_sum.5 = pmap_dbl(list(a, b, c), ~sum(c(...), na.rm = TRUE), na.rm = TRUE))
#> # A tibble: 5 x 10
#>       a     b     c ap_sum pm_sum pm_sum.1 pm_sum.2 pm_sum.3 pm_sum.4 pm_sum.5
#>   <dbl> <dbl> <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1     1     2     3      6      6        6        6        7        6        7
#> 2     4     5     6     15     15       15       15       16       15       16
#> 3    NA     2    NA      2      2       NA       NA       NA        2        3
#> 4     2    NA    NA      2      2       NA       NA       NA        2        3
#> 5    NA    NA    NA      0      0       NA       NA       NA        0        1
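The .3 and .5 columns in the tables above are off by one because the extra na.rm = TRUE given to pmap_dbl is forwarded to the lambda, where it lands inside c(...) and is coerced to 1. A base R sketch of the same effect, with sum_all and sum_na as hypothetical stand-ins for the two lambdas:

```r
# Stand-in for ~ sum(c(...)): every argument, named or not, ends up in c(...).
sum_all <- function(...) sum(c(...))

combined <- c(1, 2, 3, na.rm = TRUE)            # TRUE becomes a fourth value, 1
plain    <- sum_all(1, 2, 3)                    # 6
leaked   <- sum_all(1, 2, 3, na.rm = TRUE)      # 7: the flag is summed as data

# Stand-in for ~ sum(c(...), na.rm = TRUE): the flag stays an option.
sum_na <- function(...) sum(c(...), na.rm = TRUE)
kept <- sum_na(NA, 2, NA)                       # 2

print(combined)
print(c(plain = plain, leaked = leaked, kept = kept))
```

The same coercion explains the 1.5 and 1.75 values in the median and mean tables.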

I feel the documentation should highlight this properly, with examples of where it can go wrong.

Created on 2021-03-06 by the reprex package (v1.0.0)

dhslone · 2021-03-07T19:36:34Z

@shringi

I was so focused on the NA behavior that I did not notice the other problems! I have been using the more verbose ~ formulation as much as possible, and you are reinforcing that. I prefer things to break rather than silently give an unexpected result.
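A minimal sketch of both ideas, the explicit ~ formulation and failing loudly, using the data from the original report. median_strict is a hypothetical helper, not part of purrr or base R; it has no `...` in its signature, so extra positional arguments raise an error instead of being dropped silently.

```r
library(purrr)

cols <- list(
  c(1, 4, NA, 2, NA),    # a
  c(2, 5, 2, NA, NA),    # b
  c(3, 6, NA, NA, NA)    # c
)

# Explicit lambda: gather the row into one vector and keep na.rm inside the call.
pm_med <- pmap_dbl(cols, ~ median(c(...), na.rm = TRUE))
print(pm_med)
#> [1]  2  5  2  2 NA

# Hypothetical helper without `...`: stray positional arguments are an error.
median_strict <- function(x, na.rm = FALSE) stats::median(x, na.rm = na.rm)

strict_result <- tryCatch(
  pmap_dbl(cols, median_strict, na.rm = TRUE),
  error = function(e) "failed"
)
print(strict_result)
```

The first result matches the pm_med.4 column above; the second call fails because each row arrives as three separate arguments rather than one vector.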

hadley · 2022-08-24T10:26:42Z

Yes, this is an unfortunate problem with median:

median(1, 2, 3, 4)
#> [1] 1

Created on 2022-08-24 by the reprex package (v2.0.1)

hadley closed this as completed Aug 24, 2022
