Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data.table::if_else() or data.table::iif() #3657

Closed
shrektan opened this issue Jun 25, 2019 · 12 comments · Fixed by #3678
Closed

data.table::if_else() or data.table::iif() #3657

shrektan opened this issue Jun 25, 2019 · 12 comments · Fixed by #3678

Comments

@shrektan
Copy link
Member

shrektan commented Jun 25, 2019

base::ifelse() is notorious slow and will change the attributes somehow.

ifelse(TRUE, as.Date('2019-06-25'), as.Date('2019-06-25'))
# [1] 18072

dplyr implements dplyr::if_else() which I think is good. However, I always reluctant to use dplyr because of it's heavy dependencies.

So I'm proposing a feature request here... I think both data.table::if_else() and data.table::iif() are good name candidates (though the former conflicts with dplyr::if_else())

@jangorecki jangorecki changed the title [FR] data.table::if_else() or data.table::iif() data.table::if_else() or data.table::iif() Jun 26, 2019
@jangorecki
Copy link
Member

alternatively fifelse

@HughParsonage
Copy link
Member

For comparison, hutils::if_else

from the description:

Lightweight dplyr::if_else with the virtues and vices that come from such an approach.

@2005m
Copy link
Contributor

2005m commented Jul 1, 2019

I have shared with Matt a prototype implemented in C .
It has good performance without using openmp. (Up to x10 faster than base::ifelse and dplyr::if_else)
I also have a version that uses openmp which is useful when inputs are large.
I have tried to stick to the philosophy of the data.table package as much as I am aware of it.
The prototype is compatible with R 3.1.0 and has no dependency.

Now, base::ifelse and dplyr::if_else have different behaviours.
I am trying to list all the use cases of a "fifelse" function and check which behaviours is wanted and which is not.Below you will find some examples.

Example 1: (base::ifelse does not keep dates format)

dates <- as.Date(c("2011-01-01","2011-01-02","2011-01-03","2011-01-04","2011-01-05"))

dplyr::if_else(dates == "2011-01-01", dates - 1, dates)
#[1] "2010-12-31" "2011-01-02" "2011-01-03" "2011-01-04" "2011-01-05"

base::ifelse(dates == "2011-01-01", dates - 1, dates)
#[1] 14974 14976 14977 14978 14979

Which behaviour do we want? My prototype currently outputs the same as dplyr which in
my view makes more sense but I can add a flag to have the behaviour of base::ifelse as well?

Example 2: (base::ifelse does not keep factor class)

v <- factor(sample(letters[1:5], 10, replace = TRUE))

dplyr::if_else(v %in% c("a","b","c"), v, factor(NA))
#[1] <NA> c    b    <NA> <NA> b    c    c    c    <NA>
# Levels: b c d e

base::ifelse(v %in% c("a","b","c"), v, factor(NA))
#[1] NA  2  1 NA NA  1  2  2  2 NA

Again, dplyr::if_else makes more sense to me but nothing prevents us from having
R base behaviour with a flag?

Example 3:

z <- factor(1:3, labels = letters[1:3])

dplyr::if_else(z == "a", z[[2]], list(z))
#Error: `false` must be a `factor` object, not a list
#Call `rlang::last_error()` to see a backtrace

base::ifelse(z == "a", z[[2]], list(z))
#[[1]]
#[1] 2
# 
#[[2]]
#[1] a b c
#Levels: a b c
# 
#[[3]]
#[1] a b c
#Levels: a b c

In the above example, dplyr::if_else returns an error however base::ifelse returns a list of integer and factors which I do not believe is wright. I am not sure what makes more sense from the below options? (please feel free to suggest something)

Option 1:( equivalent to dplyr::if_else(z == "a", list(z[[2]]), list(z)) )

#[[1]]
#[1] b
#Levels: a b c
# 
#[[2]]
#[1] a b c
#Levels: a b c
# 
#[[3]]
#[1] a b c
#Levels: a b c

Option 2:

#[[1]]
#[1] 2
# 
#[[2]]
#[1] 1 2 3
# 
#[[3]]
#[1] 1 2 3

Option 3:

#[[1]]
#[1] b
#Levels: a b c
# 
#[[2]]
#[1] b
#Levels: a b c
# 
#[[3]]
#[1] c
#Levels: a b c

Option 4:

#[[1]]
#[1] 2
# 
#[[2]]
#[1] 2
# 
#[[3]]
#[1] 3

Option 1 can be turned into option 2 with a flag, so does option 3 with 4.
Or do we want all 4 options?

Example 4:
The below one is a mystery to me and I did not investigate why it has this behaviour...
(dplyr::if_else returns an error)

base::ifelse(c(-5L:5L) < 0, list(1), list(2, 3))
# [[1]]
# [1] 1
# 
# [[2]]
# [1] 1
# 
# [[3]]
# [1] 1
# 
# [[4]]
# [1] 1
# 
# [[5]]
# [1] 1
# 
# [[6]]
# [1] 3
# 
# [[7]]
# [1] 2
# 
# [[8]]
# [1] 3
# 
# [[9]]
# [1] 2
# 
# [[10]]
# [1] 3
# 
# [[11]]
# [1] 2

Please let me know your thoughts.

@MichaelChirico
Copy link
Member

MichaelChirico commented Jul 1, 2019

Example 1 should definitely retain Date class. Ditto Example 2, but there's an extremely thorny issue of when !identical(levels(true_vals), levels(false_vals)). We dodged this in coalesce by forcing the levels of all replacement options to be identical, but this is slightly restrictive and slightly inconvenient.

Not to mention that, in a grouping context we have to be extra careful...

Example 3 looks similar?

Example 4 also doesn't make sense to me.

I'm not sure how much attention we should devote to handling list output just yet, in any case.

Please proceed to file a PR -- that way all members could have a glance at the code and help move things forward from there.

@MichaelChirico
Copy link
Member

Related: #673 and #2677 (of which this current issue is nearly a duplicate)

Please note also the discussion in #2677 about possibly patching ifelse into base R and the related discussion on r-devel about why not to do that so hastily, it should help inform the discussion of edge cases for any data.table implementation

@2005m
Copy link
Contributor

2005m commented Jul 1, 2019

Thank you for these useful links. It shows the need for a faster and more robust implementation. I will study them and share my code. Please be patient as I am working on this in my spare time. Thank you.

@hadley
Copy link
Contributor

hadley commented Jul 1, 2019

You may also want consider vctrs::if_else(), https://vctrs.r-lib.org/articles/stability.html#ifelse, specifically for the thinking about what types ifelse "should" return.

@2005m
Copy link
Contributor

2005m commented Jul 1, 2019

Give that a try and let me know your thoughts: devtools::install_github("2005m/data.table",ref = "fifelse")
FYI I am not using openmp (in this version).
Critics, feedback and recommendations are welcome.
Please find below some benchmarks.
@hadley , sorry I did not look at vctrs::if_else, but feel free to provide benchmarks...

x   <- c(-5e4L:5e4L) < 0
len <- length(x)

microbenchmark::microbenchmark(
  data.table::fifelse(x, 1L, 0L),
  ifelse(x, 1L, 0L),
  dplyr::if_else(x, 1L, 0L),
  hutils::if_else(x, 1L, 0L),
  times = 100L
)
# Unit: microseconds
#                            expr      min        lq      mean    median        uq       max neval
#  data.table::fifelse(x, 1L, 0L)  151.380  190.0810  376.5134  264.4880  345.0960  7378.278   100
#               ifelse(x, 1L, 0L) 2271.984 2340.8315 3499.8378 3129.3760 3502.9080 18044.582   100
#       dplyr::if_else(x, 1L, 0L) 2422.935 2460.3535 4257.2123 3459.5040 4307.9165 25538.319   100
#      hutils::if_else(x, 1L, 0L)  610.652  643.3655  838.1112  729.7465  925.8135  1919.191   100

microbenchmark::microbenchmark(
  data.table::fifelse(x, rep(1,len), rep(0,len)),
  ifelse(x, rep(1,len), rep(0,len)),
  dplyr::if_else(x, rep(1,len), rep(0,len)),
  hutils::if_else(x, rep(1,len), rep(0,len)),
  times = 100L
)
# Unit: microseconds
#                                              expr      min        lq     mean   median       uq      max neval
#  data.table::fifelse(x, rep(1, len), rep(0, len))  668.382  710.2895 1336.228  746.210 1281.813 14152.75   100
#               ifelse(x, rep(1, len), rep(0, len)) 2812.932 2841.3690 4172.893 2892.256 5575.617 19014.44   100
#       dplyr::if_else(x, rep(1, len), rep(0, len)) 3899.104 3930.7490 6652.899 3973.726 7177.721 82113.81   100
#      hutils::if_else(x, rep(1, len), rep(0, len)) 1364.559 1392.1410 1942.565 1419.723 1830.886  5875.17   100

microbenchmark::microbenchmark(
  data.table::fifelse(x, TRUE, FALSE),
  ifelse(x, TRUE, FALSE),
  dplyr::if_else(x, TRUE, FALSE),
  hutils::if_else(x, TRUE, FALSE),
  times = 100L
)
# Unit: microseconds
#                                 expr      min        lq      mean    median       uq       max neval
#  data.table::fifelse(x, TRUE, FALSE)  150.953  164.6365  327.2636  195.6395  337.826  4358.804   100
#               ifelse(x, TRUE, FALSE) 1945.704 1980.7695 2669.5270 2010.2765 2760.974 19098.682   100
#       dplyr::if_else(x, TRUE, FALSE) 2425.502 2444.3170 4363.1057 2475.7480 4892.696 69946.111   100
#      hutils::if_else(x, TRUE, FALSE)  606.803  639.7305  829.6400  658.9740  934.152  2741.091   100

microbenchmark::microbenchmark(
  data.table::fifelse(x, "TRUE", "FALSE"),
  ifelse(x, "TRUE", "FALSE"),
  dplyr::if_else(x, "TRUE", "FALSE"),
  hutils::if_else(x, "TRUE", "FALSE"),
  times = 100L
)
# Unit: milliseconds
#                                     expr       min        lq      mean    median        uq       max neval
#  data.table::fifelse(x, "TRUE", "FALSE")  1.038706  1.071206  1.205268  1.089809  1.147752  5.927340   100
#               ifelse(x, "TRUE", "FALSE") 36.581362 37.241832 41.123822 37.802878 39.733828 74.862540   100
#       dplyr::if_else(x, "TRUE", "FALSE")  5.152052  5.236294  7.133082  5.385750  6.709898 76.003022   100
#      hutils::if_else(x, "TRUE", "FALSE")  1.286303  1.323079  1.574507  1.345315  1.459065  5.457805   100

dates <- as.Date(rep(c("2011-01-01","2011-01-02","2011-01-03",
                       "2011-01-04","2011-01-05"),1e4))

# WARNING: base ifelse does not produce expected result below
microbenchmark::microbenchmark(
  data.table::fifelse(dates == "2011-01-01", dates - 1, dates),
  ifelse(dates == "2011-01-01", dates - 1, dates),
  dplyr::if_else(dates == "2011-01-01", dates - 1, dates),
  hutils::if_else(dates == "2011-01-01", dates - 1, dates),
  times = 100L
)
# Unit: microseconds
#                                                          expr      min        lq     mean   median        uq       max neval
#  data.table::fifelse(dates == "2011-01-01", dates - 1, dates)  785.552  840.5015 2452.754  882.837  912.7705  68726.52   100
#               ifelse(dates == "2011-01-01", dates - 1, dates) 2696.617 2728.4755 3849.316 2764.609 2801.3855  82019.74   100
#       dplyr::if_else(dates == "2011-01-01", dates - 1, dates) 2686.782 2722.2745 3870.595 2772.307 2845.2175  69898.22   100
#      hutils::if_else(dates == "2011-01-01", dates - 1, dates) 1113.114 1140.4815 4477.248 1187.521 1204.4120 180626.27   100

v <- rep(factor(sample(letters[1:5], 10, replace = TRUE)),1e4)

# WARNING: base ifelse does not produce expected result below
# hutils::if_else excluded as Error
microbenchmark::microbenchmark(
  data.table::fifelse(v %in% c("a","b","c"), v, factor(NA)),
  ifelse(v %in% c("a","b","c"), v, factor(NA)),
  dplyr::if_else(v %in% c("a","b","c"), v, factor(NA)),
  times = 100L
)
# Unit: milliseconds
#                                                         expr      min       lq      mean   median        uq      max neval
#  data.table::fifelse(v %in% c("a", "b", "c"), v, factor(NA)) 7.087920 7.130255  7.752777 7.151637  7.294678 12.50852   100
#               ifelse(v %in% c("a", "b", "c"), v, factor(NA)) 7.214926 7.250846  7.852812 7.294037  7.517257 11.98126   100
#       dplyr::if_else(v %in% c("a", "b", "c"), v, factor(NA)) 9.196122 9.269673 13.034460 9.358833 12.553208 88.21562   100

# hutils::if_else excluded as Error
microbenchmark::microbenchmark(
  data.table::fifelse(x, list(1),list(2)),
  ifelse(x, list(1),list(2)),
  dplyr::if_else(x, list(1),list(2)),
  times = 100L
)
# Unit: milliseconds
#                                      expr      min       lq     mean   median       uq      max neval
#  data.table::fifelse(x, list(1), list(2)) 2.632473 2.724840 4.616436 2.868095 3.658351 40.04899   100
#               ifelse(x, list(1), list(2)) 6.379341 6.474061 7.311804 6.523452 7.431518 12.85960   100
#       dplyr::if_else(x, list(1), list(2)) 5.710961 5.763131 7.263085 5.784512 6.538633 28.54668   100

# hutils::if_else excluded as Error
microbenchmark::microbenchmark(
  data.table::fifelse(x, list(FALSE),list(rep(TRUE,len))),
  ifelse(x, list(FALSE),list(rep(TRUE,len))),
  dplyr::if_else(x, list(FALSE),list(rep(TRUE,len))),
  times = 100L
)
# Unit: milliseconds
#                                                       expr      min       lq     mean   median       uq       max neval
#  data.table::fifelse(x, list(FALSE), list(rep(TRUE, len))) 1.826824 1.853978 2.149058 1.865310 1.909784  7.311997   100
#               ifelse(x, list(FALSE), list(rep(TRUE, len))) 6.612826 6.678039 7.626645 6.700062 7.266668 18.867336   100
#       dplyr::if_else(x, list(FALSE), list(rep(TRUE, len))) 5.928195 5.988063 7.035108 6.031895 7.762929 15.979142   100

df <- do.call(rbind, replicate(2000, mtcars, simplify=FALSE))

microbenchmark::microbenchmark(
  data.table::fifelse(df$mpg < 19.5, TRUE, FALSE),
  ifelse(df$mpg < 19.5, TRUE, FALSE),
  dplyr::if_else(df$mpg < 19.5, TRUE, FALSE),
  hutils::if_else(df$mpg < 19.5, TRUE, FALSE),
  times = 100L
)
# Unit: microseconds
#                                             expr      min        lq      mean    median        uq       max neval
#  data.table::fifelse(df$mpg < 19.5, TRUE, FALSE)  506.311  526.6235  933.7458  555.2740  685.7005 16243.417   100
#               ifelse(df$mpg < 19.5, TRUE, FALSE) 1639.523 1667.3180 2232.3723 1682.7135 2004.9305 19018.289   100
#       dplyr::if_else(df$mpg < 19.5, TRUE, FALSE) 2008.993 2035.9335 3488.2319 2069.7155 2824.0500 81675.495   100
#      hutils::if_else(df$mpg < 19.5, TRUE, FALSE)  811.210  826.1760  995.5208  854.1855  892.6725  6705.621   100

microbenchmark::microbenchmark(
  data.table::fifelse(df$mpg < 19.5, df$cyl, df$drat),
  ifelse(df$mpg < 19.5, df$cyl, df$drat),
  dplyr::if_else(df$mpg < 19.5, df$cyl, df$drat),
  hutils::if_else(df$mpg < 19.5, df$cyl, df$drat),
  times = 100L
)
# Unit: microseconds
#                                                 expr      min       lq      mean   median        uq       max neval
#  data.table::fifelse(df$mpg < 19.5, df$cyl, df$drat)  569.172  674.796  893.9038  699.385  750.2725  5530.930   100
#               ifelse(df$mpg < 19.5, df$cyl, df$drat) 1976.921 2024.815 2736.2667 2041.279 2115.4720 16723.641   100
#       dplyr::if_else(df$mpg < 19.5, df$cyl, df$drat) 2779.577 2841.155 4848.6506 2896.533 4860.1970 76190.750   100
#      hutils::if_else(df$mpg < 19.5, df$cyl, df$drat) 1078.904 1146.041 1300.5216 1183.887 1206.3365  5951.715   100

@jangorecki
Copy link
Member

Looks interesting.
Maybe we could generalise that into fcase, then fifelse is just a special case of fcase, same internals could handle both cases.
@2005m when opening PR, ideally do from a branch in Rdatatable org, so we all can easily push into it.

@MichaelChirico
Copy link
Member

@jangorecki good point, was also thinking of how an API for CASE WHEN might look like in data.table, as IF in SQL is AFAIK usually just implemented as a wrapper around CASE.

I guess @mattdowle would have to grant branching access for the branch to be done within Rdatatable.

@g3o2
Copy link

g3o2 commented Jul 12, 2019

@jangorecki good point, was also thinking of how an API for CASE WHEN might look like in data.table, as IF in SQL is AFAIK usually just implemented as a wrapper around CASE

Currently, the data.table way is chaining [i, j with :=] for each case. This is actually a pretty good user-friendly interface, though I do not know about the potential impacts on performance.

@mattdowle mattdowle added this to the 1.12.4 milestone Jul 30, 2019
mattdowle pushed a commit that referenced this issue Aug 2, 2019
@mattdowle
Copy link
Member

Closed by #3678 (forgot to use 'closes' prefix in PR comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants