Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

warn or optimize usage of ifelse() #2677

Closed
mattdowle opened this issue Mar 15, 2018 · 9 comments
Closed

warn or optimize usage of ifelse() #2677

mattdowle opened this issue Mar 15, 2018 · 9 comments
Milestone

Comments

@mattdowle
Copy link
Member

mattdowle commented Mar 15, 2018

Not sure the status of ifelse() speed in R these days. Years ago it used to be really slow and I avoided it. Is that still the case?
I've seen this idiom quite a bit in the wild :

DT[, newCol := ifelse(existingCol == 42,  1L, 2L)]

under verbose=TRUE there's a message about the plonk. It's slow and RAM inefficient I assume (haven't tested).

Better would be using i, as per #2676 :

DT[existingCol==42, newCol:=1L, fill=2L]

Nested ifelse(ifelse(ifelse())) also quite common. Either detect and warn about ifelse() usage and point to a document about how to avoid, or optimize it automatically. And if ifelse() in base is still slow, see why and try to improve it there.

@MichaelChirico
Copy link
Member

Just re-ran Ricardo's benchmark here and yes, it's still slow:

https://stackoverflow.com/questions/16275149/does-ifelse-really-calculate-both-of-its-vectors-every-time-is-it-slow

Isn't this just another use case for implementing more flexible nomatch (#857)?

The multiple-ifelse case should probably correspond to a join, IMO. Not sure if there's a concise way we can kludge that into existing [ code intuitively.

@HughParsonage
Copy link
Member

HughParsonage commented Mar 16, 2018

I wrote an alternative to dplyr::if_else (itself quite a bit faster than ifelse) that might be a more stringent benchmark.

library(microbenchmark)
library(magrittr)
cnd <- sample(c(TRUE, FALSE), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 100e3, replace = TRUE)
no <- sample(letters, size = 100e3, replace = TRUE)
na <- sample(letters, size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::if_else(cnd, yes, no, na), base = ifelse(cnd, yes, no),
               hutils = hutils::if_else(cnd, yes, no, na),
               check = my_check) %T>%
    print %>%
    autoplot

image

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval cld
  dplyr  6.202732  6.931278  7.584452  7.541011  8.322263 10.836334   100  b 
   base 27.949490 28.777124 29.843440 29.234310 30.110884 47.372671   100   c
 hutils  1.816998  2.041676  2.192027  2.135944  2.211239  5.622667   100 a  

@mattdowle
Copy link
Member Author

mattdowle commented Mar 16, 2018

Hugh - that's fantastic! Would you consider proposing your hutils::ifelse() as a patch to base R?

@HughParsonage
Copy link
Member

Really? Haha I only wrote it because I liked dplyr::if_else safety but dplyr's compile times for it were too long to justify importing the whole package. I actually never expected it to be faster.

There are a few limitations -- namely I'm not sure how it works on more exotic classes like if_else(matrix, list, list) for example. If you think it's worth pursuing, I'll definitely offer it as a patch.

@mattdowle
Copy link
Member Author

mattdowle commented Mar 17, 2018

It's really really worth pursuing -- yes. R is kinda famous for its slow ifelse() and desperately needs to be fixed. But yes indeed -- to have any chance of being accepted it would have to be a complete drop-in replacement. Which is the reason fread and fwrite haven't been proposed, for example. And in data.table of course, the reason is we wanted [.data.frame to work differently to be not-the-same as base R. But ifelse() speed is more like sorting ... it should just be in base. And how many of the 10,000 packages use base::ifelse(). Probably quite a lot. Any automatic improvement there with no code changes needed by package maintainers is sure to be welcomed as it could speed up CRAN testing.

Scale up the data size though so that the times are many seconds or minutes. As you know, I tend to dismiss any chart that is in milliseconds scale.

One of the R manuals explains how to build R and run its tests. If you've done that already in your very first communication, it has a much better chance.

@HughParsonage
Copy link
Member

Longer version with a 2 billion-length cnd etc.

> system.time(hutils::if_else(cnd, yes, no))
   user  system elapsed 
  55.26    4.19   59.75 
> system.time(hutils::if_else(cnd, yes, no))
   user  system elapsed 
  55.47    4.07   59.55 
> system.time(dplyr::if_else(cnd, yes, no))
   user  system elapsed 
 105.97   52.11  159.17 
> system.time(dplyr::if_else(cnd, yes, no))
   user  system elapsed 
 109.05   59.56  168.74 
> system.time(ifelse(cnd, yes, no))
   user  system elapsed 
 467.92   51.15  520.66 

While I never intended it to be faster when I wrote it, I'm confident hutils::if_else is always faster than dplyr::if_else and include that assertion as a unit test, only skipping on CRAN because it takes a while to run (https://github.com/HughParsonage/hutils/blob/15e62bdf235dc6084f1006824a20a42fab12ce9c/tests/testthat/test_1-0-0_if_else.R#L149).

Similar with hutils::coalesce v dplyr::coalesce. (Again, nothing wrong or even slow with the original -- just needed it in situations where compile time was important).

@HughParsonage
Copy link
Member

HughParsonage commented Mar 17, 2018

Regarding the inclusion in base R, it seems a discussion on R-dev crops up every so often (e.g. ifelse() woes ... can we agree on a ifelse2() ? ), but good points are raised. Hadley's point that even if we didn't care about backwards-compatibility, the "correct" result of ifelse(c(TRUE, FALSE), factor(c("a", "b")), factor(c("c", "b")) is unclear. That gives me a bit of pause about implementation, but my current thought is to use hutils::if_else as the superstructure but instead of raising an error when there is some mismatch, falling back to base::ifelse, thus preserving backwards-compatibility but getting performance and a missing argument.

@jangorecki
Copy link
Member

We could optimize ifelse into fifelse
To reduce a risk we could "optimize" it to call both functions, compare output, if differs then ask user to report. Then after some time we could reduce extra overhead of 2 calls and leave only fifelse.

@MichaelChirico
Copy link
Member

Corresponding PR #3800 was closed as waiting for some strong user request for this. Closing here now too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants