
[R-Forge #2187] Add/document rolling mean, median etc.. combined with i #624

Closed
4 tasks
arunsrinivasan opened this issue Jun 8, 2014 · 20 comments

@arunsrinivasan
Member

Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link

As highlighted in these:

http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c
http://stackoverflow.com/questions/11676476/r-data-table-sliding-window

Functions to implement:

  • rollmean
  • rollmedian
  • rollmin
  • rollmax
    ...
@jangorecki
Member

Can we have it included in the 1.9.4 milestone?

Also, this link could be included as a reference for the FR: http://stackoverflow.com/questions/11676476/r-data-table-sliding-window

@arunsrinivasan arunsrinivasan added this to the v1.9.8 milestone Oct 28, 2014
@arunsrinivasan arunsrinivasan changed the title [R-Forge #2187] Add/document rolling mean and median combined with i [R-Forge #2187] Add/document rolling mean, median etc.. combined with i Oct 28, 2014
@jangorecki
Member

Couldn't that be done as a generic "rolling function" instead? Is it a matter of behind-the-scenes optimization (for mean and median)? If not, then I believe documentation for a custom user-defined function would be much better, because it could easily be applied to mean or median.

@arunsrinivasan
Member Author

Do you mean functions with the name rollfun and then fun = "mean" etc., instead of rollmean, rollmed etc.? If so, I really haven't thought about it, but I'd prefer the rollfun name with options inside (consistent with keeping related operations together in data.table).

@jangorecki
Member

Yes, that's exactly what I mean. Currently the most common name for such a function would be rollapply from the zoo package. There are also the well-established lapply/mapply idioms.
Here is a benchmark which includes all of the current approaches: performance-of-rolling-window-functions-in-r.

@eantonya
Contributor

Specifically for rolling mean/median/sd/min/max, I think you need to beat or at least match caTools for this to be worth it.

@jangorecki arbitrary functions are going to be a lot slower than some specific ones: http://stackoverflow.com/q/18113323/817778

@arunsrinivasan
Member Author

@eantonya good to know. If that package already has optimised/fast functions, then maybe this needn't be a top priority. We could try to implement it as efficiently as possible when there's more time, and then compare and export if necessary...

@jangorecki
Member

@eantonya The caTools function runmean(x, k, ...) does not accept k as a vector of per-observation window widths. This is a hard limitation.
@arunsrinivasan If we rely on external rolling functions for now, we could document how to use them most efficiently with data.table. The SO benchmarks could also be a good place.

@eantonya
Contributor

@jangorecki do you mean you want a separate k for each group (in which case this is very easy to do with runmean), or do you mean something else?

@arunsrinivasan
Member Author

> @arunsrinivasan If we rely on external rolling functions for now, we could document how to use them most efficiently with data.table. The SO benchmarks could also be a good place.

Yes that makes sense. But if you could clarify your earlier point about “does not accept k as vector” (maybe with an example), it’d help prioritise.

@jangorecki
Member

On the common zoo usage:

library(zoo)
x <- runif(1000, 0.5, 1.5)
width <- rep(seq(from = 5, to = 20, by = 5), length(x)/4)
str(width)
# simple moving average with a per-observation window width
SMAvar <- rollapplyr(data = x, width = width, FUN = mean, fill = NA)

In the SO post mentioned above (performance-of-rolling-window-functions-in-r) there are four approaches to achieve this; one is a simple for loop, which can also be helpful.

@arunsrinivasan
Member Author

@jangorecki 👍 thanks!

@jangorecki
Member

Leaving another example for reference, this time a rolling sum by group for an irregular time series: http://stackoverflow.com/a/27983553/2490497
Made possible thanks to shift.
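To make the shift idea concrete, here is a minimal sketch (not taken from the linked answer) of a fixed-width rolling sum by group using data.table's shift(); the column names, the data, and the window width k are illustrative only:

```r
library(data.table)

# Hedged sketch: rolling sum over a window of k observations, per group.
# shift(value, 0:(k-1)) returns a list of the current and lagged vectors;
# Reduce(`+`, ...) adds them element-wise, giving the windowed sum.
dt <- data.table(grp = rep(c("a", "b"), each = 5L), value = 1:10)
k <- 3L
dt[, roll_sum := Reduce(`+`, shift(value, 0L:(k - 1L))), by = grp]
# the first k-1 rows of each group are NA because the window is incomplete
```

Because shift() pads with NA, incomplete windows propagate NA automatically, matching the fill = NA behaviour of zoo::rollapplyr.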

@jangorecki
Member

I've made a data.table solution for the mentioned SO question, "rolling window with variable window width". It's still something of a workaround.

dtfmapply <- function(x, width, FUN = NULL, ...) {
    FUN <- match.fun(FUN)
    dt <- list(x = x, width = width)
    setDT(dt)[, end := .I]            # row index = window end
    dt2 <- copy(dt)
    dt[, start := end - width + 1L]   # per-row window start
    dt2[, start := end]               # each observation as a point interval
    setkey(dt, start, end)
    setcolorder(dt2, c("start", "end", "x", "width"))
    # match each observation to the windows containing it, then apply FUN per window
    foverlaps(dt2, dt, by.x = 1:2, type = "within")[
        , .(adaptive_x = FUN(i.x, ...)), keyby = .(end, start)
    ][start < 1L, adaptive_x := NA_real_][, adaptive_x]
}

library(microbenchmark)
x <- runif(10000,0.5,1.5)
width <- rep(seq(from = 50, to = 200, by = 50), length(x)/4)
microbenchmark(
    rollapplyr(x, width, mean, fill = NA),
    dtfmapply(x = x, width = width, FUN = mean),
    times=10L
)
# Unit: milliseconds
#                                         expr      min       lq     mean   median       uq      max neval
#        rollapplyr(x, width, mean, fill = NA) 806.7374 857.4495 909.7183 905.1389 972.7963 1011.892    10
#  dtfmapply(x = x, width = width, FUN = mean) 740.6762 764.8223 809.4825 784.9917 809.2053 1071.053    10

Still not as fast as the wmapply mentioned in the SO post.

@andrewuhl

@arunsrinivasan, @jangorecki, @eantonya: I've implemented a collection of single-pass rolling window functions: https://github.com/andrewuhl/RollingWindow

The package is written in C++ using Rcpp and is fast: at least as fast as caTools (using microbenchmark) for non-trivial cases. I'm going to submit it to CRAN in the coming week, but it's available for testing now.

@jangorecki
Member

@andrewuhl thanks for letting us know. For simple cases, like a constant window size, we can now also use shift + rowMeans, which should beat the timings mentioned above. It would be great if you could add your package's timings to the SO post.
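For reference, a minimal sketch of the shift + rowMeans idea for a constant window size k (the data and k are illustrative): shift(x, 0:(k-1)) produces the lagged copies, cbind-ing them into a matrix lets rowMeans compute the rolling mean in vectorised code.

```r
library(data.table)

# Hedged sketch: constant-width rolling mean via shift + rowMeans.
x <- as.numeric(1:10)
k <- 3L
m <- do.call(cbind, shift(x, 0L:(k - 1L)))  # one column per lag
roll_mean <- rowMeans(m)                    # NA for the first k-1 positions
```

Note this allocates an n-by-k matrix, which is the allocation cost @arunsrinivasan raises below; it is convenient rather than memory-optimal.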

@arunsrinivasan
Member Author

@andrewuhl great work! Congrats! The reason we want to implement these functions within data.table is to improve performance by avoiding function calls for each group.

For example:

require(data.table)
set.seed(1L)
dt = data.table(x = sample(1e5, 1e7, TRUE), y = runif(1e7), z=runif(1e7))

# 1. internally optimised to avoid evaluating j for each group
system.time(dt[, lapply(.SD, median), by=x, verbose=TRUE])
# Finding groups (bysameorder=FALSE) ... done in 0.236secs. bysameorder=FALSE and o__ is length 10000000
# lapply optimization changed j from 'lapply(.SD, median)' to 'list(median(y), median(z))'
# GForce optimized j to 'list(gmedian(y), gmedian(z))'
#    user  system elapsed 
#   0.897   0.047   0.964 

# 2. try it by breaking that (GForce) optimisation
system.time(dt[, lapply(.SD, stats::median), by=x, verbose=TRUE])
# Finding groups (bysameorder=FALSE) ... done in 0.25secs. bysameorder=FALSE and o__ is length 10000000
# lapply optimization changed j from 'lapply(.SD, stats::median)' to 'list(stats::median(y), stats::median(z))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Starting dogroups ... 
#   collecting discontiguous groups took 0.601s for 100000 groups
#   eval(j) took 7.482s for 100000 calls
# done dogroups in 8.456 secs
#    user  system elapsed 
#   8.052   0.486   8.708 

We can get things quite performant by optimising these functions (for memory and speed). In that sense, while your package has a lot to offer in terms of functionality (congrats once again!), we could squeeze out a lot more in terms of speed/memory efficiency. What do you think?

We'd like it to be implemented with R's C API. There are already quite a few functions optimised with GForce. I'll be working shortly on expanding it to other functions, once GForce is extended to handle :=. We'd be happy to make use of your expertise, if you'd be interested.

@jangorecki I've no idea why you think shift + rowMeans would be more performant... so why :-)? I don't think so, because rowMeans needs a matrix and shift() will allocate unnecessarily many times.

@jangorecki
Member

@arunsrinivasan more performant than solutions like rollapply, but not than C++ or C.

@andrewuhl

The single-pass functions I implemented run in O(n) unless otherwise noted, so in terms of big-O complexity they're efficient (including rolling min/max and median, from the papers I read). Since they're written in C++, I'd be surprised if a pure C solution would run any faster, assuming optimizations are enabled on the C/C++ compilers. There are probably some tweaks to shave off a few additions/multiplications, and I haven't done anything directly to improve cache locality (although it should be reasonably good given that the code is single pass).

In terms of memory, things should be fine: array layouts are contiguous in memory and elements are accessed sequentially for the cache, subject to my notes in the source code. I don't think I'm making unnecessary data copies, but it's possible.

Side note: the rolling kurtosis function isn't backwards/forwards stable numerically for data sets with extremely small variance, but I'll tweak the algorithm to fix those round-off issues in the coming days. There is some other low-hanging fruit as well; overall the package is probably near optimally fast in terms of time efficiency on a CPU. If you mean GPU execution, somebody else with that knowledge would have to do that part.

@arunsrinivasan
Member Author

There are two things here:

  1. While your package's functionality is a great addition to R, we'd want an internal implementation. I've shown one example above, i.e., we'd want to optimise computing the rolling stats while running over many groups; evaluating functions per group becomes costly.
  2. The internal implementation needs to be in C (which has nothing to do with C++ or its performance) because most of our package is in C already and we're quite comfortable with it, therefore we can maintain it. It also allows data.table to keep supporting much older versions of R.

@arunsrinivasan arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Mar 20, 2016
@jangorecki jangorecki self-assigned this Apr 1, 2018
@jangorecki jangorecki modified the milestones: Candidate, v1.11.2 Apr 21, 2018
@jangorecki
Member

Closing as duplicate of #2778.

@jangorecki jangorecki removed this from the v1.11.2 milestone Apr 21, 2018