
[R-Forge #2187] Add/document rolling mean, median etc.. combined with i #624

Closed
4 tasks
arunsrinivasan opened this issue Jun 8, 2014 · 20 comments

@arunsrinivasan
Member

Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link

As highlighted in these:

http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c
http://stackoverflow.com/questions/11676476/r-data-table-sliding-window

Functions to implement:

  • rollmean
  • rollmedian
  • rollmin
  • rollmax
    ...
@jangorecki
Member

Can we have it included in the 1.9.4 milestone?

Also, this link could be included as a reference for the FR: http://stackoverflow.com/questions/11676476/r-data-table-sliding-window

@arunsrinivasan arunsrinivasan added this to the v1.9.8 milestone Oct 28, 2014
@arunsrinivasan arunsrinivasan changed the title [R-Forge #2187] Add/document rolling mean and median combined with i [R-Forge #2187] Add/document rolling mean, median etc.. combined with i Oct 28, 2014
@jangorecki
Member

Couldn't that be done as a generic "rolling function" instead? Is it a matter of behind-the-scenes optimization (for mean and median)? If not, then I believe documentation for a custom user-defined function would be much better, because it could easily be applied to mean or median.

@arunsrinivasan
Member Author

Do you mean functions with the name rollfun and then fun = "mean" etc., instead of rollmean, rollmed etc.? If so, I really haven't thought about it, but I'd prefer the rollfun name with options inside (consistent with keeping related operations together in data.table).

@jangorecki
Member

Yes, that's exactly what I mean. Currently the most common name for such a function would be rollapply from the zoo package. There are also the well-established lapply/mapply idioms.
Here is a benchmark which includes all of the current approaches: performance-of-rolling-window-functions-in-r.

@eantonya
Contributor

Specifically for rolling mean/median/sd/min/max, I think you need to beat or at least match caTools for this to be worth it.

@jangorecki arbitrary functions are going to be a lot slower than some specific ones: http://stackoverflow.com/q/18113323/817778

@arunsrinivasan
Member Author

@eantonya good to know. If that package already has optimised/fast functions, then maybe this needn't be a top priority. We could try to implement it as efficiently as possible when there's more time, and then compare and export if necessary...

@jangorecki
Member

@eantonya The caTools function runmean(x, k, ...) does not accept k as a vector of per-observation window widths. This is a hard limitation.
@arunsrinivasan If we rely on external rolling functions for now, we could document how to use them most efficiently with data.table. The SO benchmarks could also be a good place.

@eantonya
Contributor

@jangorecki do you mean you want a separate k for each group (in which case this is very easy to do with runmean), or do you mean something else?

@arunsrinivasan
Member Author

> @arunsrinivasan If we rely on external rolling functions for now, we could document how to use them most efficiently with data.table. The SO benchmarks could also be a good place.

Yes that makes sense. But if you could clarify your earlier point about “does not accept k as vector” (maybe with an example), it’d help prioritise.

@jangorecki
Member

On the common zoo usage:

library(zoo)
x <- runif(1000, 0.5, 1.5)
width <- rep(seq(from = 5, to = 20, by = 5), length(x)/4)
str(width)
# simple moving average with a per-observation window width
SMAvar <- rollapplyr(data = x, width = width, FUN = mean, fill = NA)

In the SO post mentioned above (performance-of-rolling-window-functions-in-r) there are four approaches to achieve this; one is a simple for loop, which can also be helpful.

@arunsrinivasan
Member Author

@jangorecki 👍 thanks!

@jangorecki
Member

Leaving another example for reference, this time a rolling sum by group for an irregular time series: http://stackoverflow.com/a/27983553/2490497
Made possible thanks to shift.
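To make the shift idea concrete, here is a minimal sketch (not taken from the linked answer) of a fixed-width rolling sum by group using data.table's shift(); the column names, the data, and the window width k are illustrative only:

```r
library(data.table)

# Hedged sketch: rolling sum over a window of k observations, per group.
# shift(value, 0:(k-1)) returns a list of the current and lagged vectors;
# Reduce(`+`, ...) adds them element-wise, giving the windowed sum.
dt <- data.table(grp = rep(c("a", "b"), each = 5L), value = 1:10)
k <- 3L
dt[, roll_sum := Reduce(`+`, shift(value, 0L:(k - 1L))), by = grp]
# the first k-1 rows of each group are NA because the window is incomplete
```

Because shift() pads with NA, incomplete windows propagate NA automatically, matching the fill = NA behaviour of zoo::rollapplyr.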

@jangorecki
Member

I've made a data.table solution for the mentioned SO question, "rolling window with variable window width". It's still something of a workaround.

dtfmapply <- function(x, width, FUN = NULL, ...) {
    FUN <- match.fun(FUN)
    dt <- list(x = x, width = width)
    setDT(dt)[, end := .I]            # row index = window end
    dt2 <- copy(dt)
    dt[, start := end - width + 1L]   # per-row window start
    dt2[, start := end]               # each observation as a point interval
    setkey(dt, start, end)
    setcolorder(dt2, c("start", "end", "x", "width"))
    # match each observation to the windows containing it, then apply FUN per window
    foverlaps(dt2, dt, by.x = 1:2, type = "within")[
        , .(adaptive_x = FUN(i.x, ...)), keyby = .(end, start)
    ][start < 1L, adaptive_x := NA_real_][, adaptive_x]
}

library(microbenchmark)
x <- runif(10000,0.5,1.5)
width <- rep(seq(from = 50, to = 200, by = 50), length(x)/4)
microbenchmark(
    rollapplyr(x, width, mean, fill = NA),
    dtfmapply(x = x, width = width, FUN = mean),
    times=10L
)
# Unit: milliseconds
#                                         expr      min       lq     mean   median       uq      max neval
#        rollapplyr(x, width, mean, fill = NA) 806.7374 857.4495 909.7183 905.1389 972.7963 1011.892    10
#  dtfmapply(x = x, width = width, FUN = mean) 740.6762 764.8223 809.4825 784.9917 809.2053 1071.053    10

Still not as fast as the wmapply mentioned in the SO post.

@andrewuhl

@arunsrinivasan, @jangorecki, @eantonya: I've implemented a collection of single-pass rolling window functions: https://github.com/andrewuhl/RollingWindow

The package is written in C++ using Rcpp and is fast: at least as fast as caTools (using microbenchmark) for non-trivial cases. I'm going to submit it to CRAN in the coming week, but it's available for testing now.

@jangorecki
Member

@andrewuhl thanks for letting us know. For simple cases, like a constant window size, we can now also use shift + rowMeans, which should beat the timings mentioned above. It would be great if you could add your package's timings to the SO post.
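For reference, a minimal sketch of the shift + rowMeans idea for a constant window size k (the data and k are illustrative): shift(x, 0:(k-1)) produces the lagged copies, cbind-ing them into a matrix lets rowMeans compute the rolling mean in vectorised code.

```r
library(data.table)

# Hedged sketch: constant-width rolling mean via shift + rowMeans.
x <- as.numeric(1:10)
k <- 3L
m <- do.call(cbind, shift(x, 0L:(k - 1L)))  # one column per lag
roll_mean <- rowMeans(m)                    # NA for the first k-1 positions
```

Note this allocates an n-by-k matrix, which is the allocation cost @arunsrinivasan raises below; it is convenient rather than memory-optimal.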

@arunsrinivasan
Member Author

@andrewuhl great work! Congrats! The reason we want to implement these functions within data.table is to improve performance by avoiding function calls for each group.

For example:

require(data.table)
set.seed(1L)
dt = data.table(x = sample(1e5, 1e7, TRUE), y = runif(1e7), z=runif(1e7))

# 1. internally optimised to avoid evaluating j for each group
system.time(dt[, lapply(.SD, median), by=x, verbose=TRUE])
# Finding groups (bysameorder=FALSE) ... done in 0.236secs. bysameorder=FALSE and o__ is length 10000000
# lapply optimization changed j from 'lapply(.SD, median)' to 'list(median(y), median(z))'
# GForce optimized j to 'list(gmedian(y), gmedian(z))'
#    user  system elapsed 
#   0.897   0.047   0.964 

# 2. try it by breaking that (GForce) optimisation
system.time(dt[, lapply(.SD, stats::median), by=x, verbose=TRUE])
# Finding groups (bysameorder=FALSE) ... done in 0.25secs. bysameorder=FALSE and o__ is length 10000000
# lapply optimization changed j from 'lapply(.SD, stats::median)' to 'list(stats::median(y), stats::median(z))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Starting dogroups ... 
#   collecting discontiguous groups took 0.601s for 100000 groups
#   eval(j) took 7.482s for 100000 calls
# done dogroups in 8.456 secs
#    user  system elapsed 
#   8.052   0.486   8.708 

We can get things quite performant by optimising these functions (for memory and speed). In that sense, while your package has a lot to offer in terms of functionality (congrats once again!), we could squeeze out a lot more in terms of speed/memory efficiency. What do you think?

We'd like it to be implemented with R's C API. There are already quite a few functions optimised with GForce. I'll be working shortly on expanding it to other functions, once GForce is extended to handle :=. We'd be happy to make use of your expertise, if you'd be interested.

@jangorecki I've no idea why you think shift + rowMeans would be more performant... so why :-)? I don't think so, because rowMeans needs a matrix and shift() will allocate unnecessarily many times.

@jangorecki
Member

@arunsrinivasan more performant than solutions like rollapply, but not than C++ or C.

@andrewuhl

The single-pass functions I implemented run in O(n) unless otherwise noted, so in terms of big-O complexity they're efficient (including rolling min/max and median, from the papers I read). Since they're written in C++, I'd be surprised if a pure C solution would run any faster, assuming optimizations are enabled on the C/C++ compilers. There are probably some tweaks to shave off a few additions/multiplications, and I haven't done anything directly to improve cache locality (although it should be reasonably good given that the code is single pass).

In terms of memory, things should be fine: array layouts are contiguous in memory and elements are accessed sequentially for the cache, subject to my notes in the source code. I don't think I'm making unnecessary data copies, but it's possible.

Side note: the rolling kurtosis function isn't backwards/forwards stable numerically for data sets with extremely small variance, but I'll tweak the algorithm to fix those round-off issues in the coming days. There is some other low-hanging fruit as well; overall the package is probably near optimally fast in terms of time efficiency on a CPU. If you mean GPU execution, somebody else with that knowledge would have to do that part.

@arunsrinivasan
Member Author

There are two things here:

  1. While your package's functionality is a great addition to R, we'd want an internal implementation. I've shown one example above, i.e., we'd want to optimise computing the rolling stats while running over many groups; evaluating functions per group becomes costly.
  2. The internal implementation needs to be in C (which has nothing to do with C++ or its performance) because most of our package is in C already and we're quite comfortable with it, therefore we can maintain it. It also allows data.table to keep supporting much older versions of R.

@arunsrinivasan arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Mar 20, 2016
@jangorecki jangorecki self-assigned this Apr 1, 2018
@jangorecki jangorecki modified the milestones: Candidate, v1.11.2 Apr 21, 2018
@jangorecki
Member

Closing as duplicate of #2778.

@jangorecki jangorecki removed this from the v1.11.2 milestone Apr 21, 2018