[R-Forge #2187] Add/document rolling mean, median etc.. combined with i #624
Comments
Can we have it included in the 1.9.4 milestone? Also, this link could be included as a reference for the FR: http://stackoverflow.com/questions/11676476/r-data-table-sliding-window
Couldn't that be done for a generic "rolling function" instead? Is it a matter of optimization (for mean and median) behind the scenes? If not, then I believe documenting a custom user-defined function would be much better, because it could easily be applied to mean or median.
Do you refer to functions having the name
Yes, exactly, this is what I mean. Currently the most common name for such a function would be
Specifically for rolling mean/median/sd/min/max, I think you need to beat, or at least match, caTools for this to be worth it. @jangorecki arbitrary functions are going to be a lot slower than specific ones: http://stackoverflow.com/q/18113323/817778
@eantonya good to know. If that package already has optimised/fast functions, then maybe this needn't be a top priority. We could try to implement it as efficiently as possible when there's more time, and then compare and export if necessary...
@eantonya
@jangorecki do you mean you want to have a separate
Yes, that makes sense. But if you could clarify your earlier point about "does not accept k as vector" (maybe with an example), it'd help prioritise.
On the common zoo approach:

```r
library(zoo)
x <- runif(1000, 0.5, 1.5)
width <- rep(seq(from = 5, to = 20, by = 5), length(x)/4)
str(width)
# SMA period weighted by custom width
SMAvar <- rollapplyr(data = x, width = width, FUN = mean, fill = NA)
```

In the mentioned SO question (performance-of-rolling-window-functions-in-r) there are four approaches to achieve that; one is a simple
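The `rollapplyr` call above evaluates `mean` once per element, so the cost grows with both the series length and the window widths. A minimal base-R sketch of the same adaptive-window mean, for illustration only (the function name `adaptive_mean` is my own, not from the thread):

```r
# Adaptive rolling mean computed window-by-window: for each position i,
# average the last width[i] values; windows that reach before the start
# of the series yield NA, matching fill = NA in rollapplyr above.
adaptive_mean <- function(x, width) {
  n <- length(x)
  sapply(seq_len(n), function(i) {
    start <- i - width[i] + 1L
    if (start < 1L) NA_real_ else mean(x[start:i])
  })
}

x <- c(1, 2, 3, 4, 5, 6)
width <- c(1L, 2L, 2L, 3L, 3L, 2L)
adaptive_mean(x, width)
# [1] 1.0 1.5 2.5 3.0 4.0 5.5
```

This is the straightforward O(n*k) formulation; the optimised approaches discussed below avoid re-touching each window element.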
@jangorecki 👍 thanks!
Leaving another example here for reference. This time a rolling sum by group for an irregular time series: http://stackoverflow.com/a/27983553/2490497
I've made a data.table solution for the mentioned SO question, "rolling window with variable window width". It is still a kind of workaround:

```r
library(data.table)

dtfmapply <- function(x, width, FUN = NULL, ...) {
  FUN <- match.fun(FUN)
  dt <- list(x = x, width = width)
  setDT(dt)[, end := .I]
  dt2 <- copy(dt)
  dt[, start := end - width + 1L]
  dt2[, start := end]
  setkey(dt, start, end)
  setcolorder(dt2, c("start", "end", "x", "width"))
  foverlaps(dt2, dt, by.x = 1:2, type = "within")[
    , .(adaptive_x = FUN(i.x, ...)), keyby = .(end, start)
  ][start < 1L, adaptive_x := NA_real_][, adaptive_x]
}

library(zoo)
library(microbenchmark)
x <- runif(10000, 0.5, 1.5)
width <- rep(seq(from = 50, to = 200, by = 50), length(x)/4)
microbenchmark(
  rollapplyr(x, width, mean, fill = NA),
  dtfmapply(x = x, width = width, FUN = mean),
  times = 10L
)
# Unit: milliseconds
#                                         expr      min       lq     mean   median       uq      max neval
#        rollapplyr(x, width, mean, fill = NA) 806.7374 857.4495 909.7183 905.1389 972.7963 1011.892    10
#  dtfmapply(x = x, width = width, FUN = mean) 740.6762 764.8223 809.4825 784.9917 809.2053 1071.053    10
```

Still not as fast as the mentioned
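Both timings above pay for one function call per window. For `mean` specifically, that can be avoided entirely: a cumulative sum lets each adaptive window be evaluated in O(1), for O(n) total. A hedged sketch of that idea (the function name is mine, not part of data.table or this thread):

```r
# Adaptive rolling mean via cumulative sum: with cs[i + 1] = sum(x[1:i]),
# the window sum x[start:i] is cs[i + 1] - cs[start], so no window is
# ever re-scanned. Windows reaching before the series start become NA.
adaptive_mean_cumsum <- function(x, width) {
  cs <- cumsum(c(0, x))
  i <- seq_along(x)
  start <- i - width + 1L
  out <- (cs[i + 1L] - cs[pmax(start, 1L)]) / width
  out[start < 1L] <- NA_real_
  out
}

adaptive_mean_cumsum(c(1, 2, 3, 4, 5, 6), c(1L, 2L, 2L, 3L, 3L, 2L))
# [1] 1.0 1.5 2.5 3.0 4.0 5.5
```

This trades a little floating-point accuracy (the running sum can accumulate round-off on long series) for a large constant-factor and asymptotic win, which is essentially what the specialised C/C++ implementations discussed below exploit.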
@arunsrinivasan, @jangorecki, @eantonya: I've implemented a collection of single-pass rolling window functions: https://github.com/andrewuhl/RollingWindow. The package is written in C++ using Rcpp and is fast: at least as fast as caTools according to microbenchmark for non-trivial cases. I'm going to submit it to CRAN in the coming week, but it's already available for testing.
@andrewuhl thanks for letting us know. For simple cases, like a constant window size, we can now also use
@andrewuhl great work! Congrats! The reason we want to implement these functions within data.table is to improve performance by avoiding function calls for each group. For example:

```r
require(data.table)
set.seed(1L)
dt = data.table(x = sample(1e5, 1e7, TRUE), y = runif(1e7), z = runif(1e7))

# 1. internally optimised to avoid evaluating j for each group
system.time(dt[, lapply(.SD, median), by = x, verbose = TRUE])
# Finding groups (bysameorder=FALSE) ... done in 0.236secs. bysameorder=FALSE and o__ is length 10000000
# lapply optimization changed j from 'lapply(.SD, median)' to 'list(median(y), median(z))'
# GForce optimized j to 'list(gmedian(y), gmedian(z))'
#    user  system elapsed
#   0.897   0.047   0.964

# 2. try it by breaking that (GForce) optimisation
system.time(dt[, lapply(.SD, stats::median), by = x, verbose = TRUE])
# Finding groups (bysameorder=FALSE) ... done in 0.25secs. bysameorder=FALSE and o__ is length 10000000
# lapply optimization changed j from 'lapply(.SD, stats::median)' to 'list(stats::median(y), stats::median(z))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Starting dogroups ...
#   collecting discontiguous groups took 0.601s for 100000 groups
#   eval(j) took 7.482s for 100000 calls
# done dogroups in 8.456 secs
#    user  system elapsed
#   8.052   0.486   8.708
```

We can get things quite performant by optimising these functions (for memory and speed). In that sense, while your package has a lot to offer in terms of functionality (congrats once again!), we could squeeze out a lot more in terms of speed/memory efficiency. What do you think? We'd like it to be implemented with R's C API. There are already quite a few functions optimised with GForce. I'll be working on it shortly to expand to other functions once GForce is extended to be able to handle

@jangorecki I've no idea why you think
@arunsrinivasan more performant than solutions like
The single-pass functions I implemented run in O(n) unless otherwise noted, so in terms of big-O complexity they're efficient (including rolling min/max and median, based on the papers I read). Since they're written in C++, I'd be surprised if a pure C solution would run any faster, assuming optimizations are enabled on the C/C++ compilers. There are probably some tweaks to shave off a few additions/multiplications, and I haven't done anything directly to improve cache locality (although it should already be reasonably good, given that the code is single pass).

In terms of memory, things should be good (array layouts are contiguous in memory, and elements are accessed sequentially for the cache, subject to my notes in the source code); I don't think I'm making unnecessary data copies, but it's possible.

Side note: the rolling kurtosis function isn't backwards/forwards numerically stable for data sets with extremely small variance, but I'll tweak the algorithm to fix those round-off issues in the coming days. There is some other low-hanging fruit as well, but the package is probably near optimally fast in terms of time efficiency on a CPU. If you mean GPU execution, somebody else with that knowledge would have to do that part.
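As a toy illustration of the single-pass idea discussed above (my own sketch, not code from the RollingWindow package): a fixed-window rolling mean that maintains a running sum does O(1) work per element regardless of the window size `k`. The running accumulator is also exactly where floating-point round-off can creep in, which is the kind of numerical-stability concern mentioned for the kurtosis function.

```r
# Single-pass fixed-window rolling mean: add the incoming value, subtract
# the value leaving the window, and emit sum/k once the window is full.
roll_mean_onepass <- function(x, k) {
  n <- length(x)
  out <- rep(NA_real_, n)
  s <- 0
  for (i in seq_len(n)) {
    s <- s + x[i]                 # value entering the window
    if (i > k) s <- s - x[i - k]  # value leaving the window
    if (i >= k) out[i] <- s / k
  }
  out
}

roll_mean_onepass(1:5, 2)
# [1]  NA 1.5 2.5 3.5 4.5
```

An R loop like this is of course slow; the point is the algorithm, which is what the C++/C implementations being compared here run in compiled code.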
There are two things here:
Closing as a duplicate of #2778.
Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link
As highlighted in these:
http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c
http://stackoverflow.com/questions/11676476/r-data-table-sliding-window
Functions to implement:
...