Aggregations: Add autocorrelation agg #10377
This is exciting. :) I am not familiar with the theory so please excuse me if my questions are silly.
Is there any particular reason to apply the analysis only to the last elements? Could we apply it to all histogram buckets by default instead of only the last 5?
Not sure how applicable it is in your case, but in such cases I tend to like having both randomized tests that only check invariants in the output, and static tests for assessing that the algorithm actually works. Otherwise it can quickly become a nightmare to debug failures.
No particular reason, I was mostly just thinking about performance (e.g. if you accidentally ask for an autocorrelation of 100k points). Perhaps we should default it to everything, but provide window as an option if you don't want the complete autocorrelation history? Practically speaking, ACF becomes less useful (I think) the farther back in time you go. And the higher-order lags have more approximation error that accumulates.
Ahh, this makes sense. I'll see what I can do to split the tests into those two categories.
This makes sense to me. Since you mentioned performance, this got me curious: what is the runtime complexity of this reduction, and do you know, e.g., how much time it takes in practice to process N data points?
I have no idea :D Real benchmarks are at the top of my to-do list... I'm curious to see where this breaks. Classical radix-2 FFTs have complexity of O(n log n). I'm not sure what optimizations JTransforms is using, it may be better than that. JTransforms has some benchmark results which claim an FFT on 1m datapoints takes 10ms. FFT on 23m values takes 700ms. Timings are a bit slower if you include "construction" of the FFT plan (e.g. when you instantiate the object). For non-padded ACF: two FFTs, one O(n) loop over the data to compute magnitudes, potentially an extra O(n) loop to normalize. Note the FFTs will be non-radix-2, so may be slower. For padded ACF: four FFTs, two O(n) loops for magnitudes, and potentially an extra O(n) loop to normalize. The brute-force, non-FFT ACF functions are O(n^2).
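For reference, here is a minimal Python sketch of the two approaches compared above: an FFT-based ACF and the brute-force O(n^2) version. It is not the PR's Java code and does not use JTransforms. The `fft` helper is a textbook recursive radix-2 transform, and padding to a power of two of at least 2n (so that no lag wraps around) is an assumption of this sketch. Mean removal and normalization are left out so that both functions return the same raw sums.

```python
import cmath


def fft(values, inverse=False):
    """Textbook recursive radix-2 FFT. len(values) must be a power of two.

    The inverse transform is left unscaled; the caller divides by the length.
    """
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def acf_fft(series):
    """Linear (non-circular) autocorrelation sums via the power spectrum."""
    n = len(series)
    size = 1
    while size < 2 * n:  # assumption: pad to a power of two >= 2n
        size *= 2
    padded = list(series) + [0.0] * (size - n)
    spectrum = fft(padded)
    power = [abs(c) ** 2 for c in spectrum]  # O(n) magnitude loop
    raw = fft(power, inverse=True)
    return [raw[lag].real / size for lag in range(n)]


def acf_brute_force(series):
    """Direct O(n^2) sum of products for every lag."""
    n = len(series)
    return [
        sum(series[t] * series[t + lag] for t in range(n - lag))
        for lag in range(n)
    ]
```

Both functions should agree up to floating-point error, so the brute-force version doubles as a static test oracle for the FFT path.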
@colings86 why did you close this? Was it merged?
@clintongormley it auto-closed because the feature/aggs_2_0 branch got deleted (since it's no longer needed). @polyfractal said it's an old PR anyway and needs to be updated onto the current pipeline aggs, so it's probably ok to stay closed.
Yep, this needs to be rebased against current master. I'll resubmit it soonish. I'll pull tags from this PR so it doesn't confuse anyone.
WIP, putting up for discussion.
Depends on the SiblingReducer functionality introduced in @colings86's "Max Aggregator" PR, so any changes in that PR will need to be reflected here. No need for a review yet, this is largely just to test the sibling functionality.
Autocorrelation
Autocorrelation shows the similarity between a time series and a "lagged" version of itself at different intervals of time. This can be used to determine if a signal has periodic elements hidden by noise. If there is a periodic element (repeating every n elements), there will be a peak in the autocorrelation every n lags. This is because the original time series will "line up" with the lagged version and display a high degree of similarity, even in the presence of noise.
As an example, this "Lemmings Population" is a very noisy sine wave with a 30-day period. If you squint hard enough, you can see the sine wave. The ACF of the series, however, clearly shows periodic elements. The peaks are spaced ~27 days, which is very close to the actual 30-day period.
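The "peak every n lags" behaviour can be reproduced with a small Python sketch. The data here is hypothetical (a seeded noisy sine wave with a 30-sample period, not the Lemmings series from the example), and `autocorrelation` is an illustrative brute-force helper rather than the reducer in this PR. It removes the mean and divides by the lag-0 value so that results fall in roughly -1..1.

```python
import math
import random


def autocorrelation(series, max_lag):
    """Brute-force ACF: mean-centered, scaled so that lag 0 equals 1."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    lag0 = sum(c * c for c in centered)  # n times the variance
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / lag0
        for lag in range(max_lag + 1)
    ]


# Hypothetical data: sine wave with a 30-sample period plus Gaussian noise.
rng = random.Random(42)
PERIOD = 30
series = [
    math.sin(2 * math.pi * t / PERIOD) + rng.gauss(0, 0.5)
    for t in range(300)
]

acf = autocorrelation(series, max_lag=60)
for lag in (0, 15, 30, 45, 60):
    print(f"lag {lag:2d}: {acf[lag]:+.2f}")
```

The values at lags 30 and 60 should come out clearly positive and the values at lags 15 and 45 clearly negative, which is the "lining up" effect described above: shifting by a full period matches the series against itself, while shifting by half a period matches peaks against troughs.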
Request
ACF is a sibling reducer, which accepts histogram or date_histogram input.
Parameters
bucketsPath: required
window: size of the time series to perform ACF on. If the series has length n, the ACF will be performed on values n - window .. n, i.e. the most recent values. Optional, defaults to 5
zero_mean: "centers" the ACF by removing the mean from the time series. Optional, defaults to true
zero_pad: pads the input data with zeros, up to the nearest power of two. FFTs are faster on powers of 2, and padding converts the ACF from a circular convolution to a linear convolution. Linear convolutions are more useful for "real world" use-cases. Optional, defaults to true
normalize: divides all ACF values by the variance, which normalizes the ACF to roughly -1..1. Optional, defaults to true
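As a rough illustration of how window, zero_mean and zero_pad reshape the input before the ACF itself is computed, here is a minimal Python sketch. `prepare_series` is a hypothetical helper, not code from this PR, and it reads "nearest power of two" as the next power of two that is not smaller than the windowed length. It covers only the three options that change the input; normalize applies to the ACF output and is not shown.

```python
def prepare_series(series, window=5, zero_mean=True, zero_pad=True):
    """Apply the window, zero_mean and zero_pad options to a list of values."""
    if window <= 0:
        raise ValueError("window must be positive")

    # window: keep only the most recent values (n - window .. n).
    recent = [float(x) for x in series[-window:]]

    # zero_mean: center the values by subtracting their mean.
    if zero_mean:
        mean = sum(recent) / len(recent)
        recent = [x - mean for x in recent]

    # zero_pad: append zeros up to a power-of-two length (assumed rounding up).
    if zero_pad:
        size = 1
        while size < len(recent):
            size *= 2
        recent += [0.0] * (size - len(recent))

    return recent


print(prepare_series([1, 2, 3, 4, 5, 6, 7, 8]))
# prints [-2.0, -1.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0]
```

With the defaults, the last five values 4..8 are kept, centered around their mean of 6, and padded with three zeros to reach length 8.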
Response
Todo
AbstractAggregationBuilder, and the need for InternalAcfBuilder, which is registered as an aggregation (due to siblings potentially being "top level" aggs). Unsure if this was the correct approach?