
Aggregations: Add autocorrelation agg #10377

Closed

Conversation

polyfractal
Contributor

WIP, putting up for discussion.

Depends on the SiblingReducer functionality introduced in @colings86's "Max Aggregator" PR, so any changes in that PR will need to be reflected here.

No need for a review yet, this is largely just to test the sibling functionality.

Autocorrelation

Autocorrelation shows the similarity between a time series and a "lagged" version of itself at different intervals of time. This can be used to determine if a signal has periodic elements hidden by noise. If there is a periodic element (repeating every n elements), there will be a peak in the Autocorrelation every n lags. This is because the original time series will "line up" with the lagged version and display a high degree of similarity, even in the presence of noise.

As an example, this "Lemmings Population" series is a very noisy sine wave with a 30-day period. If you squint hard enough, you can see the sine wave. The ACF of the series, however, clearly shows the periodic elements: the peaks are spaced ~27 days apart, which is very close to the actual 30-day period.

[screenshot: the noisy "Lemmings Population" series and its ACF]
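To make the idea concrete, here is a rough NumPy sketch of the same experiment (a stand-in, not the PR's Java/JTransforms code): a noisy 30-sample sine wave whose normalized, brute-force ACF peaks again near lag 30.

```python
import numpy as np

def acf(x, zero_mean=True, normalize=True):
    """Brute-force O(n^2) autocorrelation, mirroring the agg's defaults."""
    x = np.asarray(x, dtype=float)
    if zero_mean:
        x = x - x.mean()
    n = len(x)
    r = np.array([np.dot(x[:n - lag], x[lag:]) for lag in range(n)])
    if normalize:
        r = r / r[0]  # r[0] is the sum of squares, so lag 0 becomes 1
    return r

# Noisy sine wave with a 30-sample period, like the Lemmings example.
rng = np.random.default_rng(0)
t = np.arange(300)
series = np.sin(2 * np.pi * t / 30) + rng.normal(0.0, 0.5, t.size)

r = acf(series)
# The ACF should show a peak near lag 30 (the hidden period); we search
# the window [15, 45] to skip the large lobe around lag 0.
peak_lag = 15 + int(np.argmax(r[15:46]))
```

Even with noise drowning out the sine wave visually, the peak lag lands close to the true 30-sample period.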

Request

ACF is a sibling reducer, which accepts histogram or date_histogram input.

```
GET /test/test/_search?search_type=count
{
   "aggs": {
      "my_date_histo": {
         "date_histogram": {
            "field": "timestamp",
            "interval": "day",
            "min_doc_count": 0
         },
         "aggs": {
            "the_sum": {
               "sum": {
                  "field": "price"
               }
            }
         }
      },
      "the_acf": {
         "acf": {
            "bucketsPath": "my_date_histo.the_sum",
            "window" : 50
         }
      }
   }
}
```
Parameters
  • bucketsPath: required
  • window: size of the time series to perform the ACF on. If the series has length n, the ACF is computed over the values from n - window to n (i.e. the most recent window values). Optional, defaults to 5
  • zero_mean: "centers" the ACF by subtracting the mean from the time series. Optional, defaults to true
  • zero_pad: pads the input data with zeros up to the nearest power of two. FFTs are faster on power-of-two lengths, and padding converts the ACF from a circular convolution to a linear convolution, which is more useful for "real world" use-cases. Optional, defaults to true
  • normalize: divides all ACF values by the variance, which normalizes the ACF to roughly -1..1. Optional, defaults to true
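The three boolean options can be illustrated with a rough NumPy sketch (a stand-in, not the actual Java/JTransforms implementation; `series` stands for the metric values the agg extracts via bucketsPath):

```python
import numpy as np

def acf(series, zero_mean=True, zero_pad=True, normalize=True):
    x = np.asarray(series, dtype=float)
    if zero_mean:
        x = x - x.mean()                        # center the series
    n = len(x)
    if zero_pad:
        # Pad to the next power of two >= 2n - 1: faster FFT, and the
        # circular convolution becomes a linear one.
        m = 1 << (2 * n - 1).bit_length()
    else:
        m = n
    f = np.fft.rfft(x, m)
    r = np.fft.irfft(f * np.conj(f), m)[:n]     # Wiener-Khinchin: ACF = IFFT(|FFT|^2)
    if normalize:
        r = r / r[0]                            # r[0] is the sum of squares -> values in ~[-1, 1]
    return r
```

With the defaults, this matches the brute-force linear autocorrelation of the centered series, normalized so lag 0 equals 1.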

Response

```
{
   "took": 14,
   "timed_out": false,
   "_shards": { ... },
   "hits": { ... },
   "aggregations": {
      "my_date_histo": {
         "buckets": [ ... ]
      },
      "the_acf": {
         "values": [
            1,
            0.37343470483005364,
            -0.360763267740012,
            0.17441860465116257,
            0.5277280858676209
         ]
      }
   }
}
```

Todo

  • Benchmark the padding vs non-padding speed, particularly if power of two matters. JTransforms can accept any length input, but uses different radix algos which are theoretically slower. OTOH, padding to power of two could mean more memory usage, and a bigger power-2 FFT might be slower than smaller non-power-2 FFT.
  • Add setting to disable linear convolution correction when using padding? If you don't care about the correction, it'll save some CPU cycles by eliminating the mask (forward FFT, PSD computation, inverse FFT)
  • Settings configuration is a bit wonky, talk to Colin about a better way?
  • Talk to Colin about how sibling builders need to extend AbstractAggregationBuilder, and the need for InternalAcfBuilder which is registered as an aggregation (due to siblings potentially being "top level" aggs). Unsure if this was the correct approach?
  • Validate the agg structure, since this can only accept histo siblings
  • More tests. Tests are hard since the "brute force" approaches are also approximations...so we are comparing approximations against approximations. Unsure how much we can randomize these tests for that reason. Needs more thinking.

@jpountz
Contributor

jpountz commented Apr 1, 2015

This is exciting. :) I am not familiar with the theory so please excuse me if my questions are silly.

window: size of the time series to perform the ACF on. If the series has length n, the ACF is computed over the values from n - window to n (i.e. the most recent window values). Optional, defaults to 5

Is there any particular reason to only apply the analysis to the last elements, could we apply it to all histogram buckets by default instead of the last 5 ones?

Unsure how much we can randomize these tests for that reason. Needs more thinking.

Not sure how applicable it is in your case, but in such cases I tend to like having both randomized tests that only check invariants in the output, and static tests when it comes to assessing that the algorithm actually works. Otherwise it can quickly become a nightmare to debug failures.
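For illustration, that split might look like this in hypothetical NumPy-based test code (the `acf` helper here is an assumption, not the PR's implementation): randomized inputs check only mathematical invariants, while one static input pins down exact values.

```python
import numpy as np

def acf(x):
    """Hypothetical normalized brute-force ACF used by the tests below."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(n)])
    return r / r[0]

# Randomized: invariants that must hold for ANY input.
rng = np.random.default_rng(42)
for _ in range(100):
    x = rng.normal(size=int(rng.integers(8, 64)))
    r = acf(x)
    assert abs(r[0] - 1.0) < 1e-12           # lag 0 is always 1 after normalizing
    assert np.all(np.abs(r) <= 1.0 + 1e-9)   # Cauchy-Schwarz bound on every lag

# Static: exact expected values for one hand-checked input.
# Centered series is [-1.5, -0.5, 0.5, 1.5]; raw ACF is [5, 1.25, -1.5, -2.25].
r = acf([1.0, 2.0, 3.0, 4.0])
assert np.allclose(r, [1.0, 0.25, -0.3, -0.45])
```

The randomized half can run against any input without knowing the "right" answer; the static half catches algorithmic mistakes the invariants would miss.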

@polyfractal
Contributor Author

Is there any particular reason to only apply the analysis to the last elements, could we apply it to all histogram buckets by default instead of the last 5 ones?

No particular reason, I was mostly just thinking about performance (e.g. if you accidentally ask for an autocorrelation of 100k points). Perhaps we should default it to everything, but provide window as an option if you don't want the complete autocorrelation history?

Practically speaking, ACF becomes less useful (I think) the farther back in time you go. And the higher order lags have more approximation error that accumulates.

Not sure how applicable it is in your case, but in such cases I tend to like having both randomized tests that only check invariants in the output, and static tests when it comes to assessing that the algorithm actually works. Otherwise it can quickly become a nightmare to debug failures.

Ahh, this makes sense. I'll see what I can do to split the tests into those two categories.

@jpountz
Contributor

jpountz commented Apr 1, 2015

Perhaps we should default it to everything, but provide window as an option if you don't want the complete autocorrelation history?

This makes sense to me. Since you mentioned performance, this got me curious: what is the runtime complexity of this reduction, and do you know, e.g., how much time it takes in practice to process N data points?

@polyfractal
Contributor Author

This makes sense to me. Since you mentioned performance, this got me curious: what is the runtime complexity of this reduction, and do you know, e.g., how much time it takes in practice to process N data points?

I have no idea :D Real benchmarks are on the top of my to-do list...I'm curious to see where this breaks.

Classical radix-2 FFTs have a complexity of O(n log n). I'm not sure what optimizations JTransforms is using; it may be better than that. JTransforms has some benchmark results which claim an FFT on 1m datapoints takes 10ms, and an FFT on 23m values takes 700ms. Timings are a bit slower if you include "construction" of the FFT plan (e.g. when you instantiate the object).

For the non-padded ACF: two FFTs, one O(n) loop over the data to compute magnitudes, and potentially an extra O(n) loop to normalize. Note the FFTs will be non-radix-2, so they may be slower.

For the padded ACF: four FFTs, two O(n) loops for magnitudes, and potentially an extra O(n) loop to normalize.

The brute-force, non-FFT ACF functions are O(n²).
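As a sanity check on the equivalence claim behind those cost figures, here is a small NumPy sketch (standing in for JTransforms) showing that the O(n²) brute force and the zero-padded O(n log n) FFT route produce the same raw ACF:

```python
import numpy as np

def acf_brute(x):
    """O(n^2): one dot product per lag."""
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) for k in range(n)])

def acf_fft(x):
    """O(n log n): forward FFT, O(n) magnitude (PSD) loop, inverse FFT."""
    n = len(x)
    m = 2 * n                         # zero-pad so circular convolution == linear
    f = np.fft.rfft(x, m)             # forward FFT
    psd = f * np.conj(f)              # O(n) magnitude computation
    return np.fft.irfft(psd, m)[:n]   # inverse FFT, keep the first n lags

x = np.array([2.0, -1.0, 0.5, 3.0, -2.0, 1.0])
assert np.allclose(acf_brute(x), acf_fft(x))
```

Only the asymptotic cost differs; the zero padding is what keeps the FFT version's wrap-around terms from corrupting the linear result.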

@clintongormley
Contributor

@colings86 why did you close this, was it merged?

@colings86
Contributor

@clintongormley it auto-closed because the feature/aggs_2_0 branch got deleted (since it's no longer needed). @polyfractal said it's an old PR anyway and needs to be updated onto the current pipeline aggs, so it's probably ok to stay closed

@polyfractal
Contributor Author

Yep, this needs to be rebased against current master. I'll resubmit it soonish.

I'll pull tags from this PR so it doesn't confuse anyone.
