
Aggregations: Add autocorrelation agg #10377

Closed

Conversation

polyfractal
Contributor

WIP, putting up for discussion.

Depends on the SiblingReducer functionality introduced in @colings86's "Max Aggregator" PR, so any changes in that PR will need to be reflected here.

No need for a review yet, this is largely just to test the sibling functionality.

Autocorrelation

Autocorrelation shows the similarity between a time series and a "lagged" version of itself at different intervals of time. This can be used to determine if a signal has periodic elements hidden by noise. If there is a periodic element (repeating every n elements), there will be a peak in the Autocorrelation every n lags. This is because the original time series will "line up" with the lagged version and display a high degree of similarity, even in the presence of noise.

As an example, this "Lemmings Population" series is a very noisy sine wave with a 30-day period. If you squint hard enough, you can see the sine wave. The ACF of the series, however, clearly shows the periodic elements: the peaks are spaced ~27 days apart, which is very close to the actual 30-day period.

[screenshot: the noisy "Lemmings Population" series and its ACF]
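To make the idea concrete, here is a rough NumPy sketch of the same experiment (a stand-in, not the PR's Java/JTransforms code): a noisy 30-sample sine wave whose normalized, brute-force ACF peaks again near lag 30.

```python
import numpy as np

def acf(x, zero_mean=True, normalize=True):
    """Brute-force O(n^2) autocorrelation, mirroring the agg's defaults."""
    x = np.asarray(x, dtype=float)
    if zero_mean:
        x = x - x.mean()
    n = len(x)
    r = np.array([np.dot(x[:n - lag], x[lag:]) for lag in range(n)])
    if normalize:
        r = r / r[0]  # r[0] is the sum of squares, so lag 0 becomes 1
    return r

# Noisy sine wave with a 30-sample period, like the Lemmings example.
rng = np.random.default_rng(0)
t = np.arange(300)
series = np.sin(2 * np.pi * t / 30) + rng.normal(0.0, 0.5, t.size)

r = acf(series)
# The ACF should show a peak near lag 30 (the hidden period); we search
# the window [15, 45] to skip the large lobe around lag 0.
peak_lag = 15 + int(np.argmax(r[15:46]))
```

Even with noise drowning out the sine wave visually, the peak lag lands close to the true 30-sample period.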

Request

ACF is a sibling reducer, which accepts histogram or date_histogram input.

```
GET /test/test/_search?search_type=count
{
   "aggs": {
      "my_date_histo": {
         "date_histogram": {
            "field": "timestamp",
            "interval": "day",
            "min_doc_count": 0
         },
         "aggs": {
            "the_sum": {
               "sum": {
                  "field": "price"
               }
            }
         }
      },
      "the_acf": {
         "acf": {
            "bucketsPath": "my_date_histo.the_sum",
            "window" : 50
         }
      }
   }
}
```
Parameters
  • bucketsPath: required
  • window: size of the time series to perform the ACF on. If the series has length n, the ACF is computed over the values from n - window to n (i.e. the most recent window values). Optional, defaults to 5
  • zero_mean: "centers" the ACF by subtracting the mean from the time series. Optional, defaults to true
  • zero_pad: pads the input data with zeros up to the nearest power of two. FFTs are faster on power-of-two lengths, and padding converts the ACF from a circular convolution to a linear convolution, which is more useful for "real world" use-cases. Optional, defaults to true
  • normalize: divides all ACF values by the variance, which normalizes the ACF to roughly -1..1. Optional, defaults to true
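The three boolean options can be illustrated with a rough NumPy sketch (a stand-in, not the actual Java/JTransforms implementation; `series` stands for the metric values the agg extracts via bucketsPath):

```python
import numpy as np

def acf(series, zero_mean=True, zero_pad=True, normalize=True):
    x = np.asarray(series, dtype=float)
    if zero_mean:
        x = x - x.mean()                        # center the series
    n = len(x)
    if zero_pad:
        # Pad to the next power of two >= 2n - 1: faster FFT, and the
        # circular convolution becomes a linear one.
        m = 1 << (2 * n - 1).bit_length()
    else:
        m = n
    f = np.fft.rfft(x, m)
    r = np.fft.irfft(f * np.conj(f), m)[:n]     # Wiener-Khinchin: ACF = IFFT(|FFT|^2)
    if normalize:
        r = r / r[0]                            # r[0] is the sum of squares -> values in ~[-1, 1]
    return r
```

With the defaults, this matches the brute-force linear autocorrelation of the centered series, normalized so lag 0 equals 1.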

Response

```
{
   "took": 14,
   "timed_out": false,
   "_shards": { ... },
   "hits": { ... },
   "aggregations": {
      "my_date_histo": {
         "buckets": [ ... ]
      },
      "the_acf": {
         "values": [
            1,
            0.37343470483005364,
            -0.360763267740012,
            0.17441860465116257,
            0.5277280858676209
         ]
      }
   }
}
```

Todo

  • Benchmark the padding vs non-padding speed, particularly if power of two matters. JTransforms can accept any length input, but uses different radix algos which are theoretically slower. OTOH, padding to power of two could mean more memory usage, and a bigger power-2 FFT might be slower than smaller non-power-2 FFT.
  • Add setting to disable linear convolution correction when using padding? If you don't care about the correction, it'll save some CPU cycles by eliminating the mask (forward FFT, PSD computation, inverse FFT)
  • Settings configuration is a bit wonky, talk to Colin about a better way?
  • Talk to Colin about how sibling builders need to extend AbstractAggregationBuilder, and the need for InternalAcfBuilder which is registered as an aggregation (due to siblings potentially being "top level" aggs). Unsure if this was the correct approach?
  • Validate the agg structure, since this can only accept histo siblings
  • More tests. Tests are hard since the "brute force" approaches are also approximations...so we are comparing approximations against approximations. Unsure how much we can randomize these tests for that reason. Needs more thinking.

@jpountz
Contributor

jpountz commented Apr 1, 2015

This is exciting. :) I am not familiar with the theory so please excuse me if my questions are silly.

window: size of the time series to perform the ACF on. If the series has length n, the ACF is computed over the values from n - window to n (i.e. the most recent window values). Optional, defaults to 5

Is there any particular reason to only apply the analysis to the last elements, could we apply it to all histogram buckets by default instead of the last 5 ones?

Unsure how much we can randomize these tests for that reason. Needs more thinking.

Not sure how applicable it is in your case, but in such cases I tend to like having both randomized tests that only check invariants in the output, and static tests when it comes to assessing that the algorithm actually works. Otherwise it can quickly become a nightmare to debug failures.
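For illustration, that split might look like this in hypothetical NumPy-based test code (the `acf` helper here is an assumption, not the PR's implementation): randomized inputs check only mathematical invariants, while one static input pins down exact values.

```python
import numpy as np

def acf(x):
    """Hypothetical normalized brute-force ACF used by the tests below."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(n)])
    return r / r[0]

# Randomized: invariants that must hold for ANY input.
rng = np.random.default_rng(42)
for _ in range(100):
    x = rng.normal(size=int(rng.integers(8, 64)))
    r = acf(x)
    assert abs(r[0] - 1.0) < 1e-12           # lag 0 is always 1 after normalizing
    assert np.all(np.abs(r) <= 1.0 + 1e-9)   # Cauchy-Schwarz bound on every lag

# Static: exact expected values for one hand-checked input.
# Centered series is [-1.5, -0.5, 0.5, 1.5]; raw ACF is [5, 1.25, -1.5, -2.25].
r = acf([1.0, 2.0, 3.0, 4.0])
assert np.allclose(r, [1.0, 0.25, -0.3, -0.45])
```

The randomized half can run against any input without knowing the "right" answer; the static half catches algorithmic mistakes the invariants would miss.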

@polyfractal
Contributor Author

Is there any particular reason to only apply the analysis to the last elements, could we apply it to all histogram buckets by default instead of the last 5 ones?

No particular reason, I was mostly just thinking about performance (e.g. if you accidentally ask for an autocorrelation of 100k points). Perhaps we should default it to everything, but provide window as an option if you don't want the complete autocorrelation history?

Practically speaking, ACF becomes less useful (I think) the farther back in time you go. And the higher order lags have more approximation error that accumulates.

Not sure how applicable it is in your case, but in such cases I tend to like having both randomized tests that only check invariants in the output, and static tests when it comes to assessing that the algorithm actually works. Otherwise it can quickly become a nightmare to debug failures.

Ahh, this makes sense. I'll see what I can do to split the tests into those two categories.

@jpountz
Contributor

jpountz commented Apr 1, 2015

Perhaps we should default it to everything, but provide window as an option if you don't want the complete autocorrelation history?

This makes sense to me. Since you mentioned performance, this got me curious: what is the runtime complexity of this reduction, and do you know, e.g., how much time it takes in practice to process N data points?

@polyfractal
Contributor Author

This makes sense to me. Since you mentioned performance, this got me curious: what is the runtime complexity of this reduction, and do you know, e.g., how much time it takes in practice to process N data points?

I have no idea :D Real benchmarks are on the top of my to-do list...I'm curious to see where this breaks.

Classical radix-2 FFTs have a complexity of O(n log n). I'm not sure what optimizations JTransforms is using; it may be better than that. JTransforms has some benchmark results which claim an FFT on 1m datapoints takes 10ms, and an FFT on 23m values takes 700ms. Timings are a bit slower if you include "construction" of the FFT plan (e.g. when you instantiate the object).

For the non-padded ACF: two FFTs, one O(n) loop over the data to compute magnitudes, and potentially an extra O(n) loop to normalize. Note the FFTs will be non-radix-2, so they may be slower.

For the padded ACF: four FFTs, two O(n) loops for magnitudes, and potentially an extra O(n) loop to normalize.

The brute-force, non-FFT ACF functions are O(n²).
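As a sanity check on the equivalence claim behind those cost figures, here is a small NumPy sketch (standing in for JTransforms) showing that the O(n²) brute force and the zero-padded O(n log n) FFT route produce the same raw ACF:

```python
import numpy as np

def acf_brute(x):
    """O(n^2): one dot product per lag."""
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) for k in range(n)])

def acf_fft(x):
    """O(n log n): forward FFT, O(n) magnitude (PSD) loop, inverse FFT."""
    n = len(x)
    m = 2 * n                         # zero-pad so circular convolution == linear
    f = np.fft.rfft(x, m)             # forward FFT
    psd = f * np.conj(f)              # O(n) magnitude computation
    return np.fft.irfft(psd, m)[:n]   # inverse FFT, keep the first n lags

x = np.array([2.0, -1.0, 0.5, 3.0, -2.0, 1.0])
assert np.allclose(acf_brute(x), acf_fft(x))
```

Only the asymptotic cost differs; the zero padding is what keeps the FFT version's wrap-around terms from corrupting the linear result.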

@clintongormley
Contributor

@colings86 why did you close this, was it merged?

@colings86
Contributor

@clintongormley it auto-closed because the feature/aggs_2_0 branch got deleted (since it's no longer needed). @polyfractal said it's an old PR anyway and needs to be updated onto the current pipeline aggs, so it's probably ok to stay closed

@polyfractal
Contributor Author

Yep, this needs to be rebased against current master. I'll resubmit it soonish.

I'll pull tags from this PR so it doesn't confuse anyone.
