computing correlation for aggregated results on fly #27983
Comments
Hi @panffeng! We chatted about this and we'd really like to have a generic correlation pipeline agg, but we're unsure when/if we'll get time to work on it. We'd be happy to work with you on a PR though, if you want to contribute some code. I started an autocorrelation pipeline aggregation back in the 2.x days (#10377). In the PR, I opted to use the FFT approach since it's O(n log n) rather than the non-FFT O(n^2). But that required an external library dependency (I used JTransforms at the time), which we wanted to avoid, and so the PR was never merged. Moving forward, I think we have some options:
Hi @polyfractal, good to know that correlation is a candidate feature. I would like to work with you on a PR. I think we can take a progressive approach to the two options. First, for a small batch of a few thousand buckets, the naive approach should work. Then, for a large batch of buckets, we can apply an FFT to the time series, compute the inner product, and apply the inverse FFT. I checked your PR; it was based on 2.x, so we should now base the work on current master, right? Also, the PR was mainly about ACF. Now we are going to implement a generic correlation that includes ACF, aren't we?
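To make the two approaches concrete, here is a minimal sketch in plain Python, not the code from either PR. It computes the raw lagged sums r[k] = sum over t of x[t] * x[t + k], first naively and then with an FFT. The function names are hypothetical, and the small recursive radix-2 FFT is only a stand-in for a library such as JTransforms. In the FFT version, the step between the forward and inverse transforms multiplies each spectrum value by its complex conjugate.

```python
import cmath


def lagged_sums_naive(x, max_lag):
    """r[k] = sum_t x[t] * x[t + k] for k = 0..max_lag; O(n * max_lag) work."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) for k in range(max_lag + 1)]


def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two.

    With invert=True the exponent sign is flipped but the result is NOT
    divided by len(a); the caller does that normalisation.
    """
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def lagged_sums_fft(x, max_lag):
    """Same values as lagged_sums_naive, computed in O(n log n)."""
    n = len(x)
    size = 1
    while size < 2 * n:  # zero-pad so the circular result does not wrap around
        size *= 2
    padded = [complex(v) for v in x] + [0j] * (size - n)
    spectrum = fft(padded)
    power = [s * s.conjugate() for s in spectrum]  # element-wise |X[f]|^2
    result = fft(power, invert=True)
    return [result[k].real / size for k in range(max_lag + 1)]
```

Subtracting the series mean first and dividing every r[k] by r[0] turns these sums into the usual autocorrelation coefficients. Both functions assume max_lag is smaller than the series length.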
Sounds good!
This seems reasonable. I never tested the naive approach, perhaps it is fast enough... we can do some benchmarks to see if we need to go through the effort of the fft approach. Although @colings86 reminded me that a few thousand buckets turns into a million iterations with O(n^2), so I guess we'll see :)
Correct, we'd want to target master for the new PR. If/once merged, we can backport it to the appropriate branches. No need to use that old PR code either, it's probably so hopelessly out of date it isn't even worth looking at.
++ I think it makes more sense to do a generic correlation, more widely useful. I wonder if we should make the API syntax a bit more explicit?
"correl": {
  "first_series_path": "red_cars>sales",
  "second_series_path": "total_sales",
  "lag": 1
}
Not great naming, but that's the idea. What do you think? Also, I wonder if we should support multiple lags (e.g.
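As a rough illustration of what a two-series correlation with a `lag` parameter could compute, here is a small Python sketch of a lagged Pearson correlation. The function names are hypothetical, and the lag direction (pairing `first[t]` with `second[t + lag]`) is an assumption; the thread does not define it.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(first, second, lag=0):
    """Correlate first[t] with second[t + lag] (assumed lag convention)."""
    if lag < 0:
        # A negative lag shifts the other series instead.
        return lagged_correlation(second, first, -lag)
    n = min(len(first), len(second) - lag)
    return pearson(first[:n], second[lag:lag + n])
```

Supporting multiple lags would then amount to calling `lagged_correlation` once per requested lag, which is where the quadratic cost of the naive approach comes from.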
Hi @polyfractal, I have started writing code for the aggregation. I will take the suggestions about two time series and multiple lags into account in the implementation.

I have run into an issue concerning the aggregation type and where the result is placed. If the correlation is a sub-aggregation of a date_histogram, then it will extend AbstractPipelineAggregationBuilder and put the correlation result inside the date histogram's result. Having the correlation sit inside the results of the two time series would be confusing. Alternatively, we can make the correlation a sibling aggregation of the date_histogram. Then it will extend BucketMetricsPipelineAggregationBuilder and put the correlation result next to the date histogram's result. However, the reduce method in BucketMetricsPipelineAggregator reads bucketsPaths()[0], so it uses only the first buckets path, while a generic correlation needs at least two buckets paths. Would we have to introduce a new aggregator, say MultipleBucketMetricsPipelineAggregator, for this case?

Another question is about missing values or nulls in the time series. We would just drop a pair if either value is missing or null, as R's cor and Excel's CORREL do, wouldn't we?
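The pair-dropping rule for missing values can be sketched as follows. This is a hypothetical Python helper, not part of the linked commit, and treating both `None` and NaN as missing is an assumption. Note that R's `cor` applies this rule only when called with `use = "complete.obs"` or `use = "pairwise.complete.obs"`; by default it returns NA when an input contains NA.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption about the bucket values)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete_pairs(first, second):
    """Keep only the positions where both series have a value."""
    kept = [
        (a, b)
        for a, b in zip(first, second)
        if not is_missing(a) and not is_missing(b)
    ]
    xs = [a for a, _ in kept]
    ys = [b for _, b in kept]
    return xs, ys
```

The two filtered lists stay aligned, so they can be passed straight to whatever correlation routine is used afterwards.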
@panffeng note that a sibling pipeline aggregator does not need to extend BucketMetricsPipelineAggregationBuilder; the builder can just extend AbstractPipelineAggregationBuilder. The requirement for a sibling pipeline aggregation is that the aggregator class extends SiblingPipelineAggregator.
Hi @colings86, thanks. I will look into the AbstractPipelineAggregationBuilder class.
I think we should follow how other pipeline aggs and use the The
Hi @polyfractal, I wrote a very basic version with a basic test here: panffeng@d64f3b2. The correlation aggregator extends SiblingPipelineAggregator directly. Multiple lags are not yet supported. Lags are used much more frequently for autocorrelation, so my plan is to compute autocorrelation with lags and cross-correlation with zero lag. Does this make sense? Also, is there any code example for the new plugin SPI you mentioned earlier? Or is it for plugins only? I want to add the implementation of the FFT approach. Thanks.
cc @elastic/es-search-aggs
Hiya @panffeng, sorry for the delay. This slipped through my inbox and I didn't notice your reply.

We don't have any examples of a plugin using a pipeline aggregation; the closest is a module that adds the matrix-stats aggregation (https://github.com/elastic/elasticsearch/tree/master/modules/aggs-matrix-stats). That would contain a lot of similar boilerplate, but probably not identical. There may also be parts that just don't work yet when trying to plug in a pipeline agg... I'm not sure, I would have to dig into the code more closely.

I quickly skimmed your commit. My only note is that, if possible, you could try using the newer static parser style (like this: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/search/aggregations/pipeline/movfn/MovFnPipelineAggregationBuilder.java#L62). It tends to be easier to read since it's more compact. It isn't always possible to use, however, depending on what you need to do.

It'd probably be easiest to move this forward as a PR so we can comment on it directly and help out, if you're still interested in working on it. Sorry again for the delay!
Closing as this seems to have stalled. If this is still of interest, feel free to open a PR and the team can revisit!
Describe the feature: Elasticsearch is often used for data and time series analysis. Yet some common functions, such as computing the correlation between two time series, are not supported.
The matrix_stats aggregation can only use fields in the documents (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-matrix-stats-aggregation.html). Frequently, the time series are aggregated on the fly, and writing intermediate results back as new documents is not a good option.
A script aggregation is another way to implement correlation, yet the script needed to compute it would complicate the overall processing (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-bucket-script-aggregation.html).
Currently, one needs to do this calculation client-side or implement it in a plugin.
Suppose a correl aggregation were directly available and accepted aggregated results as inputs. A simple query would then return the correlation. For example, the following DSL would compute the correlation between monthly total sales and monthly red car sales. (The data set is the car sales data from https://www.elastic.co/guide/en/elasticsearch/guide/current/_aggregation_test_drive.html.)