computing correlation for aggregated results on fly #27983

panffeng · 2017-12-25T13:54:13Z

Describe the feature: Elasticsearch is often used for data and time series analysis. Yet, some common functions like computing correlation between 2 time series are not supported.

The matrix_stats aggregation can only use fields in the documents (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-matrix-stats-aggregation.html). Frequently, though, the time series are aggregated on the fly, and writing intermediate results back as new documents is not a good option.

A script aggregation is another way to implement correlation, but the script needed to compute it would complicate the overall processing (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-bucket-script-aggregation.html).

Currently, one has to do this calculation client-side or implement it in a plugin.

Suppose a correl aggregation were directly available and accepted aggregated results as inputs. A simple query would then return the correlation. For example, the following DSL would compute the correlation between the monthly total sales and the monthly red car sales. (The data set is the car sales data from https://www.elastic.co/guide/en/elasticsearch/guide/current/_aggregation_test_drive.html )

{
    "size": 0,
    "aggs": {
        "sales_per_month": {
            "date_histogram": {
                "field": "sold",
                "interval": "month"
            },
            "aggs": {
                "total_sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "red_cars": {
                    "filter": {
                        "term": {
                            "color": "red"
                        }
                    },
                    "aggs": {
                        "sales": {
                            "sum": {
                                "field": "price"
                            }
                        }
                    }
                },
                "correlation": {
                    "correl": {
                        "buckets_path": {
                            "redCarsSales": "red_cars>sales",
                            "totalSales": "total_sales"
                        },
                        "lag": 1
                    }
                }
            }
        }
    }
}
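For comparison, the client-side workaround mentioned above can be sketched in Python. This is only an illustration: it assumes the query is run without the proposed correl aggregation, and that the response has the usual date_histogram shape with the aggregation names used above (sales_per_month, total_sales, red_cars>sales). The sample response is hand-written, not real output, and the helper names pearson and series_from_response are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def series_from_response(response):
    """Pull the two per-month series out of the date_histogram buckets."""
    buckets = response["aggregations"]["sales_per_month"]["buckets"]
    total = [b["total_sales"]["value"] for b in buckets]
    red = [b["red_cars"]["sales"]["value"] for b in buckets]
    return total, red


# Hand-written stand-in for a search response (assumed shape, made-up numbers).
sample_response = {
    "aggregations": {
        "sales_per_month": {
            "buckets": [
                {"total_sales": {"value": 100.0}, "red_cars": {"sales": {"value": 10.0}}},
                {"total_sales": {"value": 200.0}, "red_cars": {"sales": {"value": 30.0}}},
                {"total_sales": {"value": 300.0}, "red_cars": {"sales": {"value": 20.0}}},
                {"total_sales": {"value": 400.0}, "red_cars": {"sales": {"value": 40.0}}},
            ]
        }
    }
}

total_sales, red_sales = series_from_response(sample_response)
r = pearson(total_sales, red_sales)
print(round(r, 6))
```

The drawback is the one described above: every bucket has to travel to the client before a single number can be computed.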

polyfractal · 2018-01-12T15:26:05Z

Hi @panffeng! We chatted about this and we'd really like to have a generic correlation pipeline agg, but unsure when/if we'll get time to work on it. We'd be happy to work with you on a PR though, if you were wanting to contribute some code.

I started an autocorrelation pipeline aggregation back in the 2.x age (#10377). In the PR, I opted to use the FFT approach since it's O(n log n) rather than the non-FFT O(n²). But that required an external library dependency (I used JTransforms at the time), which we wanted to avoid, and so the PR was never merged.

Moving forward, I think we have some options:

1. Implement correlation agg using the naive O(n²) approach. Pipeline aggs generally operate on small magnitudes anyway (thousands of buckets), so the expensive runtime complexity may not be a concern in practice.
2. Implement using FFT + library dependency. The new plugin SPI work allows plugins to extend the framework and include their own dependencies without "polluting" the core with dependencies. We'd just have to modify pipeline aggs to allow extending their functionality through SPI.
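To make the cost of the naive option concrete, here is a minimal Python sketch of a naive cross-correlation over every non-negative lag. It computes the raw sum of lagged products rather than a normalized coefficient, and the function name and the operation counter are illustrative assumptions, not code from the old PR.

```python
def cross_correlation_naive(xs, ys):
    """Raw cross-correlation sum_t xs[t] * ys[t + lag] for lag = 0 .. n-1.

    Returns the list of sums and the number of multiply-add steps performed.
    """
    n = len(xs)
    if len(ys) != n:
        raise ValueError("series must have the same length")
    sums = []
    steps = 0
    for lag in range(n):              # one pass per lag ...
        acc = 0.0
        for t in range(n - lag):      # ... over the overlapping part of the series
            acc += xs[t] * ys[t + lag]
            steps += 1
        sums.append(acc)
    return sums, steps


sums, steps = cross_correlation_naive([1, 2, 3], [4, 5, 6])
print(sums, steps)
```

The inner loop runs n(n+1)/2 times in total, so 2,000 buckets cost 2,001,000 multiply-add steps. That is quadratic, but still a modest amount of work for a single reduce phase.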

panffeng · 2018-01-15T03:38:10Z

Hi @polyfractal , good to know that the correlation is a candidate feature. I would like to work with you guys for a PR.

I guess we can take a progressive approach to the two options. First, for a small batch of buckets (up to a few thousand), the naive approach should work. Then, for a large batch of buckets, we can apply an FFT to each time series, multiply the transforms element-wise (conjugating one of them), and apply the inverse FFT.
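The FFT route can be sketched as follows. This is a self-contained illustration in Python using a textbook recursive radix-2 FFT, not JTransforms and not the code from the old PR; the function names are assumptions. Both series are zero-padded to a power of two of at least twice their length so that the circular correlation produced by the FFT does not wrap around.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two.

    With invert=True the inverse transform is computed without the 1/n scaling.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def cross_correlation_fft(xs, ys):
    """Raw cross-correlation sum_t xs[t] * ys[t + lag] for lag = 0 .. n-1."""
    n = len(xs)
    if n == 0 or len(ys) != n:
        raise ValueError("series must be non-empty and of equal length")
    size = 1
    while size < 2 * n:               # pad so circular wrap-around adds only zeros
        size *= 2
    fx = fft([complex(v) for v in xs] + [0j] * (size - n))
    fy = fft([complex(v) for v in ys] + [0j] * (size - n))
    product = [fx[k].conjugate() * fy[k] for k in range(size)]
    inverse = fft(product, invert=True)
    return [inverse[lag].real / size for lag in range(n)]


result = cross_correlation_fft([1, 2, 3], [4, 5, 6])
print([round(v, 6) for v in result])
```

Each transform costs O(n log n), so the whole computation stays O(n log n) regardless of how many lags are requested.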

I checked your PR. The PR was based on 2.x. So now, we should base on current master, right? Also the PR was mainly about ACF. Now we are going to implement a generic correlation including ACF, aren't we?

polyfractal · 2018-01-16T16:20:20Z

Sounds good!

> I guess we can take a progressive approach to the two options. First, for a small batch of buckets (up to a few thousand), the naive approach should work. Then, for a large batch of buckets, we can apply an FFT to each time series, multiply the transforms element-wise (conjugating one of them), and apply the inverse FFT.

This seems reasonable. I never tested the naive approach, perhaps it is fast enough... we can do some benchmarks to see if we need to go through the effort of the fft approach. Although @colings86 reminded me that a few thousand buckets turns into a million iterations with O(n²), so I guess we'll see :)

> I checked your PR. The PR was based on 2.x. So now, we should base on current master, right?

Correct, we'd want to target master for the new PR. If/once merged, we can backport it to the appropriate branches. No need to use that old PR code either, it's probably so hopelessly out of date it isn't even worth looking at.

> Also the PR was mainly about ACF. Now we are going to implement a generic correlation including ACF, aren't we?

++ I think it makes more sense to do a generic correlation, more widely useful.

I wonder if we should make the API syntax a bit more explicit? buckets_path in other aggs lets you specify as many paths as you want (and to name them for use in scripts), but here we only want two. Maybe something like:

"correl": {
  "first_series_path": "red_cars>sales",
  "second_series_path": "total_sales",
  "lag": 1
}

Not great naming, but that's the idea. What do you think?

Also, I wonder if we should support multiple lags (e.g. "lag": [1,2,3,4,5]), so that correlogram plots can be created with a single aggregation? Not sure how much that'd complicate the code or math, haven't looked at this sort of thing in a long time :)
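As a rough idea of what a multi-lag result could contain, here is a small Python sketch that computes one Pearson coefficient per requested lag, which is the data behind a correlogram. The lag convention (pairing first[t] with second[t + lag]) and the function names are assumptions made for the example; the thread does not fix either.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def lagged_correlations(first, second, lags):
    """Map each lag to the correlation of first[t] with second[t + lag]."""
    if len(first) != len(second):
        raise ValueError("series must have the same length")
    n = len(first)
    result = {}
    for lag in lags:
        if lag < 0 or n - lag < 2:
            raise ValueError("lag %r leaves fewer than two overlapping points" % lag)
        # Slice by explicit end index so that lag == 0 keeps the whole series.
        result[lag] = pearson(first[: n - lag], second[lag:])
    return result


first = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
second = [9.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # first, delayed by one bucket
correlogram = lagged_correlations(first, second, [0, 1, 2])
print({lag: round(value, 4) for lag, value in correlogram.items()})
```

Passing first as both arguments turns the same function into an autocorrelation.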

panffeng · 2018-01-21T02:55:56Z

Hi @polyfractal, I started writing code for the aggregation. I will take the suggestions about two explicit series paths and multiple lags into account in the implementation.

I ran into an issue, though. It's about the aggregation type and where the result is placed.

If the correlation is a sub-aggregation of a date_histogram, then the correlation builder will extend AbstractPipelineAggregationBuilder and the correlation result will be placed inside the date histogram's result. It would be confusing to have the correlation sit inside the results of the two time series.

Alternatively, we can make the correlation a sibling aggregation of the date_histogram. Then the correlation builder will extend BucketMetricsPipelineAggregationBuilder and the correlation result will be placed next to the date histogram's result. However, the reduce method in class BucketMetricsPipelineAggregator reads bucketsPaths()[0], i.e. only the first buckets path, while a generic correlation needs at least two buckets paths. So would we have to introduce a new aggregator, say MultipleBucketMetricsPipelineAggregator, for this case?

Another question is about missing values or nulls in the time series. We will just drop the pair if either value is missing or null, as the R cor and Excel CORREL operators do, won't we?

colings86 · 2018-01-22T09:32:41Z

@panffeng note that the builder for a sibling pipeline aggregation does not need to extend BucketMetricsPipelineAggregationBuilder; it can just extend AbstractPipelineAggregationBuilder. The requirement for a sibling pipeline aggregation is that the aggregator class extends SiblingPipelineAggregator.

panffeng · 2018-01-23T03:26:58Z

Hi @colings86 , thanks. I will look into the class AbstractPipelineAggregationBuilder.

polyfractal · 2018-01-23T15:03:00Z

> Another question is about missing values or nulls in the time series. We will just drop the pair if either value is missing or null, as the R cor and Excel CORREL operators do, won't we?

I think we should follow how other pipeline aggs work and use the gap_policy parameter to control the behavior. Defaulting to skip seems appropriate, and would behave as you mentioned: drop the pair if either value is missing/null.

The insert_zeros gap policy probably won't get used much with this aggregation since it doesn't make sense, but it'll be consistent with the other aggs to support it. And in the future we could add more intelligent gap policies that are suited for correlation (replacing with mean, expectation-maximization, nearest-neighbor, etc.).
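A small Python sketch of how the two policies would treat paired series before the correlation is computed. It assumes gaps arrive as None (as a null bucket value would after JSON decoding); the function name apply_gap_policy is hypothetical and this is not how the Elasticsearch internals are written.

```python
def apply_gap_policy(first, second, gap_policy="skip"):
    """Align two series bucket by bucket and resolve gaps (None values).

    "skip" drops the whole pair when either side is missing;
    "insert_zeros" keeps the pair and replaces the missing side with 0.0.
    """
    if gap_policy not in ("skip", "insert_zeros"):
        raise ValueError("unsupported gap_policy: %r" % gap_policy)
    kept_first, kept_second = [], []
    for a, b in zip(first, second):
        if a is None or b is None:
            if gap_policy == "skip":
                continue                      # drop the pair entirely
            a = 0.0 if a is None else a       # insert_zeros: fill the gap
            b = 0.0 if b is None else b
        kept_first.append(a)
        kept_second.append(b)
    return kept_first, kept_second


first = [1.0, None, 3.0, 4.0]
second = [2.0, 5.0, None, 8.0]
print(apply_gap_policy(first, second, "skip"))
print(apply_gap_policy(first, second, "insert_zeros"))
```

With skip, only the first and last buckets survive. With insert_zeros, all four buckets are kept, but the filled-in zeros pull both means down and can distort the coefficient, which is why that policy is a poor fit here.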

panffeng · 2018-02-18T05:16:34Z

Hi @polyfractal, I wrote a very basic version with a basic test here: panffeng@d64f3b2

The correlation aggregator extends SiblingPipelineAggregator directly.

Multiple lags are not yet supported. Lags are used much more frequently for autocorrelation, so my plan is to compute autocorrelation with lags and cross-correlation with zero lag. Does this make sense?

Also, is there any code example for the new plugin SPI you mentioned earlier? Or is it for plugins only? I want to add an implementation of the FFT approach.

Thanks.

markharwood · 2018-03-15T17:26:47Z

cc @elastic/es-search-aggs

polyfractal · 2018-06-04T21:10:49Z

Hiya @panffeng, sorry for the delay. This slipped through my inbox and I didn't notice your reply.

We don't have any examples of a plugin using a pipeline aggregation, the closest is a module that adds the matrix-stats aggregation (https://github.com/elastic/elasticsearch/tree/master/modules/aggs-matrix-stats). That would contain a lot of similar boilerplate, but probably not identical. There may be parts that just don't work with trying to plug a pipeline agg in yet either... I'm not sure, would have to dig into the code closer.

I quickly skimmed your commit, only note is that if possible you could try using the newer static parser style (like this: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/search/aggregations/pipeline/movfn/MovFnPipelineAggregationBuilder.java#L62). It tends to be easier to read since it's more compact. It isn't always possible to use however, depending on what you need to do.

It'd probably be easiest to move this forward as a PR so we can comment on it directly and help out, if you're still interested in working on it.

Sorry again for the delay!

polyfractal · 2021-03-18T19:29:26Z

Closing as this seems to have stalled. If this is still of interest, feel free to open a PR and the team can revisit!

DaveCTurner added the :Analytics/Aggregations label Dec 26, 2017

DaveCTurner assigned colings86 Dec 26, 2017

jpountz added the >feature label Dec 26, 2017

colings86 assigned polyfractal and unassigned colings86 Jan 15, 2018

rjernst added the Team:Analytics label (meta label for the analytical engine team: ESQL/Aggs/Geo) May 4, 2020

polyfractal closed this as completed Mar 18, 2021
