Can't use Date or String values in `bucket_selector` and `bucket_script` pipeline aggregations #23874

colings86 · 2017-04-03T14:01:42Z

Reported via Discuss forum: https://discuss.elastic.co/t/bucket-selector-aggregation-on-date-histogram--key/80986

The bucket_script and bucket_selector aggregations use the BucketHelpers.resolveBucketValue() method to get the bucket_path values from the buckets. That method requires that the return value is a Double so currently it is required that all the bucket_paths are numeric values. This presents a problem when trying to use the _key of a bucket since the key might be a DateTime (in the case of a date_histogram or date_range aggs) or a String (in the case of a terms agg).

One thing to note here is that all current pipeline aggs except the bucket_script and bucket_selector require the value from the bucket_path to be a double, so whatever the solution to this bug we should maintain a route that guarantees a double to be returned.

I have a few thoughts on how to solve this but I don't know if which (if any) of them are a good idea yet:

Change BucketHelpers.resolveBucketValue() to have a generic return type and somehow check that type is compatible before returning (not sure if this is possible since generics are not available at runtime)
Add a public Object resolveBucketObject() method in BucketHelpers that just returns the value it gets from the bucket as long as its not an instance of InternalAggregation (So that its an actual value rather than an aggregation
Let the bucket_script and bucket_selector aggs get the bucket path values directly using Bucket.getProperty()

The text was updated successfully, but these errors were encountered:

nik9000 · 2017-04-03T14:05:13Z

Bleh. I think this is a thing that'd be fixed pretty well with the script contexts we keep talking about. In that case we'd compile the script against one of a couple of interfaces (returning a double, returning a date, returning a string) and then adapt them to something useful for aggs. Or something like that.

Without them (because they aren't coming quickly) we could add a couple more instanceof checks....

colings86 · 2017-04-03T15:50:24Z

I don't really want to add the instanceof checks directly to that method since, as I mentioned, the rest of the pipeline aggregations rely on the value being a number.

polyfractal · 2018-03-22T18:05:00Z

@elastic/es-search-aggs

acarstoiu · 2019-03-16T16:39:36Z

Ran into this and had a look at the BucketHelpers class. I'd go for solution 2 proposed above by @colings86, namely:

rename with an IDE the resolveBucketValue() method to resolveBucketNumericalValue() - this should modify all uses, except for the bucket_selector and bucket_script aggregations
inside BucketHelpers, create method resolveBucketValue() with the same parameter lists and code, except for the fact that it should merrily return anything that's not an aggregation
redirect the variant of resolveBucketNumericalValue() with a longer parameter list to its resolveBucketValue() counterpart, plus an instanceof Double check

acarstoiu · 2019-03-16T16:44:02Z

Uhm... and I would replace the "high" with a "low" in that label.

polyfractal · 2019-03-18T13:44:14Z

Also related, recent discussions about exposing the string key of geo tiles: #39957 (comment)

dfischl · 2019-08-26T15:02:47Z

Hi! I was just curious if this is something that is still being worked on or if this is working already in ES 7.3?

Fabiomad85 · 2019-09-07T15:18:11Z

I'm curious too

graphaelli · 2020-03-21T01:37:24Z

Just ran into this issue while trying to calculate durations between events over time and was wondering if this is a workaround or if I'm misunderstanding the result:

Given a set of documents with a `@timestamp` field

{"index":{}}
{"@timestamp":"2020-01-01T12:00:00Z"}
{"index":{}}
{"@timestamp":"2020-01-01T12:00:00Z"}
{"index":{}}
{"@timestamp":"2020-01-01T12:10:30Z"}
{"index":{}}
{"@timestamp":"2020-01-01T13:00:00Z"}
{"index":{}}
{"@timestamp":"2020-01-01T13:00:00Z"}
{"index":{}}
{"@timestamp":"2020-01-01T13:01:00Z"}
{"index":{}}
{"@timestamp":"2020-01-01T13:02:00Z"}

this query fails:

{
  "size": 0,
  "aggs": {
    "permin": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "minute",
        "min_doc_count": 1,
        "keyed": false
      },
      "aggs": {
        "diff": {
          "serial_diff": {
            "buckets_path": "_key",
            "lag": 1
          }
        }
      }
    }
  }
}

with:

{
  "type" : "aggregation_execution_exception",
  "reason" : "buckets_path must reference either a number value or a single value numeric metric aggregation, got: [ZonedDateTime] at aggregation [_key]"
}

while this one, with a min aggregation to get the timestamp to something that will serial_diff runs without exception:

{
  "size": 0,
  "aggs": {
    "permin": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "minute",
        "min_doc_count": 1,
        "keyed": false
      },
      "aggs": {
        "minpermin": {
          "min": {
            "field": "@timestamp"
          }
        },
        "diff": {
          "serial_diff": {
            "buckets_path": "minpermin",
            "lag": 1
          }
        }
      }
    }
  }
}

graphaelli · 2020-03-26T18:54:32Z

related to #54110

Previously we were not able to execute queries like the one below, where we filter on the groupings of the `Aggregate`s. The reasons are the limitations of the `bucket_selector`s `bucket_path` [0] that cannot filter on non-numerical keys. The implemented optimization transforms the following query: ```sql SELECT i, a FROM ( SELECT int as i, AVG(float) as a FROM test GROUP BY int ) WHERE i = 1 ``` To one equivalent to this: ```sql SELECT i, a FROM ( SELECT int as i, AVG(float) as a FROM test WHERE int = 1 GROUP BY int ) ``` The latter query can be translated to Query DSL. The change also makes it possible to use a query with HAVING in a subselect with filters in the outer query. We can have multiple levels of queries on a single GROUP BY with filters on multiple levels. At the end the filters will be collapsed, the conditions on the aggregates will transformed into an equivalent of HAVING, while the conditions on groupings will be pushed below the GROUP BY query. Note: because of the limitation [0] mentioned above (sub-)conditions that use expressions both on groupings and aggregates of the GROUP BY and unsplittable (groupings and aggregates are not in separate child of a conjunction) cannot be handled. For example the following query will fail with a VerifierException: ```sql SELECT * FROM ( SELECT int, count(*) as cnt FROM test GROUP BY int ) WHERE int > 2 OR cnt > 1 ``` because the `int > 2 OR cnt > 1` filters both on the groupings (`int > 2`) and the aggregates (`cnt > 1` that translates to `count(*) > 1`). Note: Queries like ```sql SELECT MAX(int) FROM test WHERE MAX(int) > 10 ``` are still not allowed (HAVING should be used). Also fixes elastic#69758, by making the queries work utilizing the filter push-down, but note that there are changes in the `EXPLAIN (PLAN ANALYZED)` outputs (`SubqueryAlias` node appears). [0] elastic#23874 .

elasticsearchmachine · 2022-11-02T10:05:48Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

salvatore-campagna · 2022-11-09T14:06:30Z

Ran into this and had a look at the BucketHelpers class. I'd go for solution 2 proposed above by @colings86, namely:

rename with an IDE the resolveBucketValue() method to resolveBucketNumericalValue() - this should modify all uses, except for the bucket_selector and bucket_script aggregations

inside BucketHelpers, create method resolveBucketValue() with the same parameter lists and code, except for the fact that it should merrily return anything that's not an aggregation

redirect the variant of resolveBucketNumericalValue() with a longer parameter list to its resolveBucketValue() counterpart, plus an instanceof Double check

I had a look at this trying to implement it using something very similar to the second bullet point but I see the following.

The logic that resolves the value pointed by the path (which returns a Double as it before applying the change) is returning:

For single buckets aggregations: the result of calling getKey on the Internal bucket representation for each aggregation. Most of the times this getKey just delegates to getKeyAsString. This means most of the times the resolution logic returns a String which for a numeric field can just be converted to a Double (this is how returning a double works normally). One exception is represented by date histogram whose getKey returns a ZonedDateTime. As a result, it looks like we just need to support String and ZonedDateTime (other than numeric values which are normally converted to Double from a String).
For multi buckets aggregations: it returns an array where each item is the result of calling getKey with the same logic as above.

As a result I believe the actual issue is in having getKey delegate to getKeyAsString. I believe having getKey return the actual subclass of Object is a better solution, whereas we can keep the getKeyAsString as the implementation returning the String representation of the bucket key. Returning an actual subclass of Object from getKey allows controlling comparison and the calling code to eventually use the getKeyAsString as a way to try to deal with Strings and the object string representation.

elasticsearchmachine · 2024-07-02T19:34:47Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

colings86 added :Analytics/Aggregations Aggregations >bug discuss v5.4.0 v6.0.0-alpha1 labels Apr 3, 2017

clintongormley added v5.4.1 v6.0.0 and removed v5.4.0 v6.0.0-alpha1 labels Apr 28, 2017

clintongormley added v5.4.2 and removed v5.4.1 labels May 15, 2017

colings86 mentioned this issue May 25, 2017

Allow use of string variables in the bucket_script and bucket_selector aggregations #20694

Closed

clintongormley added v5.4.3 and removed v5.4.2 labels Jun 14, 2017

clintongormley added v5.4.4 and removed v5.4.3 labels Jun 27, 2017

lcawl added v6.0.1 and removed v6.0.0 labels Nov 13, 2017

lcawl added v6.0.2 and removed v6.0.1 labels Dec 6, 2017

jaymode added v6.0.3 and removed v6.0.2 labels Dec 13, 2017

$@polyfractal$ polyfractal removed v5.4.4 v6.0.3 labels Mar 22, 2018

tomcallahan added team-discuss and removed discuss labels Jul 5, 2018

colings86 added the high hanging fruit label Jul 16, 2018

colings86 removed the team-discuss label Jul 16, 2018

jcornez mentioned this issue May 17, 2019

Add more flexibility to MovingFunction window alignment #42181

Closed

rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020

palesz mentioned this issue Mar 31, 2021

SQL: Implement filter push-down for GROUP BYs #71116

Closed

wchaparro mentioned this issue Jan 19, 2022

Meta: Refactor pipeline aggregations #82808

Open

15 tasks

salvatore-campagna self-assigned this Nov 2, 2022

wchaparro added the stalled label Jul 2, 2024

wchaparro unassigned salvatore-campagna Jul 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can't use Date or String values in `bucket_selector` and `bucket_script` pipeline aggregations #23874

Can't use Date or String values in `bucket_selector` and `bucket_script` pipeline aggregations #23874

colings86 commented Apr 3, 2017

nik9000 commented Apr 3, 2017

colings86 commented Apr 3, 2017

polyfractal commented Mar 22, 2018

acarstoiu commented Mar 16, 2019

acarstoiu commented Mar 16, 2019

polyfractal commented Mar 18, 2019

dfischl commented Aug 26, 2019

Fabiomad85 commented Sep 7, 2019

graphaelli commented Mar 21, 2020

graphaelli commented Mar 26, 2020

elasticsearchmachine commented Nov 2, 2022

salvatore-campagna commented Nov 9, 2022 •

edited

Loading

elasticsearchmachine commented Jul 2, 2024

Can't use Date or String values in bucket_selector and bucket_script pipeline aggregations #23874

Can't use Date or String values in bucket_selector and bucket_script pipeline aggregations #23874

Comments

colings86 commented Apr 3, 2017

nik9000 commented Apr 3, 2017

colings86 commented Apr 3, 2017

polyfractal commented Mar 22, 2018

acarstoiu commented Mar 16, 2019

acarstoiu commented Mar 16, 2019

polyfractal commented Mar 18, 2019

dfischl commented Aug 26, 2019

Fabiomad85 commented Sep 7, 2019

graphaelli commented Mar 21, 2020

graphaelli commented Mar 26, 2020

elasticsearchmachine commented Nov 2, 2022

salvatore-campagna commented Nov 9, 2022 • edited Loading

elasticsearchmachine commented Jul 2, 2024

Can't use Date or String values in `bucket_selector` and `bucket_script` pipeline aggregations #23874

Can't use Date or String values in `bucket_selector` and `bucket_script` pipeline aggregations #23874

salvatore-campagna commented Nov 9, 2022 •

edited

Loading