Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't use Date or String values in bucket_selector and bucket_script pipeline aggregations #23874

Open
Tracked by #82808
colings86 opened this issue Apr 3, 2017 · 13 comments
Labels
:Analytics/Aggregations Aggregations >bug high hanging fruit stalled Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@colings86
Copy link
Contributor

Reported via Discuss forum: https://discuss.elastic.co/t/bucket-selector-aggregation-on-date-histogram--key/80986

The bucket_script and bucket_selector aggregations use the BucketHelpers.resolveBucketValue() method to get the bucket_path values from the buckets. That method requires that the return value is a Double so currently it is required that all the bucket_paths are numeric values. This presents a problem when trying to use the _key of a bucket since the key might be a DateTime (in the case of a date_histogram or date_range aggs) or a String (in the case of a terms agg).

One thing to note here is that all current pipeline aggs except the bucket_script and bucket_selector require the value from the bucket_path to be a double, so whatever the solution to this bug we should maintain a route that guarantees a double to be returned.

I have a few thoughts on how to solve this but I don't know if which (if any) of them are a good idea yet:

  1. Change BucketHelpers.resolveBucketValue() to have a generic return type and somehow check that type is compatible before returning (not sure if this is possible since generics are not available at runtime)
  2. Add a public Object resolveBucketObject() method in BucketHelpers that just returns the value it gets from the bucket as long as its not an instance of InternalAggregation (So that its an actual value rather than an aggregation
  3. Let the bucket_script and bucket_selector aggs get the bucket path values directly using Bucket.getProperty()
@nik9000
Copy link
Member

nik9000 commented Apr 3, 2017

Bleh. I think this is a thing that'd be fixed pretty well with the script contexts we keep talking about. In that case we'd compile the script against one of a couple of interfaces (returning a double, returning a date, returning a string) and then adapt them to something useful for aggs. Or something like that.

Without them (because they aren't coming quickly) we could add a couple more instanceof checks....

@colings86
Copy link
Contributor Author

I don't really want to add the instanceof checks directly to that method since, as I mentioned, the rest of the pipeline aggregations rely on the value being a number.

@polyfractal
Copy link
Contributor

@elastic/es-search-aggs

@acarstoiu
Copy link

Ran into this and had a look at the BucketHelpers class. I'd go for solution 2 proposed above by @colings86, namely:

  • rename with an IDE the resolveBucketValue() method to resolveBucketNumericalValue() - this should modify all uses, except for the bucket_selector and bucket_script aggregations
  • inside BucketHelpers, create method resolveBucketValue() with the same parameter lists and code, except for the fact that it should merrily return anything that's not an aggregation
  • redirect the variant of resolveBucketNumericalValue() with a longer parameter list to its resolveBucketValue() counterpart, plus an instanceof Double check

@acarstoiu
Copy link

Uhm... and I would replace the "high" with a "low" in that label.

@polyfractal
Copy link
Contributor

Also related, recent discussions about exposing the string key of geo tiles: #39957 (comment)

@dfischl
Copy link

dfischl commented Aug 26, 2019

Hi! I was just curious if this is something that is still being worked on or if this is working already in ES 7.3?

@Fabiomad85
Copy link

I'm curious too

@graphaelli
Copy link
Member

Just ran into this issue while trying to calculate durations between events over time and was wondering if this is a workaround or if I'm misunderstanding the result:

Given a set of documents with a `@timestamp` field
{"index":{}}
{"@timestamp":"2020-01-01T12:00:00Z"}
{"index":{}}
{"@timestamp":"2020-01-01T12:00:00Z"}
{"index":{}}
{"@timestamp":"2020-01-01T12:10:30Z"}
{"index":{}}
{"@timestamp":"2020-01-01T13:00:00Z"}
{"index":{}}
{"@timestamp":"2020-01-01T13:00:00Z"}
{"index":{}}
{"@timestamp":"2020-01-01T13:01:00Z"}
{"index":{}}
{"@timestamp":"2020-01-01T13:02:00Z"}

this query fails:

{
  "size": 0,
  "aggs": {
    "permin": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "minute",
        "min_doc_count": 1,
        "keyed": false
      },
      "aggs": {
        "diff": {
          "serial_diff": {
            "buckets_path": "_key",
            "lag": 1
          }
        }
      }
    }
  }
}

with:

{
  "type" : "aggregation_execution_exception",
  "reason" : "buckets_path must reference either a number value or a single value numeric metric aggregation, got: [ZonedDateTime] at aggregation [_key]"
}

while this one, with a min aggregation to get the timestamp to something that will serial_diff runs without exception:

{
  "size": 0,
  "aggs": {
    "permin": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "minute",
        "min_doc_count": 1,
        "keyed": false
      },
      "aggs": {
        "minpermin": {
          "min": {
            "field": "@timestamp"
          }
        },
        "diff": {
          "serial_diff": {
            "buckets_path": "minpermin",
            "lag": 1
          }
        }
      }
    }
  }
}

@graphaelli
Copy link
Member

related to #54110

@rjernst rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020
palesz pushed a commit to palesz/elasticsearch that referenced this issue Mar 31, 2021
Previously we were not able to execute queries like the one below,
where we filter on the groupings of the `Aggregate`s. The reasons are
the limitations of the `bucket_selector`s `bucket_path` [0] that cannot
filter on non-numerical keys.

The implemented optimization transforms the following query:

```sql
SELECT i, a
FROM (
     SELECT int as i, AVG(float) as a
     FROM test
     GROUP BY int
) WHERE i = 1
```

To one equivalent to this:

```sql
SELECT i, a
FROM (
     SELECT int as i, AVG(float) as a
     FROM test
     WHERE int = 1
     GROUP BY int
)
```

The latter query can be translated to Query DSL.

The change also makes it possible to use a query with HAVING in a
subselect with filters in the outer query. We can have multiple levels
of queries on a single GROUP BY with filters on multiple levels. At
the end the filters will be collapsed, the conditions on the
aggregates will transformed into an equivalent of HAVING, while the
conditions on groupings will be pushed below the GROUP BY query.

Note: because of the limitation [0] mentioned above
(sub-)conditions that use expressions both on groupings and aggregates
of the GROUP BY and unsplittable (groupings and aggregates are not in
separate child of a conjunction) cannot be handled.

For example the following query will fail with a VerifierException:

```sql
SELECT *
FROM (
     SELECT int, count(*) as cnt
     FROM test
     GROUP BY int
)
WHERE int > 2 OR cnt > 1
```

because the `int > 2 OR cnt > 1` filters both on the groupings (`int >
2`) and the aggregates (`cnt > 1` that translates to `count(*) > 1`).

Note: Queries like

```sql
SELECT MAX(int) FROM test WHERE MAX(int) > 10
```

are still not allowed (HAVING should be used).

Also fixes elastic#69758, by making the queries work utilizing the filter
push-down, but note that there are changes in the
`EXPLAIN (PLAN ANALYZED)` outputs (`SubqueryAlias` node appears).

[0] elastic#23874
.
@salvatore-campagna salvatore-campagna self-assigned this Nov 2, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@salvatore-campagna
Copy link
Contributor

salvatore-campagna commented Nov 9, 2022

Ran into this and had a look at the BucketHelpers class. I'd go for solution 2 proposed above by @colings86, namely:

  • rename with an IDE the resolveBucketValue() method to resolveBucketNumericalValue() - this should modify all uses, except for the bucket_selector and bucket_script aggregations
  • inside BucketHelpers, create method resolveBucketValue() with the same parameter lists and code, except for the fact that it should merrily return anything that's not an aggregation
  • redirect the variant of resolveBucketNumericalValue() with a longer parameter list to its resolveBucketValue() counterpart, plus an instanceof Double check

I had a look at this trying to implement it using something very similar to the second bullet point but I see the following.

The logic that resolves the value pointed by the path (which returns a Double as it before applying the change) is returning:

  1. For single buckets aggregations: the result of calling getKey on the Internal bucket representation for each aggregation. Most of the times this getKey just delegates to getKeyAsString. This means most of the times the resolution logic returns a String which for a numeric field can just be converted to a Double (this is how returning a double works normally). One exception is represented by date histogram whose getKey returns a ZonedDateTime. As a result, it looks like we just need to support String and ZonedDateTime (other than numeric values which are normally converted to Double from a String).
  2. For multi buckets aggregations: it returns an array where each item is the result of calling getKey with the same logic as above.

As a result I believe the actual issue is in having getKey delegate to getKeyAsString. I believe having getKey return the actual subclass of Object is a better solution, whereas we can keep the getKeyAsString as the implementation returning the String representation of the bucket key. Returning an actual subclass of Object from getKey allows controlling comparison and the calling code to eventually use the getKeyAsString as a way to try to deal with Strings and the object string representation.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >bug high hanging fruit stalled Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests