SQL: Implement FIRST/LAST aggregate functions #37936

matriv · 2019-01-28T16:57:33Z

FIRST and LAST can be used with one argument and work similarly to MIN
and MAX but they are implemented using a Top Hits aggregation and
therefore can also operate on keyword fields. When a second argument is
provided then they return the first/last value of the first arg when its
values are ordered ascending/descending (respectively) by the values of
the second argument. Currently because of the usage of a Top Hits
aggregation FIRST and LAST cannot be used in the HAVING clause of a
GROUP BY query to filter on the results of the aggregation.

Closes: #35639

FIRST and LAST can be used with one argument and work similarly to MIN and MAX but they are implemented using a Top Hits aggregation and therefore can also operate on keyword fields. When a second argument is provided then they return the first/last value of the first arg when its values are ordered ascending/descending (respectively) by the values of the second argument. Currently because of the usage of a Top Hits aggregation FIRST and LAST cannot be used in the HAVING clause of a GROUP BY query to filter on the results of the aggregation. Closes: elastic#35639

elasticmachine · 2019-01-28T16:57:36Z

Pinging @elastic/es-search

costin

Looks good overall however there some things that need improving.
In particular the rewrite optimization I don't think is complete and probably needs a follow-up PR.

costin · 2019-01-28T23:00:36Z

docs/reference/sql/functions/aggs.asciidoc

+
+.Synopsis:
+[source, sql]
+--------------------------------------------------


I think for optional arguments, it would be nice to add an example with FIRST just with the first argument.

You mean here in the synopsis? because we have an example here: https://github.com/elastic/elasticsearch/pull/37936/files/1be7ba194c950ab175ca0e35d927ab0b54306be2#diff-f71f0c2cd27202241aec808b033b8510R166

Right, for synopsis.

@costin our documentation, until now, lists one version of a function in synopsis where the optional arguments are between square brackets. See https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/string.asciidoc#locate and https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/math.asciidoc#round (and others). Listing all versions of a function (with/without optional params) imo would look a bit crowded and not-so-clean and sleek. In this particular case of FIRST/FIRST_VALUE/LAST/LAST_VALUE means we need to list four versions of the same function in the synopsis and I personally believe it doesn't look that great.

Do you believe that, maybe, it is not clear that parameter is optional? (the square brackets and the description saying "optional")

It's not the optional parameter but the method name that I'd like to be exemplified through a query to make the function name more obvious.
It might just be easier to create a separate section for the alias, mentioning that it's an alias for ...
This can be done in a separate PR though since it's just documentation.

Added here: https://github.com/elastic/elasticsearch/pull/37936/files#diff-f71f0c2cd27202241aec808b033b8510R198

costin · 2019-01-28T23:01:09Z

docs/reference/sql/functions/aggs.asciidoc

+.Synopsis:
+[source, sql]
+--------------------------------------------------
+FIRST(field_name<1>, sort_by_field_name<2>)


sort_by_field_name -> ordering field

costin · 2019-01-28T23:03:02Z

docs/reference/sql/functions/aggs.asciidoc

+
+*Input*:
+
+<1> a field name


<1> target field for the aggregation
<2> optional field used for ordering

costin · 2019-01-28T23:03:46Z

docs/reference/sql/functions/aggs.asciidoc

+
+[cols="<,<"]
+|===
+s|a|b


Please use indentation to have the table columns aligned.

costin · 2019-01-28T23:04:48Z

docs/reference/sql/functions/aggs.asciidoc

 --------------------------------------------------

+[[sql-functions-aggs-first]]
+===== `FIRST/FIRST_VALUE`


Additionally please add an example for FIRST_VALUE.

costin · 2019-01-28T23:25:03Z

...gin/sql/src/main/java/org/elasticsearch/xpack/sql/expression/function/aggregate/TopHits.java

+        return parameters().isEmpty() ? null : parameters().get(0);
+    }
+
+    public DataType sortFieldDataType() {


I don't think this method is needed - one can always say sortField().dataType().

I did that to avoid the caller checking if the sortField() is null.

All expressions return only one datatype - of their result, adding a new method is confusing. It's simple to do the null check - in fact, it's is mandatory for sorting since one cannot just add a null, null directive (if the sort field would be null).

costin · 2019-01-28T23:25:20Z

...gin/sql/src/main/java/org/elasticsearch/xpack/sql/expression/function/aggregate/TopHits.java

+        super(source, field, sortField != null ? Collections.singletonList(sortField) : Collections.emptyList());
+    }
+
+    public Expression sortField() {


Any reason why sort is used instead of order/orderby?

Not really, can change.

costin · 2019-01-28T23:27:40Z

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/querydsl/agg/TopHitsAgg.java

+    AggregationBuilder toBuilder() {
+        // Sort missing values (NULLs) as last to get the first/last non-null value
+        List<SortBuilder<?>> sortBuilderList;
+        if (sortField != null) {


The code here is redundant since the field ordering code is duplicated.

if (sortField!=null) { sort.add(fieldSortBuilder(sortField); } sort.add(fieldSortBuilder(fieldName());

costin · 2019-01-28T23:28:11Z

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/querydsl/agg/TopHitsAgg.java

+        // Sort missing values (NULLs) as last to get the first/last non-null value
+        List<SortBuilder<?>> sortBuilderList;
+        if (sortField != null) {
+            sortBuilderList = Arrays.asList(


To increase readability please use one method per line.

costin · 2019-01-28T23:30:02Z

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/type/DataType.java

        }
    }
+
+    public String format() {


Not okay conceptually and implementation wise since the base class is aware of its child type.
Make format() return DocValueFieldsContext by default and override DateTime and Date to return DATE_PARSE_FORMA - basic overriding.

But this is an enum, I don't get it.

Let me rephrase this - a dataType can have different formats. Currently we store that not in the DataType but rather inside the DateEsField as it can vary from field to field.
Ironically due to our usage of field_caps this information is not accessible anymore however it might be in the future.
DateEsField allows a custom format to be specified though I'm not sure this is something we need to use necessarily.

If the format could be used on DateEsField I would opt for that, if not, it can stay as is in the DataType but the two classes need to be aligned so the format is defined and stored only in one place.

The getFormats() on DateEsField is currently not used anywhere so I'll remove it from there to avoid any confusion.

astefan

Looking good. Left few minor comments.
Also, I would like to see tests for all values NULL. Something like SELECT FIRST(whatever) FROM bla WHERE whatever IS NULL GROUP BY X.
Also, tests that catch the error generated when FIRST/LAST/MIN/MAX is used in HAVING.

astefan · 2019-01-29T08:16:26Z

docs/reference/sql/functions/aggs.asciidoc

+.Synopsis:
+[source, sql]
+--------------------------------------------------
+FIRST(field_name<1>, sort_by_field_name<2>)


If sort_by_field_name is optional, I think you need to put it between square brackets to remain consistent with the rest of the documentation.

astefan · 2019-01-29T08:17:50Z

docs/reference/sql/functions/aggs.asciidoc

+
+.Description:
+
+When only one argument is provided it returns the first **non-NULL** value across input values in the field


"When only the first argument is provided".... (only the second one is optional).

The first non-null value is returned regardless of the presence of the optional argument.
I would rephrase this to "Returns the first non-NULL value (if it exists) of the input column sorted by the sort_column. If sort_column is missing, the field_name column is used for sorting`. or something along those lines.

astefan · 2019-01-29T08:19:23Z

docs/reference/sql/functions/aggs.asciidoc

+-----------------------------------------------------------
+
+[NOTE]
+`FIRST` cannot be used in to create a filter in a `HAVING` clause of a `GROUP BY` query.


I think "in" is not needed here?
Also, worth putting this as a limitation in the docs? (meaning, in the dedicated Limitations page)

FIRST cannot be used in a HAVING clause (no need to specify GROUP BY as that's understood and also might be implicit).

astefan · 2019-01-29T09:02:42Z

docs/reference/sql/functions/aggs.asciidoc

+.Synopsis:
+[source, sql]
+--------------------------------------------------
+LAST(field_name<1>, sort_by_field_name<2>)


Same as above for optional fields and square brackets.

astefan · 2019-01-29T09:03:10Z

docs/reference/sql/functions/aggs.asciidoc

+
+.Description:
+
+It's the inverse of <<sql-functions-aggs-first>>. When only one argument is provided it returns the


"first argument" instead of "one argument".

...rc/main/java/org/elasticsearch/xpack/sql/execution/search/extractor/TopHitsAggExtractor.java

astefan · 2019-01-29T09:45:58Z

...gin/sql/src/main/java/org/elasticsearch/xpack/sql/expression/function/aggregate/TopHits.java

+
+import static org.elasticsearch.common.logging.LoggerMessageFormat.format;
+
+public abstract class TopHits extends AggregateFunction {


Please, add a comment describing what the class is used for.

astefan · 2019-01-29T10:40:19Z

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/querydsl/agg/TopHitsAgg.java

+        List<SortBuilder<?>> sortBuilderList;
+        if (sortField != null) {
+            sortBuilderList = Arrays.asList(
+                new FieldSortBuilder(sortField).order(sortOrder).missing(Sort.Missing.LAST.position())


And maybe import static for LAST would make the code a bit more readable.

astefan · 2019-01-29T10:41:47Z

x-pack/plugin/sql/src/main/java/org/elasticsearch/xpack/sql/querydsl/agg/TopHitsAgg.java

+        }
+        TopHitsAgg that = (TopHitsAgg) o;
+        return Objects.equals(sortField, that.sortField) &&
+            sortOrder == that.sortOrder &&


Picky, and I apologize, but please put the && at start of the line. I think this is the consistent approach for this in code overall.

astefan · 2019-01-29T11:01:02Z

docs/reference/sql/functions/aggs.asciidoc

 --------------------------------------------------

+[NOTE]
+`MIN` on a field of type <<text, `text`>> or <<keyword, `keyword`>> is translated into <<sql-functions-aggs-first>>.


And I'm assuming MIN/MAX on keywords cannot be used in HAVINGs...

Yes, that's why I made the link here. Do you think I should add the info about HAVING here too?

Yes, please.

costin

Left a new batch of comments.

matriv · 2019-01-30T19:09:59Z

@costin @astefan Thank you for the review. Addressed comments.

astefan

Left some minor comments.

astefan · 2019-01-31T07:32:05Z

docs/reference/sql/functions/aggs.asciidoc

+FIRST_VALUE(field_name<1>)
+
+FIRST(field_name<1>, ordering_field_name<2>)
+FIRST_VALUE(field_name<1>, ordering_field_name<2>)


Square brackets for optional fields, please.

But now we list both with 1 arg and 2args, do you think is still necessary?

If we do it like this (all versions listed - with/without the optional parameter), our documentation will be inconsistent: https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/string.asciidoc#locate. If I would have to choose between the two approaches, I would go with [ ].

In my opinion you either have one signature with the 2nd arg in square brackets or you list the 2 variants. Personally I prefer the one signature and the square brackets. @costin what do you think?

I opt for the signature with the second arguments in square brackets.

astefan · 2019-01-31T07:35:11Z

docs/reference/sql/functions/aggs.asciidoc

+
+*Input*:
+
+<1> target field for the aggregation


"aggregation" refers to technical implementation in the background. My personal approach would be not to expose this, but try to explain what the field is used for from the end-user perspective. Again, personal preference.

Not necessarily - for the aggregate function (aggregation is ES terminology, aggregate is SQL).

astefan · 2019-01-31T07:35:43Z

docs/reference/sql/functions/aggs.asciidoc

+LAST_VALUE(field_name<1>)
+
+LAST(field_name<1>, ordering_field_name<2>)
+LAST_VALUE(field_name<1>, ordering_field_name<2>)


Square brackets for optional fields, please.

astefan · 2019-01-31T07:53:40Z

x-pack/plugin/sql/src/test/java/org/elasticsearch/xpack/sql/planner/QueryTranslatorTests.java

-                    "\"explain\":false,\"docvalue_fields\":[{\"field\":\"keyword\",\"format\":\"use_field_mapping\"}]," +
+                    "\"explain\":false,\"docvalue_fields\":[{\"field\":\"keyword\"}]," +
                    "\"sort\":[{\"int\":{\"order\":\"desc\",\"missing\":\"_last\",\"unmapped_type\":\"integer\"}}," +
                    "{\"keyword\":{\"order\":\"desc\",\"missing\":\"_last\",\"unmapped_type\":\"keyword\"}}]}}}}}"));


I cannot wrap my head around missing option for top_hits in the context of first/last/min/max... It's about how null/missing values are treated when sorting: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html#_missing_values. Is it ok to keep missing with the default (_last), or I'm missing something, or there are no tests that have missing: _first scenario included...?

After I clicked "add comment" I realized these are about first/last non-NULL values. So yeah, we don't care about missing and we like the fact that the default is _last.

FIRST and LAST can be used with one argument and work similarly to MIN and MAX but they are implemented using a Top Hits aggregation and therefore can also operate on keyword fields. When a second argument is provided then they return the first/last value of the first arg when its values are ordered ascending/descending (respectively) by the values of the second argument. Currently because of the usage of a Top Hits aggregation FIRST and LAST cannot be used in the HAVING clause of a GROUP BY query to filter on the results of the aggregation. Closes: #35639

matriv · 2019-01-31T15:29:29Z

Backported to 6.x with 0b794d4

…ersion * elastic/master: Do not set up NodeAndClusterIdStateListener in test (elastic#38110) ML: better handle task state race condition (elastic#38040) Soft-deletes policy should always fetch latest leases (elastic#37940) Handle scheduler exceptions (elastic#38014) Minor logging improvements (elastic#38084) Fix Painless void return bug (elastic#38046) Update PutFollowAction serialization post-backport (elastic#37989) fix a few versionAdded values in ElasticsearchExceptions (elastic#37877) Reenable BWC tests after backport of elastic#37899 (elastic#38093) Mute failing test Mute failing test Fail start on obsolete indices documentation (elastic#37786) SQL: Implement FIRST/LAST aggregate functions (elastic#37936) Un-mute NoMasterNodeIT.testNoMasterActionsWriteMasterBlock remove unused parser fields in RemoteResponseParsers

Although the translation rule was implemented in the `Optimizer`, the rule was not added in the list of rules to be executed. Relates to elastic#41195 Follows elastic#37936

Although the translation rule was implemented in the `Optimizer`, the rule was not added in the list of rules to be executed. Relates to #41195 Follows #37936

Although the translation rule was implemented in the `Optimizer`, the rule was not added in the list of rules to be executed. Relates to #41195 Follows #37936 (cherry picked from commit f426a33)

Although the translation rule was implemented in the `Optimizer`, the rule was not added in the list of rules to be executed. Relates to elastic#41195 Follows elastic#37936

matriv added >feature v7.0.0 :Analytics/SQL SQL querying v6.7.0 labels Jan 28, 2019

matriv requested review from astefan and costin January 28, 2019 16:57

costin requested changes Jan 28, 2019

View reviewed changes

astefan reviewed Jan 29, 2019

View reviewed changes

costin reviewed Jan 30, 2019

View reviewed changes

matriv added 3 commits January 30, 2019 20:49

Address comments

3dd561a

Merge remote-tracking branch 'upstream/master' into mt/impl-35639

cea52e1

Fix after master merged

43092ad

matriv added 2 commits January 30, 2019 21:13

fixup! Fix after master merged

cd4df06

fixup! fixup! Fix after master merged

f2a516d

astefan reviewed Jan 31, 2019

View reviewed changes

astefan approved these changes Jan 31, 2019

View reviewed changes

costin approved these changes Jan 31, 2019

View reviewed changes

matriv and others added 5 commits January 31, 2019 11:28

Address comments

167ecf6

Merge remote-tracking branch 'upstream/master' into mt/impl-35639

83daf26

Add limitations

fea067f

Remove reference to tophits

04c0e14

Remove ref to fielddata

2ca0724

matriv merged commit 4710a74 into elastic:master Jan 31, 2019

matriv deleted the mt/impl-35639 branch January 31, 2019 14:33

costin mentioned this pull request Feb 2, 2019

SQL: Fix esType for DATETIME/DATE and INTERVALS #38179

Merged

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

matriv mentioned this pull request Apr 16, 2019

SQL: Translate MIN/MAX on keyword fields as FIRST/LAST #41247

Merged


		.Description:

		When only one argument is provided it returns the first non-NULL value across input values in the field


		.Description:

		It's the inverse of <<sql-functions-aggs-first>>. When only one argument is provided it returns the


		import static org.elasticsearch.common.logging.LoggerMessageFormat.format;

		public abstract class TopHits extends AggregateFunction {

SQL: Implement FIRST/LAST aggregate functions #37936

SQL: Implement FIRST/LAST aggregate functions #37936

Uh oh!

Conversation

matriv commented Jan 28, 2019

Uh oh!

elasticmachine commented Jan 28, 2019

Uh oh!

costin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

astefan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

matriv Jan 31, 2019 •

edited

Loading