Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Implement FIRST/LAST aggregate functions #37936

Merged
merged 11 commits into from
Jan 31, 2019
Merged

Conversation

matriv
Copy link
Contributor

@matriv matriv commented Jan 28, 2019

FIRST and LAST can be used with one argument and work similarly to MIN
and MAX but they are implemented using a Top Hits aggregation and
therefore can also operate on keyword fields. When a second argument is
provided then they return the first/last value of the first arg when its
values are ordered ascending/descending (respectively) by the values of
the second argument. Currently because of the usage of a Top Hits
aggregation FIRST and LAST cannot be used in the HAVING clause of a
GROUP BY query to filter on the results of the aggregation.

Closes: #35639

FIRST and LAST can be used with one argument and work similarly to MIN
and MAX but they are implemented using a Top Hits aggregation and
therefore can also operate on keyword fields. When a second argument is
provided then they return the first/last value of the first arg when its
values are ordered ascending/descending (respectively) by the values of
the second argument. Currently because of the usage of a Top Hits
aggregation FIRST and LAST cannot be used in the HAVING clause of a
GROUP BY query to filter on the results of the aggregation.

Closes: elastic#35639
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall however there some things that need improving.
In particular the rewrite optimization I don't think is complete and probably needs a follow-up PR.


.Synopsis:
[source, sql]
--------------------------------------------------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for optional arguments, it would be nice to add an example with FIRST just with the first argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, for synopsis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@costin our documentation, until now, lists one version of a function in synopsis where the optional arguments are between square brackets. See https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/string.asciidoc#locate and https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/math.asciidoc#round (and others). Listing all versions of a function (with/without optional params) imo would look a bit crowded and not-so-clean and sleek. In this particular case of FIRST/FIRST_VALUE/LAST/LAST_VALUE means we need to list four versions of the same function in the synopsis and I personally believe it doesn't look that great.

Do you believe that, maybe, it is not clear that parameter is optional? (the square brackets and the description saying "optional")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the optional parameter but the method name that I'd like to be exemplified through a query to make the function name more obvious.
It might just be easier to create a separate section for the alias, mentioning that it's an alias for ...
This can be done in a separate PR though since it's just documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.Synopsis:
[source, sql]
--------------------------------------------------
FIRST(field_name<1>, sort_by_field_name<2>)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort_by_field_name -> ordering field


*Input*:

<1> a field name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<1> target field for the aggregation
<2> optional field used for ordering


[cols="<,<"]
|===
s|a|b
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use indentation to have the table columns aligned.

@@ -113,6 +113,129 @@ Returns the total number of _distinct non-null_ values in input values.
include-tagged::{sql-specs}/docs.csv-spec[aggCountDistinct]
--------------------------------------------------

[[sql-functions-aggs-first]]
===== `FIRST/FIRST_VALUE`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally please add an example for FIRST_VALUE.

return parameters().isEmpty() ? null : parameters().get(0);
}

public DataType sortFieldDataType() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this method is needed - one can always say sortField().dataType().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that to avoid the caller checking if the sortField() is null.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All expressions return only one datatype - of their result, adding a new method is confusing. It's simple to do the null check - in fact, it's is mandatory for sorting since one cannot just add a null, null directive (if the sort field would be null).

super(source, field, sortField != null ? Collections.singletonList(sortField) : Collections.emptyList());
}

public Expression sortField() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why sort is used instead of order/orderby?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, can change.

AggregationBuilder toBuilder() {
// Sort missing values (NULLs) as last to get the first/last non-null value
List<SortBuilder<?>> sortBuilderList;
if (sortField != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here is redundant since the field ordering code is duplicated.

if (sortField!=null) {
  sort.add(fieldSortBuilder(sortField);
}

sort.add(fieldSortBuilder(fieldName());

// Sort missing values (NULLs) as last to get the first/last non-null value
List<SortBuilder<?>> sortBuilderList;
if (sortField != null) {
sortBuilderList = Arrays.asList(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To increase readability please use one method per line.

@@ -240,4 +243,8 @@ public static DataType fromTypeName(String esType) {
return DataType.UNSUPPORTED;
}
}

public String format() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not okay conceptually and implementation wise since the base class is aware of its child type.
Make format() return DocValueFieldsContext by default and override DateTime and Date to return DATE_PARSE_FORMA - basic overriding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is an enum, I don't get it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me rephrase this - a dataType can have different formats. Currently we store that not in the DataType but rather inside the DateEsField as it can vary from field to field.
Ironically due to our usage of field_caps this information is not accessible anymore however it might be in the future.
DateEsField allows a custom format to be specified though I'm not sure this is something we need to use necessarily.

If the format could be used on DateEsField I would opt for that, if not, it can stay as is in the DataType but the two classes need to be aligned so the format is defined and stored only in one place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The getFormats() on DateEsField is currently not used anywhere so I'll remove it from there to avoid any confusion.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Left few minor comments.
Also, I would like to see tests for all values NULL. Something like SELECT FIRST(whatever) FROM bla WHERE whatever IS NULL GROUP BY X.
Also, tests that catch the error generated when FIRST/LAST/MIN/MAX is used in HAVING.

.Synopsis:
[source, sql]
--------------------------------------------------
FIRST(field_name<1>, sort_by_field_name<2>)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sort_by_field_name is optional, I think you need to put it between square brackets to remain consistent with the rest of the documentation.


.Description:

When only one argument is provided it returns the first **non-NULL** value across input values in the field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"When only the first argument is provided".... (only the second one is optional).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first non-null value is returned regardless of the presence of the optional argument.
I would rephrase this to "Returns the first non-NULL value (if it exists) of the input column sorted by the sort_column. If sort_column is missing, the field_name column is used for sorting`. or something along those lines.

-----------------------------------------------------------

[NOTE]
`FIRST` cannot be used in to create a filter in a `HAVING` clause of a `GROUP BY` query.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "in" is not needed here?
Also, worth putting this as a limitation in the docs? (meaning, in the dedicated Limitations page)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIRST cannot be used in a HAVING clause (no need to specify GROUP BY as that's understood and also might be implicit).

.Synopsis:
[source, sql]
--------------------------------------------------
LAST(field_name<1>, sort_by_field_name<2>)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above for optional fields and square brackets.


.Description:

It's the inverse of <<sql-functions-aggs-first>>. When only one argument is provided it returns the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"first argument" instead of "one argument".


import static org.elasticsearch.common.logging.LoggerMessageFormat.format;

public abstract class TopHits extends AggregateFunction {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add a comment describing what the class is used for.

List<SortBuilder<?>> sortBuilderList;
if (sortField != null) {
sortBuilderList = Arrays.asList(
new FieldSortBuilder(sortField).order(sortOrder).missing(Sort.Missing.LAST.position())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe import static for LAST would make the code a bit more readable.

}
TopHitsAgg that = (TopHitsAgg) o;
return Objects.equals(sortField, that.sortField) &&
sortOrder == that.sortOrder &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picky, and I apologize, but please put the && at start of the line. I think this is the consistent approach for this in code overall.

@@ -161,6 +284,9 @@ Returns the minimum value across input values in the field `field_name`.
include-tagged::{sql-specs}/docs.csv-spec[aggMin]
--------------------------------------------------

[NOTE]
`MIN` on a field of type <<text, `text`>> or <<keyword, `keyword`>> is translated into <<sql-functions-aggs-first>>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I'm assuming MIN/MAX on keywords cannot be used in HAVINGs...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's why I made the link here. Do you think I should add the info about HAVING here too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a new batch of comments.

@matriv
Copy link
Contributor Author

matriv commented Jan 30, 2019

@costin @astefan Thank you for the review. Addressed comments.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments.

FIRST_VALUE(field_name<1>)

FIRST(field_name<1>, ordering_field_name<2>)
FIRST_VALUE(field_name<1>, ordering_field_name<2>)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Square brackets for optional fields, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But now we list both with 1 arg and 2args, do you think is still necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do it like this (all versions listed - with/without the optional parameter), our documentation will be inconsistent: https://github.com/elastic/elasticsearch/blob/master/docs/reference/sql/functions/string.asciidoc#locate. If I would have to choose between the two approaches, I would go with [ ].

Copy link
Contributor Author

@matriv matriv Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion you either have one signature with the 2nd arg in square brackets or you list the 2 variants. Personally I prefer the one signature and the square brackets. @costin what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opt for the signature with the second arguments in square brackets.


*Input*:

<1> target field for the aggregation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"aggregation" refers to technical implementation in the background. My personal approach would be not to expose this, but try to explain what the field is used for from the end-user perspective. Again, personal preference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily - for the aggregate function (aggregation is ES terminology, aggregate is SQL).

LAST_VALUE(field_name<1>)

LAST(field_name<1>, ordering_field_name<2>)
LAST_VALUE(field_name<1>, ordering_field_name<2>)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Square brackets for optional fields, please.

@@ -669,7 +669,7 @@ public void testTopHitsAggregationWithTwoArgs() {
assertTrue(eqe.output().get(0).dataType() == DataType.KEYWORD);
assertThat(eqe.queryContainer().aggs().asAggBuilder().toString().replaceAll("\\s+", ""),
endsWith("\"top_hits\":{\"from\":0,\"size\":1,\"version\":false,\"seq_no_primary_term\":false," +
"\"explain\":false,\"docvalue_fields\":[{\"field\":\"keyword\",\"format\":\"use_field_mapping\"}]," +
"\"explain\":false,\"docvalue_fields\":[{\"field\":\"keyword\"}]," +
"\"sort\":[{\"int\":{\"order\":\"desc\",\"missing\":\"_last\",\"unmapped_type\":\"integer\"}}," +
"{\"keyword\":{\"order\":\"desc\",\"missing\":\"_last\",\"unmapped_type\":\"keyword\"}}]}}}}}"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot wrap my head around missing option for top_hits in the context of first/last/min/max... It's about how null/missing values are treated when sorting: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html#_missing_values. Is it ok to keep missing with the default (_last), or I'm missing something, or there are no tests that have missing: _first scenario included...?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After I clicked "add comment" I realized these are about first/last non-NULL values. So yeah, we don't care about missing and we like the fact that the default is _last.

@matriv matriv merged commit 4710a74 into elastic:master Jan 31, 2019
@matriv matriv deleted the mt/impl-35639 branch January 31, 2019 14:33
matriv added a commit that referenced this pull request Jan 31, 2019
FIRST and LAST can be used with one argument and work similarly to MIN
and MAX but they are implemented using a Top Hits aggregation and
therefore can also operate on keyword fields. When a second argument is
provided then they return the first/last value of the first arg when its
values are ordered ascending/descending (respectively) by the values of
the second argument. Currently because of the usage of a Top Hits
aggregation FIRST and LAST cannot be used in the HAVING clause of a
GROUP BY query to filter on the results of the aggregation.

Closes: #35639
@matriv
Copy link
Contributor Author

matriv commented Jan 31, 2019

Backported to 6.x with 0b794d4

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 31, 2019
…ersion

* elastic/master:
  Do not set up NodeAndClusterIdStateListener in test (elastic#38110)
  ML: better handle task state race condition (elastic#38040)
  Soft-deletes policy should always fetch latest leases (elastic#37940)
  Handle scheduler exceptions (elastic#38014)
  Minor logging improvements (elastic#38084)
  Fix Painless void return bug (elastic#38046)
  Update PutFollowAction serialization post-backport (elastic#37989)
  fix a few versionAdded values in ElasticsearchExceptions (elastic#37877)
  Reenable BWC tests after backport of elastic#37899 (elastic#38093)
  Mute failing test
  Mute failing test
  Fail start on obsolete indices documentation (elastic#37786)
  SQL: Implement FIRST/LAST aggregate functions (elastic#37936)
  Un-mute NoMasterNodeIT.testNoMasterActionsWriteMasterBlock
  remove unused parser fields in RemoteResponseParsers
matriv added a commit to matriv/elasticsearch that referenced this pull request Apr 16, 2019
Although the translation rule was implemented in the `Optimizer`,
the rule was not added in the list of rules to be executed.

Relates to elastic#41195
Follows elastic#37936
matriv added a commit that referenced this pull request Apr 16, 2019
Although the translation rule was implemented in the `Optimizer`,
the rule was not added in the list of rules to be executed.

Relates to #41195
Follows #37936
matriv added a commit that referenced this pull request Apr 16, 2019
Although the translation rule was implemented in the `Optimizer`,
the rule was not added in the list of rules to be executed.

Relates to #41195
Follows #37936


(cherry picked from commit f426a33)
matriv added a commit that referenced this pull request Apr 16, 2019
Although the translation rule was implemented in the `Optimizer`,
the rule was not added in the list of rules to be executed.

Relates to #41195
Follows #37936


(cherry picked from commit f426a33)
matriv added a commit that referenced this pull request Apr 16, 2019
Although the translation rule was implemented in the `Optimizer`,
the rule was not added in the list of rules to be executed.

Relates to #41195
Follows #37936


(cherry picked from commit f426a33)
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
Although the translation rule was implemented in the `Optimizer`,
the rule was not added in the list of rules to be executed.

Relates to elastic#41195
Follows elastic#37936
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants