Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Failing Group By queries due to different ExpressionIds #43072

Merged
merged 14 commits into from
Oct 31, 2019

Conversation

emasab
Copy link
Contributor

@emasab emasab commented Jun 10, 2019

Fix an issue that arises from the use of ExpressionIds as keys in a lookup map
that helps the QueryTranslator to identify the grouping columns. The issue is
that the same expression in different parts of the query (SELECT clause and GROUP BY clause)
ends up with different ExpressionIds so the lookup fails. So, instead of ExpressionIds
use the hashCode() of NamedExpression.

Fixes: #41159
Fixes: #40001
Fixes: #40240
Fixes: #33361
Fixes: #46316
Fixes: #36074
Fixes: #34543
Fixes: #37044

Fixes: #42041

@emasab emasab changed the title bugfix for https://github.com/elastic/elasticsearch/issues/42041 SQL: Failing Group By queries due to different ExpressionIds Jun 10, 2019
@alpar-t alpar-t added the :Analytics/SQL SQL querying label Jun 11, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @emasab!

First of all apologies for the delay on picking up this PR and many thanks for your effort here.
The issue with the ExpressionIds is quite complicated and there are plans to solve it overall through a substantial refactoring. Your changes though, solve a substantial subset of the issues
so we decided to go ahead and proceed with this fix. I've changed the title and description of the PR and link to the issues that it fixes.

For some cases (without aliases or numeric refs) queries still fail, e.g.:
SELECT PI() * emp_no FROM test_emp GROUP BY PI() * emp_no ORDER BY PI() * emp_no LIMIT 10

Moreover I'd like to ask from you to add some integration tests for the issues that this fix addresses:
In agg.sql-spec file: https://github.com/emasab/elasticsearch/blob/bugfix/42041-failing-group-by-queries/x-pack/plugin/sql/qa/src/main/resources/agg.sql-spec#L52 the following:

groupByCastScalar
SELECT CAST(ABS(EXTRACT(YEAR FROM "birth_date")) AS BIGINT) FROM test_emp GROUP BY CAST(ABS(EXTRACT(YEAR FROM "birth_date")) AS BIGINT) ORDER BY CAST(ABS(EXTRACT(YEAR FROM "birth_date")) AS BIGINT) NULLS FIRST;
groupByCastScalarWithAlias
SELECT CAST(ABS(EXTRACT(YEAR FROM "birth_date")) AS BIGINT) as "cast" FROM test_emp GROUP BY "cast" ORDER BY "cast" NULLS FIRST;

and in agg.csv-spec file: https://github.com/emasab/elasticsearch/blob/bugfix/42041-failing-group-by-queries/x-pack/plugin/sql/qa/src/main/resources/agg.csv-spec#L220
the following:

// Fails for H2
groupByCastScalarWithNumericRef
SELECT CAST(ABS(EXTRACT(YEAR FROM "birth_date")) AS BIGINT) FROM test_emp GROUP BY 1 ORDER BY 1 NULLS FIRST;

CAST(ABS(EXTRACT(YEAR FROM "birth_date")) AS BIGINT):l
------------------------------------------------------
null
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
;

groupByConvertScalar
SELECT CONVERT(ABS(EXTRACT(YEAR FROM "birth_date")), SQL_BIGINT) FROM test_emp GROUP BY CONVERT(ABS(EXTRACT(YEAR FROM "birth_date")), SQL_BIGINT) ORDER BY CONVERT(ABS(EXTRACT(YEAR FROM "birth_date")), SQL_BIGINT) NULLS FIRST;


CONVERT(ABS(EXTRACT(YEAR FROM "birth_date")), SQL_BIGINT):l
-----------------------------------------------------------
null
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
;


groupByConvertScalarWithAlias
SELECT CONVERT(ABS(EXTRACT(YEAR FROM "birth_date")), SQL_BIGINT) as "convert" FROM test_emp GROUP BY "convert" ORDER BY "convert" NULLS FIRST;

convert:l
---------
null
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
;

groupByConvertScalarWithNumericRef
SELECT CONVERT(ABS(EXTRACT(YEAR FROM "birth_date")), SQL_BIGINT) FROM test_emp GROUP BY 1 ORDER BY 1 NULLS FIRST;

CONVERT(ABS(EXTRACT(YEAR FROM "birth_date")), SQL_BIGINT):l
-----------------------------------------------------------
null
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
;

// AwaitsFix https://github.com/elastic/elasticsearch/issues/36074
//groupByConstantScalar
//SELECT PI() * emp_no FROM test_emp GROUP BY PI() * emp_no ORDER BY PI() * emp_no LIMIT 10;

//PI() * emp_no:d
//---------------
//31419.0681285515
//31422.2097212051
//31425.3513138587
//31428.4929065123
//31431.6344991659
//31434.7760918195
//31437.9176844731
//31441.0592771266
//31444.2008697802
//31447.3424624338
//;

groupByConstantScalarWithAlias
SELECT PI() * emp_no AS "value" FROM test_emp GROUP BY value ORDER BY value LIMIT 10;

value:d
-------
31419.0681285515
31422.2097212051
31425.3513138587
31428.4929065123
31431.6344991659
31434.7760918195
31437.9176844731
31441.0592771266
31444.2008697802
31447.3424624338
;

groupByConstantScalarWithNumericRef
SELECT PI() * emp_no FROM test_emp GROUP BY 1 ORDER BY 1 LIMIT 10;

PI() * emp_no:d
---------------
31419.0681285515
31422.2097212051
31425.3513138587
31428.4929065123
31431.6344991659
31434.7760918195
31437.9176844731
31441.0592771266
31444.2008697802
31447.3424624338
;

@@ -779,6 +779,85 @@ public void testAllCountVariantsWithHavingGenerateCorrectAggregations() {
+ "\"gap_policy\":\"skip\"}}}}}"));
}

public void testGroupSelection() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to split this up to more methods with more descriptive names and add more tests so that all 3 cases (no alias, alias, numeric reference) are covered, similarly to the integration tests in my comment: #43072 (review)
Of course some of the cases, as mentioned, should be skipped as they won't work even with this fix.

@emasab
Copy link
Contributor Author

emasab commented Jul 30, 2019

Hi @matriv , happy to hear from you. I understand, the checks to do are a lot before giving an ok. I'm going to include these integration tests, change the unit tests and see if there is a simple fix for the failing query that you've highlighted.

@emasab
Copy link
Contributor Author

emasab commented Aug 3, 2019

Fixed the groupByConstantScalar case too, mainly unifying the semantic of NamedExpression.equals to that of NamedExpression.hashCode . The ExpressionId is still there but is used only as a unique identifier for the object instance. Changed the semantic of Failure.equals because now two NamedExpression are equal if they have the same tree even if the position in the source is different so Failure takes into account the source position not the parsed node.

@jakelandis jakelandis added v7.3.2 and removed v7.3.1 labels Aug 22, 2019
@pcsanwald pcsanwald added v6.8.4 and removed v6.8.3 labels Aug 29, 2019
@colings86 colings86 added v7.5.0 and removed v7.4.0 labels Aug 30, 2019
@polyfractal polyfractal added v7.3.3 and removed v7.3.2 labels Sep 6, 2019
@emasab
Copy link
Contributor Author

emasab commented Sep 7, 2019

Sorry, the latest commit

4902015

even if more correct theorically because of the difference between hashCode and equals, solves the specified case and passes the unit tests, causes a lot of failures with the integration test org.elasticsearch.xpack.sql.qa.single_node.JdbcSqlSpecIT. The previous one

8720563

is not affected

@emasab
Copy link
Contributor Author

emasab commented Sep 10, 2019

With the 9b7878b I've solved the integration tests failures while keeping the equals and hashCode methods consistent. The order by constant scalar is still excluded. With my previous commit there was an incomplete change to ResolveMissingRefs that made it work but I've reverted it because this other query failed:

SELECT last_name FROM "test_emp" ORDER BY birth_date::TIME, emp_no LIMIT 5

@emasab
Copy link
Contributor Author

emasab commented Sep 19, 2019

In the latest commit these is a fix for groupByConstantScalar case with ORDER BY, while keeping the equals consistent with the hashCode, and the integration tests successful

The result of the ordering is still not correct in the DESC case or when multiplied by -1 but the query seems to be translated correctly before it's processed by the Querier

@emasab emasab force-pushed the bugfix/42041-failing-group-by-queries branch from e904df4 to 4111876 Compare September 20, 2019 18:20
@emasab
Copy link
Contributor Author

emasab commented Sep 20, 2019

rebased on current master
changed expected result according to the behavior enforced by testGeoShapeInWhereClause

@emasab emasab force-pushed the bugfix/42041-failing-group-by-queries branch from 4111876 to 4a57627 Compare September 25, 2019 22:48
@emasab emasab force-pushed the bugfix/42041-failing-group-by-queries branch from 4a57627 to 2ccb219 Compare October 9, 2019 21:42
@emasab
Copy link
Contributor Author

emasab commented Oct 9, 2019

In the latest commit there is a fix for the results ordering when there is a grouped ScalarFunctionAttribute in the ORDER BY clause, like:
SELECT PI() * emp_no FROM test_emp GROUP BY PI() * emp_no ORDER BY PI() * emp_no DESC LIMIT 10;

That should be the last change for this PR, unless other issues come out, you request some changes or rebases to master

@matriv
Copy link
Contributor

matriv commented Oct 29, 2019

@elasticmachine test this please

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @emasab,

Once again thanks a lot for your effort here.
There has been delay in getting back to you, as there were other features and bugs in our plate, and as stated before there is an ongoing effort of complete refactoring in the area of NamedExpression and ExpressionIds.

The state of the PR though looks good overall to me and would be nice to fix those issues independently of the refactoring that might take more time to be ready.

I left another round of comments and I'd like to do some more testing myself using your branch.

public static AttributeSet referencesExcept(Set<? extends Expression> exps,
AttributeSet except) {
AttributeSet ret = new AttributeSet();
while(exps.size()>0){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get the logic here, why do you need the while loop and revisit the exps which have become filteredExps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a breadth-first search where some nodes are removed during the traversal. At each loop the exps represents the previous list of tree nodes and filteredExps the expressions that aren't in the except set. At the end of the loop exps is replaced with the children of each node. When a leaf is found, its references are added to ret.

@matriv matriv requested a review from costin October 29, 2019 16:34
emasab and others added 2 commits October 29, 2019 23:27
code style

Co-Authored-By: Marios Trivyzas <matriv@gmail.com>
Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for addressing the comments @emasab
I left a few more, minor, regarding code formatting.

emasab and others added 2 commits October 30, 2019 18:57
code style

Co-Authored-By: Marios Trivyzas <matriv@gmail.com>
@emasab
Copy link
Contributor Author

emasab commented Oct 30, 2019

Thx for addressing the comments @emasab
I left a few more, minor, regarding code formatting.

Now I've set up the save action and the eclipseformat :)

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you @emasab for your work, your patience and the effort and time spent on addressing the comments.

@matriv
Copy link
Contributor

matriv commented Oct 30, 2019

@elasticmachine test this please

@matriv matriv merged commit 3c38ea5 into elastic:master Oct 31, 2019
matriv pushed a commit that referenced this pull request Oct 31, 2019
Fix an issue that arises from the use of ExpressionIds as keys in a lookup map
that helps the QueryTranslator to identify the grouping columns. The issue is
that the same expression in different parts of the query (SELECT clause and GROUP BY clause)
ends up with different ExpressionIds so the lookup fails. So, instead of ExpressionIds
use the hashCode() of NamedExpression.

Fixes: #41159
Fixes: #40001
Fixes: #40240
Fixes: #33361
Fixes: #46316
Fixes: #36074
Fixes: #34543
Fixes: #37044

Fixes: #42041
(cherry picked from commit 3c38ea5)
matriv pushed a commit that referenced this pull request Oct 31, 2019
Fix an issue that arises from the use of ExpressionIds as keys in a lookup map
that helps the QueryTranslator to identify the grouping columns. The issue is
that the same expression in different parts of the query (SELECT clause and GROUP BY clause)
ends up with different ExpressionIds so the lookup fails. So, instead of ExpressionIds
use the hashCode() of NamedExpression.

Fixes: #41159
Fixes: #40001
Fixes: #40240
Fixes: #33361
Fixes: #46316
Fixes: #36074
Fixes: #34543
Fixes: #37044

Fixes: #42041
(cherry picked from commit 3c38ea5)
matriv pushed a commit that referenced this pull request Oct 31, 2019
Fix an issue that arises from the use of ExpressionIds as keys in a lookup map
that helps the QueryTranslator to identify the grouping columns. The issue is
that the same expression in different parts of the query (SELECT clause and GROUP BY clause)
ends up with different ExpressionIds so the lookup fails. So, instead of ExpressionIds
use the hashCode() of NamedExpression.

Fixes: #41159
Fixes: #40001
Fixes: #40240
Fixes: #33361
Fixes: #46316
Fixes: #36074
Fixes: #34543
Fixes: #37044

Fixes: #42041
(cherry picked from commit 3c38ea5)
matriv pushed a commit that referenced this pull request Oct 31, 2019
Fix an issue that arises from the use of ExpressionIds as keys in a lookup map
that helps the QueryTranslator to identify the grouping columns. The issue is
that the same expression in different parts of the query (SELECT clause and GROUP BY clause)
ends up with different ExpressionIds so the lookup fails. So, instead of ExpressionIds
use the hashCode() of NamedExpression.

Fixes: #41159
Fixes: #40001
Fixes: #40240
Fixes: #33361
Fixes: #46316
Fixes: #36074
Fixes: #34543
Fixes: #37044

Fixes: #42041
(cherry picked from commit 3c38ea5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment