[SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT #29135
Test build #125985 has finished for PR 29135 at commit
Test build #126016 has finished for PR 29135 at commit
 * if ('id > 1) 'cat2 else null,
 * cast('value as bigint),
 * if ('key = "a") cast('value as bigint) else null]
 * output = ['key, '_gen_attr_1, '_gen_attr_2, '_gen_attr_3, '_gen_attr_4])
cat1 is not related to the filter, why do we change its name to _gen_attr_1?
For convenience and uniformity, we always alias the column, even when it has no filter.
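To make the always-alias choice concrete, here is a minimal Scala sketch on plain collections. It is not the Catalyst rule itself. The `Input` and `Projected` case classes and the sample rows are hypothetical, and the sketch assumes the part of the projection list not shown in the excerpt above starts with 'key and 'cat1. `None` stands in for SQL NULL.

```scala
object AliasSketch {
  // Hypothetical input row; column names follow the scaladoc excerpt under review.
  final case class Input(key: String, id: Int, cat1: String, cat2: String, value: Int)

  // Every distinct-aggregate child gets a generated alias, filtered or not.
  final case class Projected(
      key: String,
      genAttr1: Option[String], // cat1: no filter, aliased anyway
      genAttr2: Option[String], // if (id > 1) cat2 else null
      genAttr3: Option[Long],   // cast(value as bigint): no filter, aliased anyway
      genAttr4: Option[Long])   // if (key = "a") cast(value as bigint) else null

  def project(r: Input): Projected = Projected(
    key = r.key,
    genAttr1 = Some(r.cat1),
    genAttr2 = if (r.id > 1) Some(r.cat2) else None,
    genAttr3 = Some(r.value.toLong),
    genAttr4 = if (r.key == "a") Some(r.value.toLong) else None)

  // Illustrative data only.
  val rows: Seq[Input] = Seq(
    Input("a", 1, "x", "p", 10),
    Input("a", 2, "x", "q", 10),
    Input("b", 3, "y", "q", 20))

  val projected: Seq[Projected] = rows.map(project)

  def main(args: Array[String]): Unit = projected.foreach(println)
}
```

In this sketch the later aggregation only ever sees generated attributes, so filtered and unfiltered children can be handled by the same code path.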
...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
// aggregate function. For example,
// 1.for AVG(DISTINCT value) GROUP BY key, the grouping expressions will be [key, value].
// 2.for AVG (DISTINCT value) Filter (WHERE age > 20) GROUP BY key, the grouping expression
// will be [key, value, age].
AVG (DISTINCT value) Filter (WHERE age > 20) will be rewritten as AVG (DISTINCT _gen_attr) Filter (WHERE _gen_attr is not null). So shouldn't we group by key and _gen_attr here?
Thanks for the reminder.
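The grouping question in this thread can be sketched on plain Scala collections. This is only an illustration under stated assumptions, not the code in RewriteDistinctAggregates: `Person`, `genAttr`, `avgDistinctFiltered`, `reference`, and the sample rows are all hypothetical names and data. The filter is folded into a generated column first, rows are de-duplicated on (key, _gen_attr), and the non-null values are then averaged per key.

```scala
object DistinctFilterSketch {
  // Hypothetical row type for AVG(DISTINCT value) FILTER (WHERE age > 20) GROUP BY key.
  final case class Person(key: String, value: Int, age: Int)

  // Step 1 (the projection): fold the FILTER predicate into a generated column.
  // None stands in for SQL NULL.
  def genAttr(p: Person): Option[Int] = if (p.age > 20) Some(p.value) else None

  // Step 2: de-duplicate on (key, _gen_attr), then average the non-null values per key.
  def avgDistinctFiltered(rows: Seq[Person]): Map[String, Option[Double]] =
    rows
      .map(p => (p.key, genAttr(p)))
      .distinct
      .groupBy(_._1)
      .map { case (key, group) =>
        val vs = group.flatMap(_._2.toList)
        key -> (if (vs.isEmpty) None else Some(vs.sum.toDouble / vs.size))
      }

  // Reference semantics: apply the filter, then AVG(DISTINCT value) per key.
  def reference(rows: Seq[Person]): Map[String, Option[Double]] =
    rows.groupBy(_.key).map { case (key, group) =>
      val vs = group.filter(_.age > 20).map(_.value).distinct
      key -> (if (vs.isEmpty) None else Some(vs.sum.toDouble / vs.size))
    }

  // Illustrative data only.
  val people: Seq[Person] = Seq(
    Person("a", 10, 25),
    Person("a", 10, 30), // same value, different age: must be counted once
    Person("a", 40, 18), // fails the filter
    Person("a", 20, 22),
    Person("b", 5, 19))  // every row fails the filter, so the average is NULL

  def main(args: Array[String]): Unit = {
    println(avgDistinctFiltered(people))
    println(reference(people))
  }
}
```

In this sketch, de-duplicating on (key, value, age) instead would keep both of the first two rows for key "a", because their ages differ, and the value 10 would then be averaged twice.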
Test build #126036 has finished for PR 29135 at commit
Test build #126050 has finished for PR 29135 at commit
Test build #126052 has finished for PR 29135 at commit
retest this please
Test build #126059 has finished for PR 29135 at commit
retest this please
Test build #126082 has finished for PR 29135 at commit
Test build #126084 has finished for PR 29135 at commit
Test build #126088 has finished for PR 29135 at commit
Test build #126089 has finished for PR 29135 at commit
retest this please
Test build #126100 has finished for PR 29135 at commit
retest this please
Test build #126109 has finished for PR 29135 at commit
What changes were proposed in this pull request?
This PR is related to #26656.
#26656 only supports the FILTER clause on aggregate expressions without DISTINCT.
This PR extends that feature so that one or more DISTINCT aggregate expressions can also use the FILTER clause.
Such as:
Note:
In #26656, we use AggregationIterator to evaluate the filter conditions of aggregate expressions. This is good because the filter can be evaluated locally in the first aggregate.
But AggregationIterator only supports a single DISTINCT aggregate with a FILTER clause.
So this PR uses Project to project the filter clause as a new generated attribute (e.g. _gen_attr_0) and ensure that it is evaluated locally.
Why are the changes needed?
Spark SQL only supports the FILTER clause on aggregate expressions without DISTINCT.
This PR allows the FILTER clause to be used together with DISTINCT.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing and new UTs.