-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT #29291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #126778 has finished for PR 29291 at commit
|
|
Test build #126797 has finished for PR 29291 at commit
|
|
cc @cloud-fan |
| // group without filter clause. | ||
| // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a). | ||
| distinctAggs.size > 1 | ||
| distinctAggs.size > 1 || (distinctAggs.size == 1 && aggExpressions.exists(_.filter.isDefined)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove distinctAggs.size == 1, as it's indicarted by distinctAggs.size > 1 || ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If distinctAggs.size == 0 and aggExpressions.exists(_.filter.isDefined), we not need this rewrite.
The normal agg with filter could treated by physical plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't it be distinctAggs.exists(_.filter.isDefined)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I got it.
| // Setup all the filters in distinct aggregate. | ||
| val distinctAggExprs = aggExpressions | ||
| .filter(e => e.isDistinct && e.children.exists(!_.foldable)) | ||
| val distinctAggFilterAttrMap = distinctAggExprs.collect { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: val (distinctAggFilters, distinctAggFilterAttrs, maxCond) = distinctAggExprs.collect(...).unzip3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I want
val distinctAggFilterAttrLookup = distinctAggFilterAttrMap.map { tuple3 =>
tuple3._1 -> tuple3._3.toAttribute
}.toMap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is distinctAggFilters.zip(maxCond.map(_.toAttribute)).toMap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
Show resolved
Hide resolved
|
Test build #126810 has finished for PR 29291 at commit
|
|
retest this please |
|
can you rebase/merge with the master branch to get the github action fix? The jenkin is quite unstable now and we may need to rely on github actions |
OK |
...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
Outdated
Show resolved
Hide resolved
|
Test build #126843 has finished for PR 29291 at commit
|
...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
Show resolved
Hide resolved
| val (distinctAggFilters, distinctAggFilterAttrs, maxConds) = distinctAggExprs.collect { | ||
| case AggregateExpression(_, _, _, filter, _) if filter.isDefined => | ||
| val (e, attr) = expressionAttributePair(filter.get) | ||
| val aggregateExp = AggregateExpression(Max(attr), Partial, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Max(attr).toAggregateExpression(distinct = false)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good!
| SELECT COUNT(DISTINCT id), COUNT(DISTINCT id) FILTER (WHERE date_format(hiredate, "yyyy-MM-dd HH:mm:ss") = "2001-01-01 00:00:00") FROM emp; | ||
| SELECT COUNT(DISTINCT id) FILTER (WHERE hiredate = to_timestamp("2001-01-01 00:00:00")), COUNT(DISTINCT id) FILTER (WHERE hiredate = to_date('2001-01-01 00:00:00')) FROM emp; | ||
| SELECT SUM(salary), COUNT(DISTINCT id), COUNT(DISTINCT id) FILTER (WHERE hiredate = date "2001-01-01") FROM emp; | ||
| SELECT COUNT(DISTINCT 1) FILTER (WHERE a = 1) FROM testData; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also test COUNT(DISTINCT id) FILTER (WHERE true) and COUNT(DISTINCT id) FILTER (WHERE false)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
|
Test build #126857 has finished for PR 29291 at commit
|
|
Test build #126866 has finished for PR 29291 at commit
|
|
Test build #126886 has finished for PR 29291 at commit
|
|
Test build #126953 has finished for PR 29291 at commit
|
|
Test build #126956 has finished for PR 29291 at commit
|
|
retest this please |
|
Test build #126968 has finished for PR 29291 at commit
|
|
retest this please |
|
Test build #126980 has finished for PR 29291 at commit
|
|
retest this please |
|
Test build #126994 has finished for PR 29291 at commit
|
|
thanks, merging to master! |
|
@cloud-fan Thanks for your review and good idea. |
What changes were proposed in this pull request?
This PR is related to #26656.
#26656 only support use FILTER clause on aggregate expression without DISTINCT.
This PR will enhance this feature when one or more DISTINCT aggregate expressions which allows the use of the FILTER clause.
Such as:
Why are the changes needed?
Spark SQL only support use FILTER clause on aggregate expression without DISTINCT.
This PR support Filter expression allows simultaneous use of DISTINCT
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
Exists and new UT