-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Min/Max using Delta metadata #1525
Optimize Min/Max using Delta metadata #1525
Conversation
@felipepessoto just following up on this PR - is it still a WIP? |
Yes, I made these changes while the SELECT Count was in review, I think I can refine this. |
a3ace44
to
7f665e5
Compare
@scottsand-db it is ready to review. Thanks |
core/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Outdated
Show resolved
Hide resolved
7583cf2
to
df2f58c
Compare
core/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Outdated
Show resolved
Hide resolved
d118b5b
to
a065457
Compare
Hi folks, did you have a chance to review this? |
@vkorukanti, @scottsand-db, do you think we'll be able to complete this before 2.4 release? |
@scottsand-db, @vkorukanti if you have a chance to review this please. Would be great to have this in 2.5. And once it is completed I'd like to work on other improvements: support to DV, partitioning, group by, etc |
@scottsand-db, @vkorukanti, do we still plan to go ahead with these improvements? Let me know to rebase the changes. |
@felipepessoto - thanks for following up. We are super swamped right now getting a few final features ready for next Delta release ... we will follow up when we can! |
I'm wondering if this is still on the agenda? I think it would be a wonderful enhancement. There are many practical use cases where performance improvements on such min/max queries would make a difference. Two examples:
|
We have some folks asking for more improvements using stats here and in other issues/PRs. I think it would help in a couple of scenarios like @henlue mentioned. @scottsand-db, @vkorukanti, @dennyglee what would be the best way get community feedback about this? Creating a new issue and asking people to thumbs up would be useful? Is it something maintainers use to prioritize the new features? Thanks |
2c76d0a
to
f798b76
Compare
Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
Add column mapping tests using the existing traits. Add test using partitioned column filter Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
…on values if all values were found in stats Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
f290bc6
to
9a8feb9
Compare
…ax from partitioned columns even when COUNT is not available Fix style error Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
9a8feb9
to
2c70c6b
Compare
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Outdated
Show resolved
Hide resolved
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Outdated
Show resolved
Hide resolved
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Outdated
Show resolved
Hide resolved
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Outdated
Show resolved
Hide resolved
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Show resolved
Hide resolved
c@Count(Seq(Literal(1, _))), Complete, false, None, _) => | ||
Some(c) | ||
case AggregateExpression( | ||
min@Min(minExpr), Complete, false, None, _) if isSupportedDataType(minExpr.dataType) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make the minExpr
(also maxExpr) into a match
object SkippingEligibleColumn {
// returns attribute name and data type
def unapply(arg: Expression): Option[(Seq[String], DataType)] = {
// Here also check whether the arg is an AtributeReference or not.
// not a nested column
// and even the data type check as well.
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same object.unapply can be used in PhysicalOperation
matching.
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Outdated
Show resolved
Hide resolved
.map(x => x._1).toSet | ||
|
||
// Creates a tuple with physical name to avoid recalculating it multiple times | ||
val dataColumnsWithStats = dataColumns.map(x => (x, DeltaColumnMapping.getPhysicalName(x))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one suggestion to simplify the code:
- Add a utility method to get the
Column
ref for min/max/nullCount/numRecords for regular or partition columns from the DataframedeltaScanGenerator.filesWithStatsForScan
. It abstracts out the physical name conversion and the lookup for partition or data column. If the column is a partition column, then it also takes care of type-casting the string partition value to the appropriate data type value. - The next step is to construct the expression using these refs that validate the stats and then return the min and max. the existing expression you have should work.
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Show resolved
Hide resolved
spark/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. minor comments.
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Show resolved
Hide resolved
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Outdated
Show resolved
Hide resolved
-table with DVs -empty table -table with few AddFiles having zero rows Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
262ca37
to
ce394fb
Compare
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Outdated
Show resolved
Hide resolved
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Show resolved
Hide resolved
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Outdated
Show resolved
Hide resolved
spark/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala
Show resolved
Hide resolved
Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Show resolved
Hide resolved
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Show resolved
Hide resolved
spark/src/main/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuery.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (pending @vkorukanti 's final pass)!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm pending one comment.
spark/src/test/scala/org/apache/spark/sql/delta/perf/OptimizeMetadataOnlyDeltaQuerySuite.scala
Outdated
Show resolved
Hide resolved
…timization disabled Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thank you for contributing this optimizaiton.
Flink tests are flaky? Previous build succeeded |
Description
Follow up of #1192, which optimizes COUNT. This PR adds support for MIN/MAX as well.
Fix #2092
How was this patch tested?
Created additional unit tests to cover MIN/MAX.
Does this PR introduce any user-facing changes?
Only performance improvement