Optimize Min/Max using Delta metadata #1525

Closed

Contributor

@felipepessoto felipepessoto commented Dec 17, 2022

Description

Follow-up to #1192, which optimizes COUNT. This PR adds support for MIN/MAX as well.

Fix #2092

How was this patch tested?

Created additional unit tests to cover MIN/MAX.

Does this PR introduce any user-facing changes?

No, only a performance improvement.

@scottsand-db
Collaborator

@felipepessoto just following up on this PR - is it still a WIP?

@felipepessoto
Contributor Author

Yes. I made these changes while the SELECT COUNT PR was in review, and I think I can refine them.

@felipepessoto felipepessoto changed the title [WIP] Optimize Min/Max using Delta stats Optimize Min/Max using Delta stats Jan 28, 2023
@felipepessoto
Contributor Author

@scottsand-db it is ready to review. Thanks

@felipepessoto felipepessoto changed the title Optimize Min/Max using Delta stats Optimize Min/Max using Delta metadata Jan 30, 2023
@felipepessoto felipepessoto force-pushed the improvedatafromstats branch 2 times, most recently from d118b5b to a065457 Compare March 31, 2023 03:27
@felipepessoto
Contributor Author

Hi folks, did you have a chance to review this?
Thanks

@felipepessoto
Contributor Author

@scottsand-db

@felipepessoto
Contributor Author

@vkorukanti, @scottsand-db, do you think we'll be able to complete this before 2.4 release?

@felipepessoto
Contributor Author

felipepessoto commented May 25, 2023

@scottsand-db, @vkorukanti, could you please review this when you have a chance? It would be great to have this in 2.5.

Once it is completed, I'd like to work on other improvements: support for DVs, partitioning, GROUP BY, etc.

@felipepessoto
Contributor Author

@scottsand-db, @vkorukanti, do we still plan to go ahead with these improvements? Let me know and I will rebase the changes.

@scottsand-db
Collaborator

@felipepessoto - thanks for following up. We are super swamped right now getting a few final features ready for next Delta release ... we will follow up when we can!

@felipepessoto
Contributor Author

#1763

@henlue

henlue commented Aug 31, 2023

I'm wondering if this is still on the agenda? I think it would be a wonderful enhancement.

There are many practical use cases where performance improvements on such min/max queries would make a difference. Two examples:

  • when incrementally loading data to a table, often the first step is to query the max timestamp of that table in order to figure out from where to continue loading more data
  • BI tools will query the min/max values of columns to configure the ranges for their filters or slicers

@felipepessoto
Contributor Author

We have some folks asking for more stats-based improvements here and in other issues/PRs. I think it would help in scenarios like the ones @henlue mentioned.

#1192
#1916
#1377

@scottsand-db, @vkorukanti, @dennyglee, what would be the best way to get community feedback about this? Would it be useful to create a new issue and ask people to give it a thumbs up? Is that something maintainers use to prioritize new features?

Thanks

@felipepessoto felipepessoto force-pushed the improvedatafromstats branch 2 times, most recently from 2c76d0a to f798b76 Compare November 22, 2023 12:47
Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
Add column mapping tests using the existing traits.
Add test using partitioned column filter

Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
…on values if all values were found in stats

Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
@felipepessoto felipepessoto force-pushed the improvedatafromstats branch 3 times, most recently from f290bc6 to 9a8feb9 Compare November 22, 2023 13:04
…ax from partitioned columns even when COUNT is not available

Fix style error

Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
c@Count(Seq(Literal(1, _))), Complete, false, None, _) =>
Some(c)
case AggregateExpression(
min@Min(minExpr), Complete, false, None, _) if isSupportedDataType(minExpr.dataType) =>
Collaborator

Can we turn minExpr (and maxExpr) into a pattern match?

object SkippingEligibleColumn {
  // Returns the attribute name and data type.
  def unapply(arg: Expression): Option[(Seq[String], DataType)] = {
    // Here, also check that arg is an AttributeReference,
    // that it is not a nested column,
    // and perform the data type check as well.
  }
}
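
For readers unfamiliar with the pattern, the suggestion above relies on a Scala extractor object. The following is a minimal, self-contained sketch that uses hypothetical stand-in types (`Expr`, `Attr`, `GetField`, `Lit`, `DType`) instead of Spark's Catalyst classes, so it is not the code that was merged. The set of supported types is also an assumption. It only shows how a single `unapply` can bundle the attribute, nesting, and data type checks.

```scala
// Hypothetical stand-ins for Catalyst's Expression/DataType, for illustration only.
sealed trait DType
case object IntT extends DType
case object DateT extends DType
case object StringT extends DType

sealed trait Expr { def dataType: DType }
final case class Attr(name: String, dataType: DType) extends Expr
// Models access to a nested column such as s.x.
final case class GetField(child: Expr, field: String, dataType: DType) extends Expr
final case class Lit(value: Any, dataType: DType) extends Expr

object SkippingEligibleColumn {
  // Assumed set of types whose min/max stats are usable in this sketch.
  private val supported: Set[DType] = Set(IntT, DateT)

  // Returns the column name path and data type for top-level, supported attributes.
  def unapply(arg: Expr): Option[(Seq[String], DType)] = arg match {
    case Attr(name, dt) if supported.contains(dt) => Some((Seq(name), dt))
    case _ => None // nested columns, literals, and unsupported types
  }
}

object ExtractorDemo {
  // The same extractor can be reused wherever an expression is pattern matched.
  def describe(e: Expr): String = e match {
    case SkippingEligibleColumn(path, dt) => s"eligible: ${path.mkString(".")} ($dt)"
    case _ => "not eligible"
  }
}
```

With these definitions, `ExtractorDemo.describe(Attr("id", IntT))` matches the extractor, while a `GetField` or a `StringT` attribute falls through to the default case.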

Collaborator

The same object's unapply can also be used in the PhysicalOperation matching.

.map(x => x._1).toSet

// Creates a tuple with physical name to avoid recalculating it multiple times
val dataColumnsWithStats = dataColumns.map(x => (x, DeltaColumnMapping.getPhysicalName(x)))
Collaborator

One suggestion to simplify the code:

  1. Add a utility method that returns the Column ref for min/max/nullCount/numRecords, for regular or partition columns, from the DataFrame deltaScanGenerator.filesWithStatsForScan. It abstracts away the physical name conversion and the lookup for a partition or data column. If the column is a partition column, it also takes care of casting the string partition value to the appropriate data type.
  2. The next step is to construct an expression from these refs that validates the stats and then returns the min and max. The existing expression you have should work.
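
The "validate the stats, then return the min and max" step can be illustrated without Spark. The sketch below is plain Scala under stated assumptions: `FileStats` is a hypothetical stand-in for one file's statistics for a single integer column, and the validity rules (every file with rows must carry a record count and a null count, files with deletion vectors force a fallback, all-null files contribute no value) are simplifications chosen for illustration rather than a description of the merged implementation. Physical name lookup and partition value casting are left out.

```scala
// Hypothetical per-file statistics for one column (not Delta's actual classes).
final case class FileStats(
    numRecords: Option[Long],
    min: Option[Int],
    max: Option[Int],
    nullCount: Option[Long],
    hasDeletionVector: Boolean = false)

object StatsMinMax {
  // None: stats cannot answer the query, so fall back to a normal scan.
  // Some(None): the answer is known to be NULL (no non-null values exist).
  // Some(Some((min, max))): the answer comes from metadata alone.
  def minMax(files: Seq[FileStats]): Option[Option[(Int, Int)]] = {
    // Files known to hold zero rows contribute nothing and need no stats.
    val withRows = files.filterNot(_.numRecords.contains(0L))

    // Assumed validity rule: counts must be present and no deletion vectors.
    val usable = withRows.forall { f =>
      f.numRecords.isDefined && f.nullCount.isDefined && !f.hasDeletionVector
    }

    if (!usable) None
    else {
      // Files where every value is NULL have no min/max to report.
      val withValues = withRows.filter(f => f.nullCount != f.numRecords)
      val ranges = withValues.map(f => for (lo <- f.min; hi <- f.max) yield (lo, hi))

      if (ranges.exists(_.isEmpty)) None // a file with values lacks min or max
      else if (ranges.isEmpty) Some(None) // empty table or only NULL values
      else {
        val known = ranges.flatten
        Some(Some((known.map(_._1).min, known.map(_._2).max)))
      }
    }
  }
}
```

Returning a nested `Option` keeps "stats are unusable" distinct from "the result is NULL", which matters because MIN/MAX over an empty table is a valid answer rather than a reason to scan.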

Contributor

@lzlfred lzlfred left a comment

LGTM. minor comments.

-table with DVs
-empty table
-table with few AddFiles having zero rows

Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
Contributor

@weiluo-db weiluo-db left a comment

LGTM (pending @vkorukanti 's final pass)!

Collaborator

@vkorukanti vkorukanti left a comment

lgtm pending one comment.

…timization disabled

Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
Collaborator

@vkorukanti vkorukanti left a comment

lgtm

Thank you for contributing this optimization.

@felipepessoto
Contributor Author

Are the Flink tests flaky? The previous build succeeded.

Development

Successfully merging this pull request may close these issues.

[Feature Request][Spark] Optimize Min/Max using Delta metadata
7 participants