[SPARK-35231][SQL] logical.Range override maxRowsPerPartition #32350

zhengruifeng · 2021-04-26T11:00:23Z

What changes were proposed in this pull request?

When numSlices is available, logical.Range should compute an exact maxRowsPerPartition.

Why are the changes needed?

maxRowsPerPartition is used in the optimizer, so we should provide an exact value whenever possible.
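As a rough illustration of what an exact bound could look like, here is a minimal standalone sketch. It is not the code in this patch: `RangeBounds`, `numElements`, and `maxRowsPerPartition` are hypothetical helpers, and the sketch assumes the elements are spread as evenly as possible over the slices, so the largest partition holds the ceiling of numElements / numSlices rows.

```scala
// Hypothetical helper for illustration; it does not use or mirror Spark's classes.
object RangeBounds {

  /** Number of elements in `start until end by step` (0 if the range is empty). */
  def numElements(start: Long, end: Long, step: Long): BigInt = {
    require(step != 0, "step must be non-zero")
    val span = BigInt(end) - BigInt(start)
    val absStep = BigInt(step).abs
    // The range is empty when it has zero width or when step points away from end.
    if (span.signum != BigInt(step).signum) BigInt(0)
    else (span.abs + absStep - 1) / absStep // ceiling division
  }

  /** Assumed upper bound on rows in any one partition: ceil(numElements / numSlices). */
  def maxRowsPerPartition(start: Long, end: Long, step: Long, numSlices: Int): Option[Long] = {
    require(numSlices > 0, "numSlices must be positive")
    val slices = BigInt(numSlices)
    val perPartition = (numElements(start, end, step) + slices - 1) / slices
    // Only report a bound that fits in a Long.
    if (perPartition.isValidLong) Some(perPartition.toLong) else None
  }
}
```

BigInt is used so that extreme start and end values cannot overflow while the element count is computed.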

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites.

init

SparkQA · 2021-04-26T12:25:53Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42476/

SparkQA · 2021-04-26T12:25:55Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42476/

SparkQA · 2021-04-26T16:11:45Z

Test build #137955 has finished for PR 32350 at commit f82b0ab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-04-27T02:13:05Z

cc @wangyum FYI

wangyum · 2021-04-27T12:19:12Z

> existing testsuites

Do we have tests to cover this change?

zhengruifeng · 2021-04-28T01:53:40Z

@wangyum No, there is no test covering maxRowsPerPartition. There are some tests in CombiningLimitsSuite that check maxRows; should I add a test there?

maropu · 2021-04-28T08:44:46Z

Yea, please add tests in CombiningLimitsSuite. The fix itself looks fine.

zhengruifeng · 2021-04-29T02:20:35Z

Adding a similar test in CombiningLimitsSuite requires some additional changes. I'm not sure whether I should switch to a simpler test instead, such as:

scala> spark.range(0, 100, 1, 3).rdd.mapPartitions(iter => Iterator(iter.size)).max == 34
res0: Boolean = true
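The value 34 in the check above can be reproduced without a Spark session. The sketch below rests on an assumption that is not stated in this thread: partition i covers the element indices from floor(i * n / k) up to, but not including, floor((i + 1) * n / k), where n is the element count and k is numSlices. `SplitSizes` and `partitionSizes` are hypothetical names.

```scala
// Illustration only; the split rule is an assumption, not taken from this PR.
object SplitSizes {

  /** Sizes of the k partitions when n elements are split at floor(i * n / k). */
  def partitionSizes(numElements: Long, numSlices: Int): Seq[Long] = {
    require(numElements >= 0, "numElements must be non-negative")
    require(numSlices > 0, "numSlices must be positive")
    val n = BigInt(numElements) // BigInt avoids overflow in i * n
    val k = BigInt(numSlices)
    (0 until numSlices).map { i =>
      val lower = n * BigInt(i) / k
      val upper = n * BigInt(i + 1) / k
      (upper - lower).toLong
    }
  }
}
```

Under that assumption, 100 elements in 3 slices give partitions of 33, 33, and 34 rows, and 4 slices give 25 rows each.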

zhengruifeng · 2021-04-29T02:29:50Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

    extends OrderPreservingUnaryNode {
  override def output: Seq[Attribute] = projectList.map(_.toAttribute)
  override def maxRows: Option[Long] = child.maxRows
+  override def maxRowsPerPartition: Option[Long] = child.maxRowsPerPartition


This override is needed for the added test.

SparkQA · 2021-04-29T03:22:48Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42577/

SparkQA · 2021-04-29T03:22:49Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42577/

SparkQA · 2021-04-29T04:20:11Z

Test build #138058 has finished for PR 32350 at commit 0a69629.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-29T07:28:20Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42591/

SparkQA · 2021-04-29T07:35:18Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42591/

SparkQA · 2021-04-29T11:04:54Z

Test build #138071 has finished for PR 32350 at commit 30db1c2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu

Please update the PR and its description. I feel that what they say differs from what this PR now does.

maropu · 2021-05-01T22:21:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

      child
    case GlobalLimit(l, child) if canEliminate(l, child) =>
      child
+    case LocalLimit(l, child) if !plan.isStreaming && canEliminateLocalLimit(l, child) =>


Is it currently impossible for a user's query to reach this optimization path?

In the streaming case, can maxRowsPerPartition be set? (Do we need the !plan.isStreaming condition here?)

zhengruifeng · 2021-05-06

> Is it currently impossible for a user's query to reach this optimization path?

An end user's query should not reach this path, I think. This path exists only so that a similar test can be added in CombiningLimitsSuite.

> In the streaming case, can maxRowsPerPartition be set? (Do we need the !plan.isStreaming condition here?)

Without this condition, org.apache.spark.sql.streaming.StreamSuite.SPARK-30657: streaming limit optimization from StreamingLocalLimitExec to LocalLimitExec fails.
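To make the guard under discussion concrete, here is a toy model of the idea: a local limit is redundant when its child can never produce more rows per partition than the limit allows, and the rewrite is skipped for streaming plans. Everything here (`ToyPlan`, `ToySource`, `ToyLocalLimit`, `ToyOptimizer`) is hypothetical and only mimics the shape of the diff above; the body of canEliminateLocalLimit is not shown in this thread, so the comparison used below is an assumption.

```scala
// Toy model for illustration only; these are not Spark's classes.
sealed trait ToyPlan {
  def maxRowsPerPartition: Option[Long]
  def isStreaming: Boolean
}

// A leaf that may or may not know its per-partition row bound.
final case class ToySource(maxRowsPerPartition: Option[Long], isStreaming: Boolean = false)
    extends ToyPlan

// A per-partition limit; its own bound is the smaller of the limit and the child's bound.
final case class ToyLocalLimit(limit: Long, child: ToyPlan) extends ToyPlan {
  override def maxRowsPerPartition: Option[Long] =
    Some(child.maxRowsPerPartition.fold(limit)(m => math.min(m, limit)))
  override def isStreaming: Boolean = child.isStreaming
}

object ToyOptimizer {
  /** Drops a local limit when the child's known bound already satisfies it (assumed rule). */
  def eliminateLocalLimit(plan: ToyPlan): ToyPlan = plan match {
    case ToyLocalLimit(limit, child)
        if !plan.isStreaming && child.maxRowsPerPartition.exists(_ <= limit) =>
      eliminateLocalLimit(child)
    case ToyLocalLimit(limit, child) =>
      ToyLocalLimit(limit, eliminateLocalLimit(child))
    case other => other
  }
}
```

In this model, a limit of 34 over a source bounded at 34 rows per partition is removed, while a limit of 10 over the same source, or any limit over a streaming source, is kept.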

maropu · 2021-05-01T22:23:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala


      def limit(limitExpr: Expression): LogicalPlan = Limit(limitExpr, logicalPlan)

+      def localLimit(limitExpr: Expression): LogicalPlan = LocalLimit(limitExpr, logicalPlan)


Since this is used only once now, could you use LocalLimit directly in the test?

wangyum · 2021-05-02T02:05:41Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CombiningLimitsSuite.scala

+    checkPlanAndMaxRowsPerPartition(
+      Range(0, 100, 1, 3).select().localLimit(34),
+      Range(0, 100, 1, 3).select(),
+      34
+    )


Could we make the test simpler? For example:

assert(Range(0, 100, 1, 3).maxRowsPerPartition === Some(34))
assert(Range(0, 100, 1, 4).maxRowsPerPartition === Some(25))
assert(Range(0, 100, 1, 3).select('id).maxRowsPerPartition === Some(34))

zhengruifeng · 2021-05-06T02:55:15Z

@wangyum @maropu Thanks for reviewing!
I think I made this PR too complex, and I will follow @wangyum's comment and use a simpler test suite.

SparkQA · 2021-05-06T07:42:02Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42714/

SparkQA · 2021-05-06T11:33:31Z

Test build #138193 has finished for PR 32350 at commit a3e26c7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu

cc: @wangyum

maropu · 2021-05-09T12:45:28Z

Thank you, @zhengruifeng . Merged to master.

zhengruifeng · 2021-05-10T02:07:46Z

Thank you so much!

init (f82b0ab)

github-actions bot added the SQL label Apr 26, 2021

add test (0a69629)

zhengruifeng commented Apr 29, 2021

fix test (30db1c2)

maropu reviewed May 1, 2021

wangyum reviewed May 2, 2021

use a simple test (a3e26c7)

maropu approved these changes May 8, 2021

wangyum approved these changes May 9, 2021

maropu closed this in 620f072 May 9, 2021

zhengruifeng deleted the range_maxRowsPerPartition branch May 10, 2021 02:07
