
[SPARK-18089][SQL] Remove shuffle codes in CollectLimitExec #15596

Closed
viirya wants to merge 8 commits from the refactor-collectlimit branch

Conversation


@viirya viirya commented Oct 22, 2016

What changes were proposed in this pull request?

Currently, CollectLimitExec is an operator used when the logical Limit is the last operator in a logical plan. In fact, the job of CollectLimitExec is not different from that of GlobalLimitExec. We can do a little refactoring to GlobalLimitExec and replace CollectLimitExec with it.

How was this patch tested?

Jenkins tests.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.


viirya commented Oct 22, 2016

cc @rxin @yhuai Could you take a quick look to see if this direction is OK with you? Thanks!

@@ -39,7 +39,7 @@
   protected int partitionIndex = -1;

   public boolean hasNext() throws IOException {
-    if (currentRows.isEmpty()) {
+    if (!shouldStop()) {
Member Author

shouldStop in whole-stage codegen can be overridden to provide a custom stop condition other than just currentRows.isEmpty().

For example, in limit, we stop iterator processing early once the limit is reached.

Without this change, this PR fails one test in SQLQuerySuite: "SPARK-17515: CollectLimit.execute() should perform per-partition limits", because LocalLimit would not stop immediately but only in the next round after reaching the limit.

@@ -83,9 +83,8 @@ trait BaseLimitExec extends UnaryExecNode with CodegenSupport {
s"""
| if ($countTerm < $limit) {
| $countTerm += 1;
| if ($countTerm == $limit) $stopEarly = true;
@viirya viirya Oct 22, 2016

Set the stop-early flag to true once the limit is reached, so we won't step to the next element in the iterator. Otherwise, we would still fetch the next element from the iterator; so if the limit is 1, we would pull 2 elements.

Note: this doesn't cause a real problem, because an if guards the element processing. But it fails the test in SQLQuerySuite, because that test uses an accumulator to count the elements.
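
To make the control flow concrete, here is a plain-Scala analogue of the behaviour described above (an illustrative sketch only; LimitedIterator and stopEarly are made-up names, not the actual generated code): the flag is flipped as soon as the limit is reached, so the next hasNext call reports exhaustion without pulling another element.

// Illustrative analogue of the generated limit loop; not Spark source code.
class LimitedIterator[T](underlying: Iterator[T], limit: Int) extends Iterator[T] {
  private var count = 0
  private var stopEarly = false // plays the role of the $stopEarly flag above

  // Analogue of the overridden shouldStop(): once the limit is hit, report
  // exhaustion without touching the underlying iterator again.
  override def hasNext: Boolean = !stopEarly && underlying.hasNext

  override def next(): T = {
    val elem = underlying.next()
    count += 1
    if (count == limit) stopEarly = true // flip the flag before the next hasNext call
    elem
  }
}

// With limit = 1, only one element is ever pulled from the source iterator,
// which is what the accumulator-based SQLQuerySuite test checks.
val pulled = scala.collection.mutable.ArrayBuffer.empty[Int]
val source = Iterator(1, 2, 3).map { x => pulled += x; x }
new LimitedIterator(source, limit = 1).toList // List(1)
assert(pulled == Seq(1))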


viirya commented Oct 22, 2016

BTW, we can see from the physical plan whether an exchange is added for CollectLimit, e.g.,

CollectLimit 1
+- Exchange SinglePartition
  +- *LocalLimit 1
     +- *HashAggregate(keys=[str#227], functions=[count(1)], output=[str#227, count(1)#235L])
        +- Exchange hashpartitioning(str#227, 5)
           +- *HashAggregate(keys=[str#227], functions=[partial_count(1)], output=[str#227, count#241L])
              +- *Project [str#227]
                 +- *BroadcastHashJoin [str#227], [str#233], Inner, BuildRight
                    :- *Project [_2#224 AS str#227]
                    :  +- *Filter isnotnull(_2#224)
                    :     +- LocalTableScan [_1#223, _2#224]
                    +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
                       +- *Project [_2#224 AS str#233]
                          +- *Filter isnotnull(_2#224)
                             +- LocalTableScan [_1#223, _2#224]


SparkQA commented Oct 22, 2016

Test build #67371 has finished for PR 15596 at commit 3d24f79.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

protected override def doExecute(): RDD[InternalRow] = {
val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
Contributor

We are removing an optimization here, right? We can greatly reduce the number of shuffled records by applying the limit before anything gets shuffled.

Member Author

I use a LocalLimit to do this optimization. I think we should use existing physical plans as much as possible, instead of RDD manipulation.

Member Author

One advantage is that LocalLimit supports whole-stage codegen. We can also easily see this optimization from the physical plan.
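
To make the comparison concrete, a rough sketch of the per-partition truncation being discussed (hedged: perPartitionLimit is a made-up name, and the public mapPartitions stands in for the internal mapPartitionsInternal used in the removed code):

import org.apache.spark.rdd.RDD

// The removed doExecute applied the limit directly on the RDD, roughly like this;
// the PR instead plans a LocalLimit node below the exchange, which performs the same
// per-partition truncation but is visible in the physical plan and supports codegen.
def perPartitionLimit[T](rdd: RDD[T], limit: Int): RDD[T] =
  rdd.mapPartitions(_.take(limit))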

@hvanhovell
Contributor

@JoshRosen could you shed some light on why we are not using the regular EnsureRequirements based code path for CollectLimitExec?


SparkQA commented Oct 24, 2016

Test build #67427 has finished for PR 15596 at commit 6d7095c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya force-pushed the refactor-collectlimit branch from 6d7095c to e8520cf on October 24, 2016 14:47
@viirya viirya force-pushed the refactor-collectlimit branch from e8520cf to 76a3eaf on October 24, 2016 14:48
@@ -661,6 +661,15 @@ case class HashAggregateExec(
""".stripMargin
}

ctx.addNewFunction("releaseResource", s"""
Member Author

We do need to release memory by calling these functions. Because we stop iterating the data early by letting shouldStop return true, the next processNext won't be called and a memory leak would happen. So we wrap these calls in an overridden releaseResource function.

@@ -47,6 +50,22 @@ case class LocalTableScanExec(

private lazy val rdd = sqlContext.sparkContext.parallelize(unsafeRows, numParallelism)

protected override def doProduce(ctx: CodegenContext): String = {
Member Author

Let LocalTableScanExec support whole-stage codegen.

Because CollectLimitExec now supports whole-stage codegen, the test in SQLMetricsSuite:

val df2 = spark.createDataset(Seq(1, 2, 3)).limit(2)
df2.collect()
val metrics2 = df2.queryExecution.executedPlan.collectLeaves().head.metrics
assert(metrics2.contains("numOutputRows"))
assert(metrics2("numOutputRows").value === 2)

will execute the LocalTableScanExec node to get its RDD. An InputAdapter would then connect it to CollectLimitExec's whole-stage-codegen node, so it would output all 3 rows of the local table.

Adding this whole-stage codegen support seems straightforward, so I add it here to pass the tests.
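
For context, the doProduce of a simple row-producing leaf typically looks roughly like the sketch below (hedged: this is modeled on Spark's InputAdapter, not the exact code added in this PR). It iterates the input rows, hands each one to the parent via consume, and checks shouldStop() so a downstream limit can cut the scan short.

// Sketch only; the real method in this PR lives inside LocalTableScanExec.
protected override def doProduce(ctx: CodegenContext): String = {
  // The RDD of rows is passed in as inputs[0] by WholeStageCodegenExec.
  val input = ctx.freshName("input")
  ctx.addMutableState("scala.collection.Iterator", input, s"$input = inputs[0];")
  val row = ctx.freshName("row")
  s"""
     | while ($input.hasNext() && !shouldStop()) {
     |   InternalRow $row = (InternalRow) $input.next();
     |   ${consume(ctx, null, row).trim}
     | }
   """.stripMargin
}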


viirya commented Oct 24, 2016

Hmm, I didn't imagine that such related changes would be needed to remove the shuffle code in CollectLimitExec.

The main change is to modify the whole-stage codegen of Limit. One place in it needs to change so that we iterate exactly the limited number of rows. I left comments above.

This is not a serious issue; if you think this change is too big for the purpose, please let me know. Thanks!


SparkQA commented Oct 24, 2016

Test build #67452 has finished for PR 15596 at commit 76a3eaf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

For a bit of background on CollectLimitExec, note that the intent of this operator is to be able to optimize the special case where a limit is the terminal operator of a plan: in this case, we don't need to perform a shuffle because we can have the driver run multiple jobs which scan an increasingly large portion of the RDD to get the limited items; in a nutshell, the goal is to allow a logic similar to the RDD take() action to stop early without having to compute all partitions of the limited RDD.

In typical operation, we don't necessarily expect CollectLimitExec to appear in the middle of a query plan, so CollectLimitExec.execute() should generally only be called in special cases such as calling .rdd() on a limited RDD then performing further operations on it. This is why I didn't use EnsureRequirements here: if we did, then we'd end up shuffling all limited partitions to a single non-driver partition, then limiting that and collecting to the driver, degrading performance in the case where a limit is the terminal operation in the query plan.
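
To make the take()-style strategy concrete, here is a minimal sketch of the idea (illustrative only; Spark's real executeTake path has more machinery, such as error handling, serialized row transfer, and a tuned scale-up policy): the driver runs jobs over a growing prefix of partitions and stops as soon as enough rows have arrived, so nothing is shuffled and untouched partitions are never computed.

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch of an incremental, shuffle-free take over an RDD (incrementalTake is a made-up name).
def incrementalTake[T: ClassTag](rdd: RDD[T], n: Int): Array[T] = {
  val buf = new ArrayBuffer[T]
  val totalParts = rdd.partitions.length
  var partsScanned = 0
  var numPartsToTry = 1 // start with a single partition
  while (buf.size < n && partsScanned < totalParts) {
    val parts = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
    // Run a job over just these partitions, asking each for at most the rows still missing.
    val results = rdd.sparkContext.runJob(
      rdd, (it: Iterator[T]) => it.take(n - buf.size).toArray, parts)
    results.foreach(buf ++= _)
    partsScanned += parts.size
    numPartsToTry *= 4 // scan a larger batch of partitions in the next round
  }
  buf.take(n).toArray
}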

private val serializer: Serializer = new UnsafeRowSerializer(child.output.size)
override def requiredChildDistribution: List[Distribution] = AllTuples :: Nil
override def executeCollect(): Array[InternalRow] = {
child.collect {
Contributor

I think that I understand the intent here: it looks like the idea is to let EnsureRequirements handle the shuffling, if necessary, then to strip off the shuffling and walk backwards in the DAG to optimize the collect() case. This is pretty clever.

One concern of mine, though, is that there seems to be an implicit assumption on the types of children that this operator can have and I think it would be a good idea to write them down and pattern-match more explicitly. For example, I'm not sure that this strategy is safe in case the child isn't an exchange because then you run the risk of simply dropping any operators that occur post-exchange. I understand that this case can't crop up in the types of plans that we currently generate but it would be good to future-proof by matching explicitly rather than having the less-constrained collect behavior here.
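
A hedged sketch of the kind of explicit matching being suggested (using the Spark 2.x class name ShuffleExchange, later renamed ShuffleExchangeExec; collectSkippingShuffle is a made-up helper name, and the real logic would live inside the limit operator itself): strip the exchange only when the child has exactly the expected shape, and leave anything else untouched so operators above the exchange cannot be dropped silently.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.exchange.ShuffleExchange

// Sketch only: collect up to `limit` rows on the driver, skipping a shuffle that was
// inserted purely to satisfy the AllTuples requirement.
def collectSkippingShuffle(child: SparkPlan, limit: Int): Array[InternalRow] = child match {
  case exchange: ShuffleExchange =>
    // Expected shape: an Exchange over a locally limited child; scan the child incrementally.
    exchange.child.executeTake(limit)
  case other =>
    // Anything else: fall back to a plain bounded collect instead of guessing at the plan shape.
    other.executeTake(limit)
}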

Contributor

Also, one other idea: if you're going to try this strategy, what do you think about putting this logic into GlobalLimitExec and removing the CollectLimitExec planning logic?

Member Author

That seems like a good idea. I thought about the difference between GlobalLimitExec and CollectLimitExec while refactoring this, but decided not to change too much. Looks like I could remove CollectLimitExec.


viirya commented Oct 25, 2016

retest this please.


viirya commented Oct 25, 2016

In typical operation, we don't necessarily expect CollectLimitExec to appear in the middle of a query plan, so CollectLimitExec.execute() should generally only be called in special cases such as calling .rdd() on a limited RDD then performing further operations on it. This is why I didn't use EnsureRequirements here: if we did, then we'd end up shuffling all limited partitions to a single non-driver partition, then limiting that and collecting to the driver, degrading performance in the case where a limit is the terminal operation in the query plan.

Yeah, I think I understand the intent. So in executeCollect, I strip off the shuffling if needed.


SparkQA commented Oct 25, 2016

Test build #67480 has finished for PR 15596 at commit 76a3eaf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SQL] Remove shuffle codes in CollectLimitExec [SPARK-18089][SQL] Remove CollectLimitExec Oct 25, 2016
override def executeCollect(): Array[InternalRow] = child match {
// This happens when the user is collecting results back to the driver, we could skip
// the shuffling and scan increasingly the RDD to get the limited items.
case g: GlobalLimitExec => g.executeCollect()
Contributor

this is really confusing ...

Member Author

Hmm, I am polishing the comment; hope it's helpful. I'd like to hear any suggestions.

Member Author

I still think this is confusing, as you said. Removed it.


SparkQA commented Oct 25, 2016

Test build #67492 has finished for PR 15596 at commit fbf4fd6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 25, 2016

Test build #67498 has finished for PR 15596 at commit 82ebff4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 25, 2016

Test build #67501 has finished for PR 15596 at commit 44c64e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya force-pushed the refactor-collectlimit branch 3 times, most recently from 53a956e to e919f4a on October 26, 2016 04:51

SparkQA commented Oct 26, 2016

Test build #67557 has finished for PR 15596 at commit e919f4a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 26, 2016

Test build #67556 has finished for PR 15596 at commit 110a3e4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya force-pushed the refactor-collectlimit branch from e919f4a to 360752c on October 26, 2016 08:17
@viirya viirya force-pushed the refactor-collectlimit branch from 360752c to 86b4e42 on October 26, 2016 08:28

SparkQA commented Oct 26, 2016

Test build #67569 has finished for PR 15596 at commit 360752c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 26, 2016

Test build #67571 has finished for PR 15596 at commit 86b4e42.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Helper trait which defines methods that are shared by both
* [[LocalLimitExec]] and [[GlobalLimitExec]].
*/
trait BaseLimitExec extends UnaryExecNode with CodegenSupport {
val limit: Int
override def output: Seq[Attribute] = child.output
override def executeCollect(): Array[InternalRow] = child.executeTake(limit)
override def executeTake(n: Int): Array[InternalRow] = child.executeTake(limit)

@pwoody
I think we will want to add an executeToIterator override as well, to make sure we don't cause a shuffle there.

Member Author

Thanks @pwoody! Agreed. But I am now thinking of not replacing CollectLimitExec with GlobalLimitExec; the reason is in my comment below. Let's wait for @JoshRosen's response. If we decide to keep CollectLimitExec, your change at #15614 can be applied then.
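
For reference, the suggested override would look roughly like the fragment below inside BaseLimitExec (a hedged sketch; see pwoody's #15614 for the actual change): bound the driver-side iterator so that converting to a local iterator never needs a shuffle and never pulls more than limit rows.

// Sketch of the suggested override; relies on SparkPlan.executeToIterator, which
// streams rows back to the driver partition by partition.
override def executeToIterator(): Iterator[InternalRow] =
  child.executeToIterator().take(limit)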


viirya commented Oct 27, 2016

@JoshRosen After a few tries, I think replacing CollectLimitExec with GlobalLimitExec is not a good idea.

The main reason is whole-stage codegen. Since GlobalLimitExec supports whole-stage codegen, it gets wrapped in a WholeStageCodegenExec. So when we do df.limit(1).collect(), for example, we call executeCollect() on the WholeStageCodegenExec wrapping GlobalLimitExec.

WholeStageCodegenExec.executeCollect() is actually SparkPlan.executeCollect(). So we would do the shuffle and then retrieve the results. It doesn't harm anything, but it fails a few tests, as the Jenkins results showed.

Of course we could change the tests to fit this, but I don't think that is a necessary or good way to go.

Another workaround is to override WholeStageCodegenExec.executeCollect(), but as @rxin pointed out in a previous comment, that is confusing.

Based on these facts, I think we had better keep CollectLimitExec and just remove its shuffle code, as I did in the initial commit.

What do you think?

@viirya viirya force-pushed the refactor-collectlimit branch from 492106f to 89c0c62 on October 27, 2016 04:23
@viirya viirya changed the title [SPARK-18089][SQL] Remove CollectLimitExec [SPARK-18089][SQL] Remove shuffle codes in CollectLimitExec Oct 27, 2016

SparkQA commented Oct 27, 2016

Test build #67617 has finished for PR 15596 at commit 492106f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CollectLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode


SparkQA commented Oct 27, 2016

Test build #67618 has finished for PR 15596 at commit 89c0c62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CollectLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode


viirya commented Nov 19, 2016

I'll close this now.

@viirya viirya closed this Nov 19, 2016
@viirya viirya deleted the refactor-collectlimit branch December 27, 2023 18:34