
[SPARK-18089][SQL] Remove shuffle codes in CollectLimitExec #15596

Closed
viirya wants to merge 8 commits from the refactor-collectlimit branch

Conversation


@viirya viirya commented Oct 22, 2016

What changes were proposed in this pull request?

Currently, CollectLimitExec is an operator used when the logical Limit is the last operator in a logical plan. In fact, the job of CollectLimitExec is not different from that of GlobalLimitExec. We can do a little refactoring to GlobalLimitExec and replace CollectLimitExec with it.

How was this patch tested?

Jenkins tests.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.


viirya commented Oct 22, 2016

cc @rxin @yhuai Could you take a quick look to see if this direction is OK with you? Thanks!

@@ -39,7 +39,7 @@
   protected int partitionIndex = -1;

   public boolean hasNext() throws IOException {
-    if (currentRows.isEmpty()) {
+    if (!shouldStop()) {
Member Author

shouldStop in whole-stage codegen can be overridden to provide a custom stop condition other than just currentRows.isEmpty().

For example, in limit, we stop iterator processing early once the limit is reached.

Without this change, this PR fails one test in SQLQuerySuite: "SPARK-17515: CollectLimit.execute() should perform per-partition limits", because LocalLimit would not stop immediately but only in the next round after reaching the limit.

@@ -83,9 +83,8 @@ trait BaseLimitExec extends UnaryExecNode with CodegenSupport {
s"""
| if ($countTerm < $limit) {
| $countTerm += 1;
| if ($countTerm == $limit) $stopEarly = true;
@viirya viirya Oct 22, 2016

Set the stop-early flag to true once the limit is reached, so we won't step to the next element in the iterator. Otherwise, we would still fetch the next element from the iterator; so if the limit is 1, we would pull 2 elements.

Note: this doesn't cause a real problem, because an if guards the element processing. But it fails the test in SQLQuerySuite, because that test uses an accumulator to count the elements.
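
To make the control flow concrete, here is a plain-Scala analogue of the behaviour described above (an illustrative sketch only; LimitedIterator and stopEarly are made-up names, not the actual generated code): the flag is flipped as soon as the limit is reached, so the next hasNext call reports exhaustion without pulling another element.

// Illustrative analogue of the generated limit loop; not Spark source code.
class LimitedIterator[T](underlying: Iterator[T], limit: Int) extends Iterator[T] {
  private var count = 0
  private var stopEarly = false // plays the role of the $stopEarly flag above

  // Analogue of the overridden shouldStop(): once the limit is hit, report
  // exhaustion without touching the underlying iterator again.
  override def hasNext: Boolean = !stopEarly && underlying.hasNext

  override def next(): T = {
    val elem = underlying.next()
    count += 1
    if (count == limit) stopEarly = true // flip the flag before the next hasNext call
    elem
  }
}

// With limit = 1, only one element is ever pulled from the source iterator,
// which is what the accumulator-based SQLQuerySuite test checks.
val pulled = scala.collection.mutable.ArrayBuffer.empty[Int]
val source = Iterator(1, 2, 3).map { x => pulled += x; x }
new LimitedIterator(source, limit = 1).toList // List(1)
assert(pulled == Seq(1))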


viirya commented Oct 22, 2016

BTW, we can see from the physical plan whether an exchange is added for CollectLimit, e.g.,

CollectLimit 1
+- Exchange SinglePartition
  +- *LocalLimit 1
     +- *HashAggregate(keys=[str#227], functions=[count(1)], output=[str#227, count(1)#235L])
        +- Exchange hashpartitioning(str#227, 5)
           +- *HashAggregate(keys=[str#227], functions=[partial_count(1)], output=[str#227, count#241L])
              +- *Project [str#227]
                 +- *BroadcastHashJoin [str#227], [str#233], Inner, BuildRight
                    :- *Project [_2#224 AS str#227]
                    :  +- *Filter isnotnull(_2#224)
                    :     +- LocalTableScan [_1#223, _2#224]
                    +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
                       +- *Project [_2#224 AS str#233]
                          +- *Filter isnotnull(_2#224)
                             +- LocalTableScan [_1#223, _2#224]


SparkQA commented Oct 22, 2016

Test build #67371 has finished for PR 15596 at commit 3d24f79.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

protected override def doExecute(): RDD[InternalRow] = {
val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
Contributor

We are removing an optimization here, right? We can greatly reduce the number of shuffled records by applying the limit before anything gets shuffled.

Member Author

I use a LocalLimit to do this optimization. I think we should use existing physical plans as much as possible, instead of RDD manipulation.

Member Author

One advantage is that LocalLimit supports whole-stage codegen. We can also easily see this optimization from the physical plan.
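
To make the comparison concrete, a rough sketch of the per-partition truncation being discussed (hedged: perPartitionLimit is a made-up name, and the public mapPartitions stands in for the internal mapPartitionsInternal used in the removed code):

import org.apache.spark.rdd.RDD

// The removed doExecute applied the limit directly on the RDD, roughly like this;
// the PR instead plans a LocalLimit node below the exchange, which performs the same
// per-partition truncation but is visible in the physical plan and supports codegen.
def perPartitionLimit[T](rdd: RDD[T], limit: Int): RDD[T] =
  rdd.mapPartitions(_.take(limit))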

@hvanhovell
Contributor

@JoshRosen could you shed some light on why we are not using the regular EnsureRequirements based code path for CollectLimitExec?


SparkQA commented Oct 24, 2016

Test build #67427 has finished for PR 15596 at commit 6d7095c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya force-pushed the refactor-collectlimit branch from 6d7095c to e8520cf on October 24, 2016 14:47
@viirya viirya force-pushed the refactor-collectlimit branch from e8520cf to 76a3eaf on October 24, 2016 14:48
@@ -661,6 +661,15 @@ case class HashAggregateExec(
""".stripMargin
}

ctx.addNewFunction("releaseResource", s"""
Member Author

We do need to release memory by calling these functions. Because we stop iterating the data early by letting shouldStop return true, the next processNext won't be called and a memory leak would happen. So we wrap these calls in an overridden releaseResource function.

@@ -47,6 +50,22 @@ case class LocalTableScanExec(

private lazy val rdd = sqlContext.sparkContext.parallelize(unsafeRows, numParallelism)

protected override def doProduce(ctx: CodegenContext): String = {
Member Author

Let LocalTableScanExec support whole-stage codegen.

Because CollectLimitExec now supports whole-stage codegen, the test in SQLMetricsSuite:

val df2 = spark.createDataset(Seq(1, 2, 3)).limit(2)
df2.collect()
val metrics2 = df2.queryExecution.executedPlan.collectLeaves().head.metrics
assert(metrics2.contains("numOutputRows"))
assert(metrics2("numOutputRows").value === 2)

will execute the LocalTableScanExec node to get its RDD. An InputAdapter would then connect it to CollectLimitExec's whole-stage-codegen node, so it would output all 3 rows of the local table.

Adding this whole-stage codegen support seems straightforward, so I add it here to pass the tests.
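
For context, the doProduce of a simple row-producing leaf typically looks roughly like the sketch below (hedged: this is modeled on Spark's InputAdapter, not the exact code added in this PR). It iterates the input rows, hands each one to the parent via consume, and checks shouldStop() so a downstream limit can cut the scan short.

// Sketch only; the real method in this PR lives inside LocalTableScanExec.
protected override def doProduce(ctx: CodegenContext): String = {
  // The RDD of rows is passed in as inputs[0] by WholeStageCodegenExec.
  val input = ctx.freshName("input")
  ctx.addMutableState("scala.collection.Iterator", input, s"$input = inputs[0];")
  val row = ctx.freshName("row")
  s"""
     | while ($input.hasNext() && !shouldStop()) {
     |   InternalRow $row = (InternalRow) $input.next();
     |   ${consume(ctx, null, row).trim}
     | }
   """.stripMargin
}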


viirya commented Oct 24, 2016

Hmm, I didn't imagine that such related changes would be needed to remove the shuffle code in CollectLimitExec.

The main change is to modify the whole-stage codegen of Limit. One place in it needs to change so that we iterate exactly the limited number of rows. I left comments above.

This is not a serious issue; if you think this change is too big for the purpose, please let me know. Thanks!


SparkQA commented Oct 24, 2016

Test build #67452 has finished for PR 15596 at commit 76a3eaf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

For a bit of background on CollectLimitExec, note that the intent of this operator is to be able to optimize the special case where a limit is the terminal operator of a plan: in this case, we don't need to perform a shuffle because we can have the driver run multiple jobs which scan an increasingly large portion of the RDD to get the limited items; in a nutshell, the goal is to allow a logic similar to the RDD take() action to stop early without having to compute all partitions of the limited RDD.

In typical operation, we don't necessarily expect CollectLimitExec to appear in the middle of a query plan, so CollectLimitExec.execute() should generally only be called in special cases such as calling .rdd() on a limited RDD then performing further operations on it. This is why I didn't use EnsureRequirements here: if we did, then we'd end up shuffling all limited partitions to a single non-driver partition, then limiting that and collecting to the driver, degrading performance in the case where a limit is the terminal operation in the query plan.
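
To make the take()-style strategy concrete, here is a minimal sketch of the idea (illustrative only; Spark's real executeTake path has more machinery, such as error handling, serialized row transfer, and a tuned scale-up policy): the driver runs jobs over a growing prefix of partitions and stops as soon as enough rows have arrived, so nothing is shuffled and untouched partitions are never computed.

import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch of an incremental, shuffle-free take over an RDD (incrementalTake is a made-up name).
def incrementalTake[T: ClassTag](rdd: RDD[T], n: Int): Array[T] = {
  val buf = new ArrayBuffer[T]
  val totalParts = rdd.partitions.length
  var partsScanned = 0
  var numPartsToTry = 1 // start with a single partition
  while (buf.size < n && partsScanned < totalParts) {
    val parts = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
    // Run a job over just these partitions, asking each for at most the rows still missing.
    val results = rdd.sparkContext.runJob(
      rdd, (it: Iterator[T]) => it.take(n - buf.size).toArray, parts)
    results.foreach(buf ++= _)
    partsScanned += parts.size
    numPartsToTry *= 4 // scan a larger batch of partitions in the next round
  }
  buf.take(n).toArray
}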

private val serializer: Serializer = new UnsafeRowSerializer(child.output.size)
override def requiredChildDistribution: List[Distribution] = AllTuples :: Nil
override def executeCollect(): Array[InternalRow] = {
child.collect {
Contributor

I think that I understand the intent here: it looks like the idea is to let EnsureRequirements handle the shuffling, if necessary, then to strip off the shuffling and walk backwards in the DAG to optimize the collect() case. This is pretty clever.

One concern of mine, though, is that there seems to be an implicit assumption on the types of children that this operator can have and I think it would be a good idea to write them down and pattern-match more explicitly. For example, I'm not sure that this strategy is safe in case the child isn't an exchange because then you run the risk of simply dropping any operators that occur post-exchange. I understand that this case can't crop up in the types of plans that we currently generate but it would be good to future-proof by matching explicitly rather than having the less-constrained collect behavior here.
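
A hedged sketch of the kind of explicit matching being suggested (using the Spark 2.x class name ShuffleExchange, later renamed ShuffleExchangeExec; collectSkippingShuffle is a made-up helper name, and the real logic would live inside the limit operator itself): strip the exchange only when the child has exactly the expected shape, and leave anything else untouched so operators above the exchange cannot be dropped silently.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.exchange.ShuffleExchange

// Sketch only: collect up to `limit` rows on the driver, skipping a shuffle that was
// inserted purely to satisfy the AllTuples requirement.
def collectSkippingShuffle(child: SparkPlan, limit: Int): Array[InternalRow] = child match {
  case exchange: ShuffleExchange =>
    // Expected shape: an Exchange over a locally limited child; scan the child incrementally.
    exchange.child.executeTake(limit)
  case other =>
    // Anything else: fall back to a plain bounded collect instead of guessing at the plan shape.
    other.executeTake(limit)
}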

Contributor

Also, one other idea: if you're going to try this strategy, what do you think about putting this logic into GlobalLimitExec and removing the CollectLimitExec planning logic?

Member Author

That seems like a good idea. I thought about the difference between GlobalLimitExec and CollectLimitExec while refactoring this, but decided not to change too much. Looks like I could remove CollectLimitExec.


viirya commented Oct 25, 2016

retest this please.


viirya commented Oct 25, 2016

In typical operation, we don't necessarily expect CollectLimitExec to appear in the middle of a query plan, so CollectLimitExec.execute() should generally only be called in special cases such as calling .rdd() on a limited RDD then performing further operations on it. This is why I didn't use EnsureRequirements here: if we did, then we'd end up shuffling all limited partitions to a single non-driver partition, then limiting that and collecting to the driver, degrading performance in the case where a limit is the terminal operation in the query plan.

Yeah, I think I understand the intent. So in executeCollect, I strip off the shuffling if needed.


SparkQA commented Oct 25, 2016

Test build #67480 has finished for PR 15596 at commit 76a3eaf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SQL] Remove shuffle codes in CollectLimitExec [SPARK-18089][SQL] Remove CollectLimitExec Oct 25, 2016
override def executeCollect(): Array[InternalRow] = child match {
// This happens when the user is collecting results back to the driver, we could skip
// the shuffling and scan increasingly the RDD to get the limited items.
case g: GlobalLimitExec => g.executeCollect()
Contributor

this is really confusing ...

Member Author

Hmm, I am polishing the comment; hope it's helpful. I'd like to hear any suggestions.

Member Author

I still think this is confusing, as you said. Removed it.


SparkQA commented Oct 25, 2016

Test build #67492 has finished for PR 15596 at commit fbf4fd6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 25, 2016

Test build #67498 has finished for PR 15596 at commit 82ebff4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 25, 2016

Test build #67501 has finished for PR 15596 at commit 44c64e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya force-pushed the refactor-collectlimit branch 3 times, most recently from 53a956e to e919f4a on October 26, 2016 04:51

SparkQA commented Oct 26, 2016

Test build #67557 has finished for PR 15596 at commit e919f4a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 26, 2016

Test build #67556 has finished for PR 15596 at commit 110a3e4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya force-pushed the refactor-collectlimit branch from e919f4a to 360752c on October 26, 2016 08:17
@viirya viirya force-pushed the refactor-collectlimit branch from 360752c to 86b4e42 on October 26, 2016 08:28

SparkQA commented Oct 26, 2016

Test build #67569 has finished for PR 15596 at commit 360752c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 26, 2016

Test build #67571 has finished for PR 15596 at commit 86b4e42.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Helper trait which defines methods that are shared by both
* [[LocalLimitExec]] and [[GlobalLimitExec]].
*/
trait BaseLimitExec extends UnaryExecNode with CodegenSupport {
val limit: Int
override def output: Seq[Attribute] = child.output
override def executeCollect(): Array[InternalRow] = child.executeTake(limit)
override def executeTake(n: Int): Array[InternalRow] = child.executeTake(limit)

@pwoody
I think we will want to add an executeToIterator override as well, to make sure we don't cause a shuffle there.

Member Author

Thanks @pwoody! Agreed. But I am now thinking of not replacing CollectLimitExec with GlobalLimitExec; the reason is in my comment below. Let's wait for @JoshRosen's response. If we decide to keep CollectLimitExec, your change at #15614 can be applied then.
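
For reference, the suggested override would look roughly like the fragment below inside BaseLimitExec (a hedged sketch; see pwoody's #15614 for the actual change): bound the driver-side iterator so that converting to a local iterator never needs a shuffle and never pulls more than limit rows.

// Sketch of the suggested override; relies on SparkPlan.executeToIterator, which
// streams rows back to the driver partition by partition.
override def executeToIterator(): Iterator[InternalRow] =
  child.executeToIterator().take(limit)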


viirya commented Oct 27, 2016

@JoshRosen After a few tries, I think replacing CollectLimitExec with GlobalLimitExec is not a good idea.

The main reason is whole-stage codegen. Since GlobalLimitExec supports whole-stage codegen, it gets wrapped in a WholeStageCodegenExec. So when we do df.limit(1).collect(), for example, we call executeCollect() on the WholeStageCodegenExec wrapping GlobalLimitExec.

WholeStageCodegenExec.executeCollect() is actually SparkPlan.executeCollect(). So we would do the shuffle and then retrieve the results. It doesn't harm anything, but it fails a few tests, as the Jenkins results showed.

Of course we could change the tests to fit this, but I don't think that is a necessary or good way to go.

Another workaround is to override WholeStageCodegenExec.executeCollect(), but as @rxin pointed out in a previous comment, that is confusing.

Based on these facts, I think we had better keep CollectLimitExec and just remove its shuffle code, as I did in the initial commit.

What do you think?

@viirya viirya force-pushed the refactor-collectlimit branch from 492106f to 89c0c62 on October 27, 2016 04:23
@viirya viirya changed the title [SPARK-18089][SQL] Remove CollectLimitExec [SPARK-18089][SQL] Remove shuffle codes in CollectLimitExec Oct 27, 2016

SparkQA commented Oct 27, 2016

Test build #67617 has finished for PR 15596 at commit 492106f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CollectLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode


SparkQA commented Oct 27, 2016

Test build #67618 has finished for PR 15596 at commit 89c0c62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CollectLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode


viirya commented Nov 19, 2016

I'll close this now.

@viirya viirya closed this Nov 19, 2016
@viirya viirya deleted the refactor-collectlimit branch December 27, 2023 18:34