[SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same in Python #15068
Conversation
```scala
private[sql] def collectToPython(): Int = {
  EvaluatePython.registerPicklers()
  withNewExecutionId {
    PythonRDD.collectAndServe(javaToPython.rdd)
  }
}
```
Note that collectAndServe was internally calling rdd.collect().iterator, so this patch's change doesn't have any impact on the maximum number of rows that need to be buffered in the JVM; this would not have been the case if the old code used collectAndServeIterator.
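For context, here is a rough sketch of the Python side of this change, assuming the Spark 2.0-era PySpark internals (the private helpers below may differ across versions): `collect()` asks the JVM to execute the whole query and then unpickles the rows it serves over a local socket, instead of going through `df.rdd` and `toLocalIterator`.

```python
from pyspark.rdd import _load_from_socket
from pyspark.serializers import BatchedSerializer, PickleSerializer
from pyspark.traceback_utils import SCCallSiteSync


def collect_via_jvm(df):
    """Sketch of the new collect() path: the JVM executes the whole query
    (collectToPython) and serves the pickled rows over a local socket, which
    Python then deserializes. `df` is assumed to be a pyspark.sql.DataFrame."""
    with SCCallSiteSync(df._sc):
        port = df._jdf.collectToPython()
    return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
```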
Test build #65288 has finished for PR 15068 at commit
Test build #65290 has finished for PR 15068 at commit
## What changes were proposed in this pull request?

`CollectLimit.execute()` incorrectly omits per-partition limits, leading to performance regressions when this code path is hit (which should not happen in normal operation, but can occur in some cases; see #15068 for one example).

## How was this patch tested?

Regression test in SQLQuerySuite that asserts the number of records scanned from the input RDD.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15070 from JoshRosen/SPARK-17515.
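A per-partition limit simply caps how many rows each partition contributes before the global limit is applied. A rough PySpark analogue of that idea (illustrative only, not the actual `CollectLimit` code path; the app name and data sizes below are made up):

```python
import itertools

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("per-partition-limit-sketch").getOrCreate()

# Ten partitions of 100 rows each.
rdd = spark.sparkContext.parallelize(range(1000), numSlices=10)

# Partition-local limit: islice stops consuming each partition's iterator
# after one row, so no partition is scanned past its first element.
limited = rdd.mapPartitions(lambda rows: itertools.islice(rows, 1))

print(limited.collect())  # at most one row per partition reaches the driver

spark.stop()
```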
@JoshRosen The current patch looks good to me. Could you also fix the case where a `LocalLimit` is inserted when we turn a DataFrame with a limit into a Python RDD?
@JoshRosen Just saw the other patch, LGTM
Merging this into master and 2.0.
## What changes were proposed in this pull request?

In PySpark, `df.take(1)` runs a single-stage job which computes only one partition of the DataFrame, while `df.limit(1).collect()` computes all partitions and runs a two-stage job. This difference in performance is confusing.

The reason why `limit(1).collect()` is so much slower is that `collect()` internally maps to `df.rdd.<some-pyspark-conversions>.toLocalIterator`, which causes Spark SQL to build a query where a global limit appears in the middle of the plan; this, in turn, ends up being executed inefficiently because limits in the middle of plans are now implemented by repartitioning to a single task rather than by running a `take()` job on the driver (this was done in #7334, a patch which was a prerequisite to allowing partition-local limits to be pushed beneath unions, etc.).

In order to fix this performance problem I think that we should generalize the fix from SPARK-10731 / #8876 so that `DataFrame.collect()` also delegates to the Scala implementation and shares the same performance properties. This patch modifies `DataFrame.collect()` to first collect all results to the driver and then pass them to Python, allowing this query to be planned using Spark's `CollectLimit` optimizations.

## How was this patch tested?

Added a regression test in `sql/tests.py` which asserts that the expected number of jobs, stages, and tasks are run for both queries.
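For illustration, a minimal sketch of how the difference can be observed from PySpark (this is not the test added in `sql/tests.py`; the app name, data size, and job-group labels are made up), counting how many jobs each action triggers via the status tracker:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("spark-17514-sketch").getOrCreate()
sc = spark.sparkContext

df = spark.range(0, 1 << 20, numPartitions=64)


def jobs_run(label, action):
    """Run `action` inside a job group and count the jobs it triggered."""
    sc.setJobGroup(label, label)
    action()
    return len(sc.statusTracker().getJobIdsForGroup(label))


print("take(1):            %d job(s)" % jobs_run("take", lambda: df.take(1)))
print("limit(1).collect(): %d job(s)" % jobs_run("limit-collect", lambda: df.limit(1).collect()))

spark.stop()
```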