[SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same in Python #15068
Conversation
```scala
private[sql] def collectToPython(): Int = {
  EvaluatePython.registerPicklers()
  withNewExecutionId {
    PythonRDD.collectAndServe(javaToPython.rdd)
  }
}
```
Note that collectAndServe was internally calling rdd.collect().iterator, so this patch's change doesn't have any impact on the maximum number of rows that need to be buffered in the JVM; this would not have been the case if the old code used collectAndServeIterator.
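For context, here is a rough sketch of the Python side of this change, assuming the Spark 2.0-era PySpark internals (the private helpers below may differ across versions): `collect()` asks the JVM to execute the whole query and then unpickles the rows it serves over a local socket, instead of going through `df.rdd` and `toLocalIterator`.

```python
from pyspark.rdd import _load_from_socket
from pyspark.serializers import BatchedSerializer, PickleSerializer
from pyspark.traceback_utils import SCCallSiteSync


def collect_via_jvm(df):
    """Sketch of the new collect() path: the JVM executes the whole query
    (collectToPython) and serves the pickled rows over a local socket, which
    Python then deserializes. `df` is assumed to be a pyspark.sql.DataFrame."""
    with SCCallSiteSync(df._sc):
        port = df._jdf.collectToPython()
    return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
```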
Test build #65288 has finished for PR 15068 at commit
Test build #65290 has finished for PR 15068 at commit
## What changes were proposed in this pull request?

`CollectLimit.execute()` incorrectly omits per-partition limits, leading to performance regressions when this code path is hit (which should not happen in normal operation, but can occur in some cases; see #15068 for one example).

## How was this patch tested?

Regression test in SQLQuerySuite that asserts the number of records scanned from the input RDD.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15070 from JoshRosen/SPARK-17515.
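A per-partition limit simply caps how many rows each partition contributes before the global limit is applied. A rough PySpark analogue of that idea (illustrative only, not the actual `CollectLimit` code path; the app name and data sizes below are made up):

```python
import itertools

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("per-partition-limit-sketch").getOrCreate()

# Ten partitions of 100 rows each.
rdd = spark.sparkContext.parallelize(range(1000), numSlices=10)

# Partition-local limit: islice stops consuming each partition's iterator
# after one row, so no partition is scanned past its first element.
limited = rdd.mapPartitions(lambda rows: itertools.islice(rows, 1))

print(limited.collect())  # at most one row per partition reaches the driver

spark.stop()
```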
@JoshRosen The current patch looks good to me. Could you also fix the case where a `LocalLimit` is inserted when we turn a DataFrame with a limit into a Python RDD?
@JoshRosen Just saw the other patch, LGTM
Merging this into master and 2.0.
## What changes were proposed in this pull request?

In PySpark, `df.take(1)` runs a single-stage job which computes only one partition of the DataFrame, while `df.limit(1).collect()` computes all partitions and runs a two-stage job. This difference in performance is confusing.

The reason why `limit(1).collect()` is so much slower is that `collect()` internally maps to `df.rdd.<some-pyspark-conversions>.toLocalIterator`, which causes Spark SQL to build a query where a global limit appears in the middle of the plan; this, in turn, ends up being executed inefficiently because limits in the middle of plans are now implemented by repartitioning to a single task rather than by running a `take()` job on the driver (this was done in #7334, a patch which was a prerequisite to allowing partition-local limits to be pushed beneath unions, etc.).

In order to fix this performance problem I think that we should generalize the fix from SPARK-10731 / #8876 so that `DataFrame.collect()` also delegates to the Scala implementation and shares the same performance properties. This patch modifies `DataFrame.collect()` to first collect all results to the driver and then pass them to Python, allowing this query to be planned using Spark's `CollectLimit` optimizations.

## How was this patch tested?

Added a regression test in `sql/tests.py` which asserts that the expected number of jobs, stages, and tasks are run for both queries.
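For illustration, a minimal sketch of how the difference can be observed from PySpark (this is not the test added in `sql/tests.py`; the app name, data size, and job-group labels are made up), counting how many jobs each action triggers via the status tracker:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("spark-17514-sketch").getOrCreate()
sc = spark.sparkContext

df = spark.range(0, 1 << 20, numPartitions=64)


def jobs_run(label, action):
    """Run `action` inside a job group and count the jobs it triggered."""
    sc.setJobGroup(label, label)
    action()
    return len(sc.statusTracker().getJobIdsForGroup(label))


print("take(1):            %d job(s)" % jobs_run("take", lambda: df.take(1)))
print("limit(1).collect(): %d job(s)" % jobs_run("limit-collect", lambda: df.limit(1).collect()))

spark.stop()
```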