[SPARK-17817][PySpark] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes #15389
Conversation
Test build #66485 has finished for PR 15389 at commit

Test build #66486 has finished for PR 15389 at commit

cc @davies
holdenk left a comment:
Thanks for working on this - yay better partitioning for Python :) Some minor comments and it might make sense to do a quick benchmark to make sure we don't have any unintentional regression here?
[[1, 2, 3, 4, 5]]
"""
jrdd = self._jrdd.coalesce(numPartitions, shuffle)
if shuffle:
It seems you could just call repartition here to avoid the code duplication, or swap repartition to call coalesce.
Yeah, to be consistent with the Scala side, it would be nice to rearrange this to have Python's repartition(...) call Python's coalesce(..., shuffle=True).
Makes sense. Updated.
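For reference, the rearrangement being discussed makes `repartition` a thin wrapper around `coalesce` (a rough sketch of the shape, mirroring the Scala side; the exact merged docstring and code may differ):

```python
def repartition(self, numPartitions):
    """
    Return a new RDD that has exactly numPartitions partitions.

    Internally, this uses a shuffle to redistribute data.
    """
    # Mirror the Scala side: repartition(n) simply delegates to
    # coalesce(n, shuffle=True), so the shuffle path lives in one place.
    return self.coalesce(numPartitions, shuffle=True)
```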
jrdd = self._jrdd.coalesce(numPartitions, shuffle)
if shuffle:
    data_java_rdd = self._to_java_object_rdd().coalesce(numPartitions, shuffle)
    jrdd = self.ctx._jvm.SerDeUtil.javaToPython(data_java_rdd)
I'm not as familiar with this part as I should be, but do we have a good idea of how expensive this is? It might be good to do some quick benchmarking just to make sure that this change doesn't have any unintended side effects?
I did a simple benchmark:
import time
num_partitions = 20000
a = sc.parallelize(range(int(1e6)), 2)
start = time.time()
l = a.repartition(num_partitions).glom().map(len).collect()
end = time.time()
print(end - start)
Before: 419.447577953
After: 421.916361094
I think there is no significant difference.
Yeah, that seems close enough that we don't need to worry (and for the big cases, presumably the impact of having better-balanced partitions is well worth the slight overhead).
dusenberrymw left a comment:
Test build #66550 has finished for PR 15389 at commit

This looks good to me. One alternative is that we could try to fix it by doing better shuffling of the batched chunks, but this wouldn't work well for increasing the number of partitions.

Maybe @HyukjinKwon could also do a review pass while we wait for @davies or someone with commit privileges to come by and do a final review.

@holdenk Thank you for cc'ing me. It looks okay to me as targeted, but I feel we need a sign-off.

@holdenk @dusenberrymw @HyukjinKwon Thanks for the review!

Great! LGTM, and thank you for the thorough review/test/feedback from everyone.
| """ | ||
| jrdd = self._jrdd.coalesce(numPartitions, shuffle) | ||
| if shuffle: | ||
| data_java_rdd = self._to_java_object_rdd().coalesce(numPartitions, shuffle) |
It would be great to add an inline comment explaining why this is necessary; otherwise somebody could come in 6 months from now and change this back to `jrdd = self._jrdd.coalesce(numPartitions, shuffle)`.
Comment added. Thank you.
Test build #66645 has finished for PR 15389 at commit

Test build #3306 has finished for PR 15389 at commit

retest this please.

Test build #66651 has finished for PR 15389 at commit

Seems Jenkins is not in working status?

retest this please.

Test build #66654 has finished for PR 15389 at commit
# partitions. However, the RDD from Python is serialized as a single binary data,
# so the distribution fails and produces highly skewed partitions. We need to
# convert it to a RDD of java object before repartitioning.
data_java_rdd = self._to_java_object_rdd().coalesce(numPartitions, shuffle)
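Putting the hunks above together, the shuffle path can be illustrated as a standalone helper (a sketch reconstructed from the snippets in this review; the helper name `coalesce_via_java_objects` and the deserializer handling are illustrative assumptions, not the exact merged code):

```python
from pyspark.rdd import RDD

def coalesce_via_java_objects(rdd, numPartitions, shuffle=True):
    """Convert to an RDD of Java objects before coalescing, so the JVM-side
    shuffle distributes individual elements instead of a few pickled blobs."""
    if shuffle:
        # A pickled Python RDD looks like a handful of huge binary records to the
        # JVM, so repartitioning it directly yields highly skewed partitions.
        data_java_rdd = rdd._to_java_object_rdd().coalesce(numPartitions, shuffle)
        jrdd = rdd.ctx._jvm.SerDeUtil.javaToPython(data_java_rdd)
    else:
        jrdd = rdd._jrdd.coalesce(numPartitions, shuffle)
    # NOTE: the deserializer wiring is simplified here; the real method may need
    # to adjust it to match the pickled output of javaToPython.
    return RDD(jrdd, rdd.ctx, rdd._jrdd_deserializer)
```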
The cause of the skew should be the large batch size; I think we could decrease the batch size to 10, then call repartition in the JVM.
My worry is that _to_java_object_rdd() could be expensive; maybe we should have a benchmark for that.
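For illustration, the batch-size alternative could look roughly like the helper below (a hedged sketch: the helper name is hypothetical, and it assumes PySpark's `BatchedSerializer`/`PickleSerializer` and the private `_reserialize` hook behave as in Spark 2.x):

```python
from pyspark.rdd import RDD
from pyspark.serializers import BatchedSerializer, PickleSerializer

def coalesce_with_small_batches(rdd, numPartitions, shuffle=True, batchSize=10):
    """Re-serialize with a small batch size before coalescing, so the JVM-side
    shuffle sees many small pickled chunks instead of one blob per partition."""
    if shuffle:
        ser = BatchedSerializer(PickleSerializer(), batchSize)
        reserialized = rdd._reserialize(ser)
        jrdd = reserialized._jrdd.coalesce(numPartitions, shuffle)
        return RDD(jrdd, rdd.ctx, reserialized._jrdd_deserializer)
    jrdd = rdd._jrdd.coalesce(numPartitions, shuffle)
    return RDD(jrdd, rdd.ctx, rdd._jrdd_deserializer)
```

This trades the cost of converting to Java objects for a bit more pickling overhead per batch, which is the trade-off the benchmarks below try to measure.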
Hi @davies, actually it seems a simple benchmark was done in #15389 (comment)
If you are worried, I'd like to run another benchmark with larger data and will share the results when I have some time.
@davies Thank you! I did a simple benchmark as above with a decreased batch size; I don't see an improvement in running time. I.e.,
import time
num_partitions = 20000
a = sc.parallelize(range(int(1e6)), 2)
start = time.time()
l = a.repartition(num_partitions).glom().map(len).collect()
end = time.time()
print(end - start)
Before: 419.447577953
_to_java_object_rdd(): 421.916361094
decreasing the batch size: 423.712255955
Maybe it depends on how expensive converting to Java objects actually is, case by case. Is it generally faster than _to_java_object_rdd()? I will open a followup for this change.
Should we have a benchmark with complicated types (to match the assumption that serialization is not trivial)?
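A hypothetical variant of the earlier benchmark using non-trivial records could look like this (the record shape is made up purely for illustration, to make Python-to-Java serialization cost visible):

```python
import time

num_partitions = 20000
# Nested records instead of plain ints, so pickling and any Python<->Java
# conversion are no longer near-trivial.
a = sc.parallelize(range(int(1e6)), 2).map(
    lambda i: {"id": i, "payload": "x" * 64, "tags": [i % 7, i % 13]})

start = time.time()
lengths = a.repartition(num_partitions).glom().map(len).collect()
end = time.time()

print(end - start)                  # wall-clock time, comparable to the earlier numbers
print(min(lengths), max(lengths))   # sanity check that partitions are no longer skewed
```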
Did not realize that we already merged this; should we leave a message here or in the JIRA so we can know who merged this?

@felixcheung merged it, I believe. @felixcheung please make sure you leave a message saying it's merged (along with the branch) when you merge PRs.
… in Highly Skewed Partition Sizes

## What changes were proposed in this pull request?

This change is a followup for #15389, which calls `_to_java_object_rdd()` to solve this issue. Due to the concern about the possibly expensive cost of that call, we can choose to decrease the batch size to solve this issue too.

Simple benchmark:

import time
num_partitions = 20000
a = sc.parallelize(range(int(1e6)), 2)
start = time.time()
l = a.repartition(num_partitions).glom().map(len).collect()
end = time.time()
print(end - start)

Before: 419.447577953
_to_java_object_rdd(): 421.916361094
decreasing the batch size: 423.712255955

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #15445 from viirya/repartition-batch-size.
…kewed Partition Sizes
## What changes were proposed in this pull request?
Quoted from JIRA description:
Calling repartition on a PySpark RDD to increase the number of partitions results in highly skewed partition sizes, with most having 0 rows. The repartition method should evenly spread out the rows across the partitions, and this behavior is correctly seen on the Scala side.
Please reference the following code for a reproducible example of this issue:
num_partitions = 20000
a = sc.parallelize(range(int(1e6)), 2) # start with 2 even partitions
l = a.repartition(num_partitions).glom().map(len).collect() # get length of each partition
min(l), max(l), sum(l)/len(l), len(l) # skewed!
In Scala's `repartition` code, we distribute elements evenly across output partitions. However, the RDD from Python is serialized as a single chunk of binary data, so the distribution fails. We need to convert the RDD in Python to Java objects before repartitioning.
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes apache#15389 from viirya/pyspark-rdd-repartition.