[SPARK-6177][MLlib]Add note in LDA example to remind possible coalesce #4899

hhbyyh · 2015-03-05T05:21:26Z

JIRA: https://issues.apache.org/jira/browse/SPARK-6177
Add a comment to the LDA example introducing coalesce, to avoid the potentially huge number of partitions produced by sc.textFile.

sc.textFile will create an RDD with one partition for each file, and the resulting large number of partitions can degrade LDA performance.

SparkQA · 2015-03-05T05:22:46Z

Test build #28278 has started for PR 4899 at commit 26a564a.

This patch merges cleanly.

SparkQA · 2015-03-05T06:41:22Z

Test build #28278 has finished for PR 4899 at commit 26a564a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-05T06:41:26Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28278/

srowen · 2015-03-05T12:18:46Z

examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala

That seems reasonable, but as this is a basic example, is it necessary or desirable to always introduce a shuffle here? What if the files aren't numerous, or are large?

jkbradley · 2015-03-05

I agree; this is a basic example. I like the idea of adding a note telling users they could coalesce if they run this on small documents in practice, but I don't think we should complicate the code too much.

With one document per text file, the files could be pretty big; it just depends on document size.

hhbyyh · 2015-03-06T01:47:38Z

@srowen @jkbradley Thanks for the review.
I'm not sure that coalesce always introduces a shuffle, given that its signature is
def coalesce(numPartitions: Int, shuffle: Boolean = false). In this case, I expected there would be no shuffle.
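
To make the shuffle flag concrete, here is a minimal sketch (not part of this PR) that compares partition counts after coalescing with and without a shuffle. The object name, the local[2] master, and the 100-partition test RDD are assumptions chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceShuffleSketch {
  // Returns the partition counts after three different coalesce calls
  // on an RDD that starts with 100 partitions.
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    val rdd = sc.parallelize(1 to 1000, numSlices = 100)

    // Default shuffle = false: parent partitions are merged without a shuffle.
    val narrowed = rdd.coalesce(10)

    // Without a shuffle, coalesce cannot increase the partition count.
    val widened = rdd.coalesce(200)

    // With shuffle = true the data is redistributed, so the count can grow.
    val reshuffled = rdd.coalesce(200, shuffle = true)

    (narrowed.partitions.length, widened.partitions.length, reshuffled.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("CoalesceShuffleSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc)) finally sc.stop()
  }
}
```

The second call is the case to watch: asking for more partitions than the RDD already has, without a shuffle, should leave the count at 100.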

Any user trying the LDA example will probably run into the performance degradation, since they could end up with too many or too few partitions (corresponding to the number of input files), and they will have to spend time investigating or will simply jump to the conclusion that LDA is slow. I believe only some users are aware of the partitioning behavior of sc.textFile. (That's why it would be better to add a partition-count check/warning in LDA.run.)

Of course, I can close the PR if it's still considered inappropriate. For now, I believe it helps. Looking forward to your comments.

srowen · 2015-03-06T10:31:57Z

@hhbyyh that's right, you can coalesce to a smaller number of partitions without a shuffle. However, defaultParallelism could be a larger number of partitions. If defaultParallelism is just a little bit smaller than the source partitioning, you'll get some uneven partitions and could actually slow it down.
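
The uneven-partition effect can be sketched with plain arithmetic. The model below is a simplification and an assumption: it supposes that, with no locality information, parent partitions are split into contiguous index ranges, one range per target partition. The object and method names are hypothetical.

```scala
object CoalesceSkewSketch {
  // Simplified model (an assumption): parent partition indices 0 until `parents`
  // are split into `target` contiguous ranges. Returns how many parent
  // partitions land in each coalesced partition.
  // Expects 0 < target <= parents.
  def groupSizes(parents: Int, target: Int): Seq[Int] =
    (0 until target).map { i =>
      val start = (i.toLong * parents / target).toInt
      val end = ((i.toLong + 1) * parents / target).toInt
      end - start
    }

  def main(args: Array[String]): Unit = {
    // Target only slightly smaller than the source: sizes are uneven.
    println(groupSizes(10, 8).mkString(", "))
    // Target divides the source evenly: sizes are uniform.
    println(groupSizes(100, 10).mkString(", "))
  }
}
```

Under this model, going from 10 partitions to 8 leaves two coalesced partitions holding twice as many parent partitions as the other six, so the slowest tasks roughly double in size while the task count barely drops.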

It's a larger question, really -- how much can you hide these details from the user, and how much do we expect people can or should understand these things to use Spark effectively? I think an example is not held forth as an example of "fastest possible", but just of "essential, basic usage". Partitioning and shuffles are documented elsewhere, and are not essential to LDA usage. That is, putting it in the example implies you must perform this step, and that's not so.

jkbradley · 2015-03-06T16:12:03Z

@hhbyyh Thinking more about it, I think it really depends on what the "usual" usage is. If coalesce is better for the average user's dataset & cluster, then we should keep it since many users will probably copy and paste from the examples to get started. (And vice versa.) I'm really not sure what the most common usage is, hence the move towards simplicity.

But I'm OK either way as long as we include a comment in the example about why users might want to coalesce.

SparkQA · 2015-03-09T01:52:46Z

Test build #28376 has started for PR 4899 at commit 9a2d7b6.

This patch merges cleanly.

hhbyyh · 2015-03-09T01:55:10Z

Thanks a lot for the feedback.
Moved it to a comment as suggested.

SparkQA · 2015-03-09T03:12:37Z

Test build #28376 has finished for PR 4899 at commit 9a2d7b6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-09T03:12:41Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28376/

srowen · 2015-03-09T09:57:00Z

Ideally, change the PR description and JIRA to match what this really does. If you don't get to it first, I can do so on merge.

srowen · 2015-03-09T14:21:12Z

examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala

As long as we're documenting this, let's edit this a bit more. It's not guaranteed that there will be a partition per text file. I'd say something more like:

"If the input consists of many small files, this can result in a large number of small partitions, which can degrade performance. In this case, consider using coalesce() to create fewer, larger partitions."
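
For readers who do hit the many-small-files case, a sketch of what following that note might look like is shown below. It is not the code in LDAExample.scala: the helper capPartitions, the default input path, and the cap of four times defaultParallelism are all hypothetical choices made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoalesceInputSketch {
  // Hypothetical helper: reduce the partition count only when it exceeds a cap.
  // Uses the default shuffle = false, so no shuffle is introduced.
  def capPartitions(lines: RDD[String], maxPartitions: Int): RDD[String] =
    if (lines.partitions.length > maxPartitions) lines.coalesce(maxPartitions)
    else lines

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("CoalesceInputSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical input location; pass a real path as the first argument.
      val inputPath = args.headOption.getOrElse("docs/*.txt")
      val textRDD = sc.textFile(inputPath)

      // The factor of 4 is an arbitrary illustration, not a recommendation.
      val capped = capPartitions(textRDD, maxPartitions = sc.defaultParallelism * 4)
      println(s"partitions: ${textRDD.partitions.length} -> ${capped.partitions.length}")
    } finally sc.stop()
  }
}
```

Making the coalesce conditional keeps the step out of the way when the input is already a small number of large files, which is the case raised earlier in the review.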

SparkQA · 2015-03-10T03:02:48Z

Test build #28419 has started for PR 4899 at commit a499630.

This patch merges cleanly.

SparkQA · 2015-03-10T04:22:42Z

Test build #28419 has finished for PR 4899 at commit a499630.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-10T04:22:46Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28419/

add coalesce to LDAExample

26a564a

srowen reviewed Mar 5, 2015

hhbyyh closed this Mar 7, 2015

Merge remote-tracking branch 'upstream/master' into adjustPartition

f7fd5d4

hhbyyh reopened this Mar 9, 2015

move to comment

9a2d7b6

srowen reviewed Mar 9, 2015

hhbyyh changed the title ~~[SPARK-6177][MLlib] LDA should check partitions size of the input~~ [SPARK-6177][MLlib]Add note in LDA example to remind possible coalesce Mar 10, 2015

update comment

a499630

asfgit closed this in 9a0272f Mar 10, 2015
