Skip to content

Conversation

@hhbyyh
Copy link
Contributor

@hhbyyh hhbyyh commented Mar 5, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-6177
Add comment to introduce coalesce to LDA example to avoid the possible massive partitions from sc.textFile.

sc.textFile will create RDD with one partition for each file, and the possible massive partitions downgrades LDA performance.

@SparkQA
Copy link

SparkQA commented Mar 5, 2015

Test build #28278 has started for PR 4899 at commit 26a564a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 5, 2015

Test build #28278 has finished for PR 4899 at commit 26a564a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28278/
Test PASSed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable, but as this is a basic example, is it necessary or desirable to always introduce a shuffle here? what if the files aren't numerous, or are large?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; this is a basic example. I like the idea of adding a note telling users they could coalesce if they run this on small documents in practice, but I don't think we should complicate the code too much.

With one document per text file, the files could be pretty big; it just depends on document size.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Mar 6, 2015

@srowen @jkbradley Thanks for review.
I'm not sure about that coalesce always introduce shuffle, given its signature is
def coalesce(numPartitions: Int, shuffle: Boolean = false). In this case, I thought there will be no shuffle.

Any user trying the LDA example will probably meet the performance downgrade since they could have too big or too small partition number (corresponding to input file number), and they will have to spend time investigating or just rush to a conclusion that LDA is slow. I believe only part of the users is aware of the partition behavior of sc.textFile. (That's why it's better to add partitions number check/warning in LDA.run)

Surely I can close the PR if still it's regarded improper. Right now I believe it helps. Looking forward to your comments.

@srowen
Copy link
Member

srowen commented Mar 6, 2015

@hhbyyh that's right, you can coalesce to a smaller number of partitions without a shuffle. However, defaultParallelism could be a larger number of partitions. If defaultParallelism is just a little bit smaller than the source partitioning, you'll get some uneven partitions and could actually slow it down.

It's a larger question, really -- how much can you hide these details from the user, and how much do we expect people can or should understand these things to use Spark effectively? I think an example is not held forth as an example of "fastest possible", but just of "essential, basic usage". Partitioning and shuffles are documented elsewhere, and are not essential to LDA usage. That is, putting it in the example implies you must perform this step, and that's not so.

@jkbradley
Copy link
Member

@hhbyyh Thinking more about it, I think it really depends on what the "usual" usage is. If coalesce is better for the average user's dataset & cluster, then we should keep it since many users will probably copy and paste from the examples to get started. (And vice versa.) I'm really not sure what the most common usage is, hence the move towards simplicity.

But I'm OK either way as long as we include a comment in the example about why users might want to coalesce.

@hhbyyh hhbyyh closed this Mar 7, 2015
@hhbyyh hhbyyh reopened this Mar 9, 2015
@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28376 has started for PR 4899 at commit 9a2d7b6.

  • This patch merges cleanly.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Mar 9, 2015

Thanks a lot for providing the feedback.
Move it to comments as suggested.

@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28376 has finished for PR 4899 at commit 9a2d7b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28376/
Test PASSed.

@srowen
Copy link
Member

srowen commented Mar 9, 2015

Ideally, change the PR description and JIRA to match what this really does. If you don't get to it first, I can do so on merge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we're documenting this, let's edit this a bit more. It's not guaranteed that there will be a partition per text file. I'd say something more like:

"If the input consists of many small files, this can result in a large number of small partitions, which can degrade performance. In this case, consider using coalesce() to create fewer, larger partitions."

@hhbyyh hhbyyh changed the title [SPARK-6177][MLlib] LDA should check partitions size of the input [SPARK-6177][MLlib]Add note in LDA example to remind possible coalesce Mar 10, 2015
@SparkQA
Copy link

SparkQA commented Mar 10, 2015

Test build #28419 has started for PR 4899 at commit a499630.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 10, 2015

Test build #28419 has finished for PR 4899 at commit a499630.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28419/
Test PASSed.

@asfgit asfgit closed this in 9a0272f Mar 10, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants