[SPARK-6177][MLlib]Add note in LDA example to remind possible coalesce #4899
Test build #28278 has started for PR 4899 at commit
Test build #28278 has finished for PR 4899 at commit
Test PASSed.
That seems reasonable, but as this is a basic example, is it necessary or desirable to always introduce a shuffle here? What if the files aren't numerous, or are large?
I agree; this is a basic example. I like the idea of adding a note telling users they could coalesce if they run this on small documents in practice, but I don't think we should complicate the code too much.
With one document per text file, the files could be pretty big; it just depends on document size.
@srowen @jkbradley Thanks for the review. Any user trying the LDA example will probably run into degraded performance, since they could end up with too large or too small a number of partitions (corresponding to the number of input files), and they will have to spend time investigating or just jump to the conclusion that LDA is slow. I believe only some users are aware of this partition behavior. I can certainly close the PR if it's still regarded as improper, but right now I believe it helps. Looking forward to your comments.
@hhbyyh That's right, you can coalesce to a smaller number of partitions without a shuffle. However, it's a larger question, really: how much can you hide these details from the user, and how much do we expect people can or should understand these things to use Spark effectively? I think an example is not held forth as an example of "fastest possible", but just of "essential, basic usage". Partitioning and shuffles are documented elsewhere, and are not essential to LDA usage. That is, putting it in the example implies you must perform this step, and that's not so.
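To illustrate the point about coalescing without a shuffle, here is a toy Python model, not Spark's implementation. It assumes partitions can be represented as plain lists; the function name `coalesce_without_shuffle` is hypothetical. Whole partitions are merged into fewer groups, so no record is redistributed individually, which is what a shuffle would do. Spark's real grouping also takes data locality into account, which this sketch ignores.

```python
def coalesce_without_shuffle(partitions, num_partitions):
    """Toy model: merge whole partitions into at most num_partitions groups.

    Each input partition is moved as a unit, so records are never
    redistributed one by one (that would be a shuffle).
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if num_partitions >= len(partitions):
        # A shuffle-free coalesce cannot increase the partition count,
        # so the partitions are returned unchanged (as copies).
        return [list(p) for p in partitions]
    merged = [[] for _ in range(num_partitions)]
    for index, part in enumerate(partitions):
        # Map contiguous runs of input partitions onto each output partition.
        target = index * num_partitions // len(partitions)
        merged[target].extend(part)
    return merged


# Five single-document partitions (e.g. five small files) become two.
small_parts = [["doc1"], ["doc2"], ["doc3"], ["doc4"], ["doc5"]]
print(coalesce_without_shuffle(small_parts, 2))
```

The trade-off is that merged partitions can end up uneven in size; balancing them evenly is where a shuffle would come in.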
@hhbyyh Thinking more about it, I think it really depends on what the "usual" usage is. If coalesce is better for the average user's dataset & cluster, then we should keep it since many users will probably copy and paste from the examples to get started. (And vice versa.) I'm really not sure what the most common usage is, hence the move towards simplicity. But I'm OK either way as long as we include a comment in the example about why users might want to coalesce.
Test build #28376 has started for PR 4899 at commit
Thanks a lot for providing the feedback.
Test build #28376 has finished for PR 4899 at commit
Test PASSed.
Ideally, change the PR description and JIRA to match what this really does. If you don't get to it first, I can do so on merge.
As long as we're documenting this, let's edit this a bit more. It's not guaranteed that there will be a partition per text file. I'd say something more like:
"If the input consists of many small files, this can result in a large number of small partitions, which can degrade performance. In this case, consider using coalesce() to create fewer, larger partitions."
Test build #28419 has started for PR 4899 at commit
Test build #28419 has finished for PR 4899 at commit
Test PASSed.
JIRA: https://issues.apache.org/jira/browse/SPARK-6177
Add a comment to the LDA example that introduces coalesce, to avoid the potentially massive number of partitions produced by sc.textFile.
sc.textFile will create an RDD with one partition for each file, and the potentially massive number of partitions degrades LDA performance.
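For readers who want to act on such a note, the following is a minimal sketch of optional coalescing when loading a corpus. It is written in Python against the PySpark method names `textFile`, `getNumPartitions`, and `coalesce`, and is not the code from this PR. The helper name `load_corpus` and the `max_partitions` parameter are hypothetical, and `sc` is assumed to be an existing SparkContext.

```python
def load_corpus(sc, input_path, max_partitions=None):
    """Load text documents, optionally reducing the partition count.

    Assumptions: sc is an existing SparkContext, and max_partitions is a
    hypothetical tuning knob that the caller picks for their own cluster.
    """
    docs = sc.textFile(input_path)
    # Many small input files can produce many small partitions, which can
    # degrade performance. Coalesce only when the caller asked for a cap
    # and the current partition count exceeds it.
    if max_partitions is not None and docs.getNumPartitions() > max_partitions:
        docs = docs.coalesce(max_partitions)
    return docs
```

A call might look like `load_corpus(sc, "docs/*.txt", max_partitions=64)`, where both the path and the value 64 are placeholders. Leaving `max_partitions` unset keeps the default behavior, in line with the reviewers' point that coalescing is optional and should not be implied as a required step.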