[SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data. #2347

Conversation
See above where I describe how, for Python RDDs, the input data is automatically cached and then deserialized via a map to an uncached RDD, requiring deserialization of every row for every training iteration. Would it make sense to change this to cache after deserializing instead of before? If so I can file a new ticket and PR.
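The cost described above can be sketched in plain Python (no Spark; all names here are illustrative, not Spark's API): when only the serialized input is cached and deserialization happens in a map() afterward, every training iteration re-deserializes every row.

```python
# Stand-in for the cached-serialized / uncached-deserialized pattern.
deserialize_calls = 0

def deserialize(row):
    """Stand-in for deserializing one cached, serialized row."""
    global deserialize_calls
    deserialize_calls += 1
    return row

cached_serialized = [b"row1", b"row2", b"row3"]  # pretend this RDD is cached

def training_iteration():
    # The deserialized view is uncached, so each pass maps over the
    # cached serialized data again.
    return list(map(deserialize, cached_serialized))

for _ in range(5):
    training_iteration()

# 3 rows x 5 iterations = 15 deserializations; caching *after* the
# deserializing map would reduce this to 3.
print(deserialize_calls)
```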
Can one of the admins verify this patch?
This is hard to tell, because the input RDD may be a simple mapped RDD derived from a cached RDD. Maybe we can change the warning message to "The input data is not directly cached, which may hurt performance if its parent RDDs are not cached either."
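A hedged sketch of the check under discussion, in plain Python. In Spark this lives on the Scala side and uses `RDD.getStorageLevel` and `logWarning`; `StubRDD` and the `StorageLevel` stub here are stand-ins, not Spark's classes.

```python
import logging

class StorageLevel:
    # Stand-in constants; Spark's StorageLevel has many more.
    NONE = "NONE"
    MEMORY_ONLY = "MEMORY_ONLY"

class StubRDD:
    def __init__(self, level=StorageLevel.NONE):
        self._level = level

    def getStorageLevel(self):
        return self._level

def warn_if_uncached(rdd, logger=logging.getLogger("mllib")):
    """Log the uncached-input warning; returns True if it fired."""
    if rdd.getStorageLevel() == StorageLevel.NONE:
        logger.warning(
            "The input data is not directly cached, which may hurt "
            "performance if its parent RDDs are not cached either.")
        return True
    return False
```

Note the message deliberately says "not directly cached": the input may be a cheap map over an RDD that is itself cached, which this check cannot see.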
Sure, I changed the warning message text as you suggested. Do you think the deserialization mapping in the Python RDDs I described is OK (a lightweight operation)? If so, I imagine it would be a problem for the warning message to always be printed when Python is used.
Nice catch!
:) Thanks
@staple For Python, I think caching on the JVM side is good. The only thing we need to take care of is that NaiveBayes and DecisionTree don't need caching.
Hi, I made the requested comment changes. I also filed a separate PR for the caching changes: #2362
Is it possible to add caching for the RDD automatically, instead of showing a warning, if caching is always helpful?
@davies It is hard to tell whether we already have fast access to the input RDD. Force caching may cause problems, e.g.,
test this please
this is ok to test
QA tests have started for PR 2347 at commit
QA tests have finished for PR 2347 at commit
Hi, per the discussion in #2362 the plan is to continue caching before deserialization from Python rather than after, in order to minimize the cached RDD memory footprint. This means that, without further work, warning messages would be logged for every Python MLlib regression and k-means run. I added a patch that suppresses these warning messages during Python runs in a way that I think is fairly unobtrusive. Please let me know what you think.
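One possible shape for the suppression described above, sketched in plain Python: the Python-facing entry point flips a flag so the shared training path skips the warning. The flag and class names are illustrative, not necessarily those in the actual patch.

```python
class StubRDD:
    """Minimal stand-in for an RDD with a storage level."""
    def __init__(self, storage_level="NONE"):
        self.storage_level = storage_level

class IterativeTrainer:
    def __init__(self):
        self._warn_on_uncached = True  # default: warn, as for Scala callers

    def disable_uncached_warning(self):
        # Called by the Python API wrapper, which caches the serialized
        # input on the JVM side itself before deserializing.
        self._warn_on_uncached = False
        return self

    def run(self, rdd):
        warned = False
        if self._warn_on_uncached and rdd.storage_level == "NONE":
            warned = True  # real code would logWarning(...) here
        # ... iterative training over rdd would happen here ...
        return warned

# Scala-style caller: the warning fires on uncached input.
assert IterativeTrainer().run(StubRDD()) is True
# Python wrapper: warning suppressed even though the deserialized RDD is
# uncached, because its serialized parent is cached on the JVM side.
assert IterativeTrainer().disable_uncached_warning().run(StubRDD()) is False
```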
QA tests have started for PR 2347 at commit
QA tests have finished for PR 2347 at commit
Hi, I addressed the recent review comments and merged.
QA tests have started for PR 2347 at commit
QA tests have finished for PR 2347 at commit
Test PASSed.
LGTM. Merged into master. What's your username on JIRA? I'll assign the JIRA to you. Thanks!
Great, thanks. My username is 'staple'; it looks like you already assigned it to me, though.
Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when called with input data that is not cached. KMeans is implemented iteratively, and I believe that GeneralizedLinearAlgorithm's current optimizers are iterative and its future optimizers are also likely to be iterative. RowMatrix's computeSVD is iterative against an RDD when run in DistARPACK mode. ALS and DecisionTree are iterative as well, but they implement RDD caching internally and so do not require a warning.
I added the warning to GeneralizedLinearAlgorithm rather than inside its optimizers, where the iteration actually occurs, because internally GeneralizedLinearAlgorithm maps its input data to an uncached RDD before passing it to an optimizer. (In other words, if the warning were in GradientDescent or another optimizer, it would be printed for every GeneralizedLinearAlgorithm run, regardless of whether its input is cached.) I assume that use of an uncached RDD by GeneralizedLinearAlgorithm is intentional, and that the mapping there (adding label, intercept, and scaling) is a lightweight operation. Arguably a user calling an optimizer such as GradientDescent directly will be knowledgeable enough to cache their data without needing a log warning, so the lack of a warning in the optimizers may be OK.
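The placement argument above can be sketched with stand-in classes (plain Python, not Spark's API): the algorithm maps the user's possibly-cached input to a fresh, uncached RDD before calling the optimizer, so a check inside the optimizer would fire on every run, while a check at the algorithm level reflects what the user actually passed in.

```python
class StubRDD:
    """Stand-in RDD tracking only whether it is cached."""
    def __init__(self, cached=False):
        self.cached = cached

    def map(self, f):
        # A mapped RDD starts out uncached, regardless of its parent.
        return StubRDD(cached=False)

def is_uncached(rdd):
    return not rdd.cached

user_input = StubRDD(cached=True)        # the user cached their data
prepared = user_input.map(lambda p: p)   # stand-in for GLA's label/intercept/scaling map

# Check at the GeneralizedLinearAlgorithm level sees the user's input:
assert is_uncached(user_input) is False
# Check inside the optimizer would always see an uncached mapped RDD:
assert is_uncached(prepared) is True
```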
Some of the documentation examples making use of these iterative algorithms did not cache their training RDDs (while others did). I updated the examples to always cache. I also fixed some (unrelated) minor errors in the documentation examples.