Conversation

@staple (Contributor) commented Sep 10, 2014

Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when called with input data that is not cached. KMeans is implemented iteratively, and I believe that GeneralizedLinearAlgorithm’s current optimizers are iterative and its future optimizers are also likely to be iterative. RowMatrix’s computeSVD is iterative against an RDD when run in DistARPACK mode. ALS and DecisionTree are iterative as well, but they implement RDD caching internally so do not require a warning.

I added the warning to GeneralizedLinearAlgorithm rather than inside its optimizers, where the iteration actually occurs, because GeneralizedLinearAlgorithm internally maps its input data to an uncached RDD before passing it to an optimizer. (In other words, if the warning were in GradientDescent or another optimizer, it would be printed for every GeneralizedLinearAlgorithm run, regardless of whether the input is cached.) I assume that GeneralizedLinearAlgorithm's use of an uncached RDD is intentional, and that the mapping there (adding labels, intercepts, and scaling) is a lightweight operation. Arguably, a user calling an optimizer such as GradientDescent directly will be knowledgeable enough to cache their data without needing a log warning, so the lack of a warning in the optimizers may be acceptable.

Some of the documentation examples making use of these iterative algorithms did not cache their training RDDs (while others did). I updated the examples to always cache. I also fixed some (unrelated) minor errors in the documentation examples.
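The warning described above boils down to a storage-level check at the start of training. Below is a minimal sketch of that pattern in plain Python with a toy RDD stand-in (not the actual Spark API; `FakeRDD` and `warn_if_uncached` are illustrative names, and the real check happens on the Scala side against `StorageLevel.NONE`):

```python
import logging

logger = logging.getLogger("mllib")

class FakeRDD:
    """Toy stand-in for an RDD that reports its storage level as a string."""
    def __init__(self, storage_level="NONE"):
        self.storage_level = storage_level

    def get_storage_level(self):
        return self.storage_level

def warn_if_uncached(rdd, algorithm):
    """Log a warning when an iterative algorithm receives uncached input.

    Returns True if a warning was emitted, False otherwise.
    """
    if rdd.get_storage_level() == "NONE":
        logger.warning(
            "%s: the input data is not directly cached, which may hurt "
            "performance if its parent RDDs are not cached either.", algorithm)
        return True
    return False

warn_if_uncached(FakeRDD("NONE"), "KMeans")         # warns
warn_if_uncached(FakeRDD("MEMORY_ONLY"), "KMeans")  # silent
```

This sketch only mirrors the control flow: check once at the top of the iterative run, warn, and proceed, rather than failing or caching on the user's behalf.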

@staple (Contributor, Author) commented Sep 10, 2014

See above where I describe how, for Python RDDs, the input data is automatically cached and then deserialized via a map to an uncached RDD, requiring deserialization of every row on every training iteration. Would it make sense to cache after deserializing instead of before? If so, I can file a new ticket and PR.

@SparkQA commented Sep 10, 2014

Can one of the admins verify this patch?

Contributor (inline review comment):

This is hard to tell, because the input RDD may be a simple mapped RDD derived from a cached RDD. Maybe we can change the warning message to: "The input data is not directly cached, which may hurt the performance if its parent RDDs are not cached either."
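The concern here is that a storage-level check sees only the RDD it is handed, not its lineage: a cheap map over a cached RDD reports itself as uncached even though iterating over it is still fast. A toy illustration (plain Python, not the Spark API; names are illustrative):

```python
class FakeRDD:
    """Toy RDD: get_storage_level reflects only this RDD, never its parents."""
    def __init__(self, cached=False, parent=None):
        self.cached = cached
        self.parent = parent

    def get_storage_level(self):
        return "MEMORY_ONLY" if self.cached else "NONE"

    def map(self):
        # A mapped RDD is a new, uncached RDD whose parent may be cached.
        return FakeRDD(cached=False, parent=self)

base = FakeRDD(cached=True)   # the user cached this one
mapped = base.map()           # ...then applied a lightweight map

assert mapped.get_storage_level() == "NONE"                # looks uncached,
assert mapped.parent.get_storage_level() == "MEMORY_ONLY"  # but the parent is cached
```

Hence the hedged wording "not directly cached": the check cannot rule out a cached parent without walking the lineage.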

@staple (Contributor, Author) commented Sep 10, 2014

Sure, I changed the warning message text as you suggested.

Do you think the deserialization mapping in the Python RDDs I described is ok (a lightweight operation)? If so, I imagine it would be a problem for the warning message to always be printed when Python is used.

Contributor (inline review comment):
Nice catch!

Contributor Author (inline review comment):

:) Thanks

@mengxr (Contributor) commented Sep 10, 2014

@staple For Python, I think caching on the JVM side is good. The only thing we need to take care of is that NaiveBayes and DecisionTree don't need caching.

@staple (Contributor, Author) commented Sep 11, 2014

Hi, I made the requested comment changes. I also filed a separate PR for the caching changes: #2362

@davies (Contributor) commented Sep 11, 2014

Is it possible to cache the RDD automatically instead of showing a warning, if caching is always helpful?

@mengxr (Contributor) commented Sep 12, 2014

@davies It is hard to tell whether we already have fast access to the input RDD. Forcing caching may cause problems, e.g.:

  1. kicking out some cached RDDs,
  2. using too much memory if the input data is large but it could be generated from a small RDD.

@mengxr (Contributor) commented Sep 16, 2014

test this please

@mengxr (Contributor) commented Sep 16, 2014

this is ok to test

@SparkQA commented Sep 16, 2014

QA tests have started for PR 2347 at commit 03d0e2f.

  • This patch merges cleanly.

@SparkQA commented Sep 16, 2014

QA tests have finished for PR 2347 at commit 03d0e2f.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@staple (Contributor, Author) commented Sep 16, 2014

Hi, per the discussion in #2362, the plan is to continue caching before deserialization from Python rather than after, in order to minimize the cached RDD's memory footprint.

This means that, without further work, warning messages would be logged for every Python MLlib regression and KMeans run. I added a patch that suppresses these warning messages during Python runs in a way that I think is fairly unobtrusive. Please let me know what you think.
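One unobtrusive way to suppress the warning for Python-originated calls is an internal flag set by the Python API wrapper, since its JVM-side input is intentionally an uncached map over a cached serialized RDD. A sketch of that pattern in plain Python (illustrative names, not the actual patch):

```python
import logging

logger = logging.getLogger("mllib")

class FakeRDD:
    """Toy RDD stand-in reporting its storage level as a string."""
    def __init__(self, storage_level="NONE"):
        self.storage_level = storage_level

def run(rdd, warn_on_uncached=True):
    """Train iteratively; optionally skip the uncached-input warning.

    The Python API wrapper would pass warn_on_uncached=False because its
    input is deliberately an uncached map over a cached serialized RDD.
    Returns True if the warning was emitted.
    """
    warned = False
    if warn_on_uncached and rdd.storage_level == "NONE":
        logger.warning(
            "The input data is not directly cached, which may hurt "
            "performance if its parent RDDs are not cached either.")
        warned = True
    # ... iterative training over rdd would happen here ...
    return warned

run(FakeRDD("NONE"))                          # Scala/Java caller: warns
run(FakeRDD("NONE"), warn_on_uncached=False)  # Python wrapper: silent
```

Keeping the flag internal (rather than part of the public API) is what makes the suppression unobtrusive: external callers still get the warning by default.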

@SparkQA commented Sep 16, 2014

QA tests have started for PR 2347 at commit 9bed1fd.

  • This patch merges cleanly.

@SparkQA commented Sep 16, 2014

QA tests have finished for PR 2347 at commit 9bed1fd.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@staple (Contributor, Author) commented Sep 25, 2014

Hi, I addressed the recent review comments and merged.

@SparkQA commented Sep 25, 2014

QA tests have started for PR 2347 at commit bd49701.

  • This patch merges cleanly.

@SparkQA commented Sep 25, 2014

QA tests have finished for PR 2347 at commit bd49701.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins commented:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20819/

@mengxr (Contributor) commented Sep 25, 2014

LGTM. Merged into master. What's your username on JIRA? I'll assign the JIRA to you. Thanks!

@asfgit closed this in ff637c9, Sep 25, 2014
@staple (Contributor, Author) commented Sep 26, 2014

Great, thanks. My username is 'staple'; it looks like you already assigned it to me, though.
