[SPARK-1484][MLLIB] Warn when running an iterative algorithm on uncached data. #2347

Conversation
See above where I describe how, for Python RDDs, the input data is automatically cached and then deserialized via a map to an uncached RDD, requiring deserialization of every row for every training iteration. Would it make sense to change this to cache after deserializing instead of before? If so I can file a new ticket and PR.
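The cost described above can be sketched in plain Python (no Spark; all names here are illustrative, not Spark's API): when only the serialized input is cached and deserialization happens in a map() afterward, every training iteration re-deserializes every row.

```python
# Stand-in for the cached-serialized / uncached-deserialized pattern.
deserialize_calls = 0

def deserialize(row):
    """Stand-in for deserializing one cached, serialized row."""
    global deserialize_calls
    deserialize_calls += 1
    return row

cached_serialized = [b"row1", b"row2", b"row3"]  # pretend this RDD is cached

def training_iteration():
    # The deserialized view is uncached, so each pass maps over the
    # cached serialized data again.
    return list(map(deserialize, cached_serialized))

for _ in range(5):
    training_iteration()

# 3 rows x 5 iterations = 15 deserializations; caching *after* the
# deserializing map would reduce this to 3.
print(deserialize_calls)
```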
Can one of the admins verify this patch?
This is hard to tell, because the input RDD may be a simple mapped RDD derived from a cached RDD. Maybe we can change the warning message to "The input data is not directly cached, which may hurt performance if its parent RDDs are not cached either."
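A hedged sketch of the check under discussion, in plain Python. In Spark this lives on the Scala side and uses `RDD.getStorageLevel` and `logWarning`; `StubRDD` and the `StorageLevel` stub here are stand-ins, not Spark's classes.

```python
import logging

class StorageLevel:
    # Stand-in constants; Spark's StorageLevel has many more.
    NONE = "NONE"
    MEMORY_ONLY = "MEMORY_ONLY"

class StubRDD:
    def __init__(self, level=StorageLevel.NONE):
        self._level = level

    def getStorageLevel(self):
        return self._level

def warn_if_uncached(rdd, logger=logging.getLogger("mllib")):
    """Log the uncached-input warning; returns True if it fired."""
    if rdd.getStorageLevel() == StorageLevel.NONE:
        logger.warning(
            "The input data is not directly cached, which may hurt "
            "performance if its parent RDDs are not cached either.")
        return True
    return False
```

Note the message deliberately says "not directly cached": the input may be a cheap map over an RDD that is itself cached, which this check cannot see.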
Sure, I changed the warning message text as you suggested. Do you think the deserialization mapping in the Python RDDs I described is OK (a lightweight operation)? If so, I imagine it would be a problem for the warning message to always be printed when Python is used.
Nice catch!
:) Thanks
@staple For Python, I think caching on the JVM side is good. The only thing we need to take care of is that NaiveBayes and DecisionTree don't need caching.
Hi, I made the requested comment changes. I also filed a separate PR for the caching changes: #2362
Is it possible to add caching for the RDD automatically, instead of showing a warning, if caching is always helpful?
@davies It is hard to tell whether we already have fast access to the input RDD. Force caching may cause problems, e.g.,
test this please
this is ok to test
QA tests have started for PR 2347 at commit
QA tests have finished for PR 2347 at commit
Hi, per the discussion in #2362 the plan is to continue caching before deserialization from Python rather than after, in order to minimize the cached RDD memory footprint. This means that, without further work, warning messages would be logged for every Python MLlib regression and k-means run. I added a patch that suppresses these warning messages during Python runs in a way that I think is fairly unobtrusive. Please let me know what you think.
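One possible shape for the suppression described above, sketched in plain Python: the Python-facing entry point flips a flag so the shared training path skips the warning. The flag and class names are illustrative, not necessarily those in the actual patch.

```python
class StubRDD:
    """Minimal stand-in for an RDD with a storage level."""
    def __init__(self, storage_level="NONE"):
        self.storage_level = storage_level

class IterativeTrainer:
    def __init__(self):
        self._warn_on_uncached = True  # default: warn, as for Scala callers

    def disable_uncached_warning(self):
        # Called by the Python API wrapper, which caches the serialized
        # input on the JVM side itself before deserializing.
        self._warn_on_uncached = False
        return self

    def run(self, rdd):
        warned = False
        if self._warn_on_uncached and rdd.storage_level == "NONE":
            warned = True  # real code would logWarning(...) here
        # ... iterative training over rdd would happen here ...
        return warned

# Scala-style caller: the warning fires on uncached input.
assert IterativeTrainer().run(StubRDD()) is True
# Python wrapper: warning suppressed even though the deserialized RDD is
# uncached, because its serialized parent is cached on the JVM side.
assert IterativeTrainer().disable_uncached_warning().run(StubRDD()) is False
```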
QA tests have started for PR 2347 at commit
QA tests have finished for PR 2347 at commit
Hi, I addressed the recent review comments and merged.
QA tests have started for PR 2347 at commit
QA tests have finished for PR 2347 at commit
Test PASSed.
LGTM. Merged into master. What's your username on JIRA? I'll assign the JIRA to you. Thanks!
Great, thanks. My username is 'staple'; it looks like you already assigned it to me, though.
Add warnings to KMeans, GeneralizedLinearAlgorithm, and computeSVD when called with input data that is not cached. KMeans is implemented iteratively, and I believe that GeneralizedLinearAlgorithm's current optimizers are iterative and its future optimizers are also likely to be iterative. RowMatrix's computeSVD is iterative against an RDD when run in DistARPACK mode. ALS and DecisionTree are iterative as well, but they implement RDD caching internally and so do not require a warning.
I added the warning to GeneralizedLinearAlgorithm rather than inside its optimizers, where the iteration actually occurs, because internally GeneralizedLinearAlgorithm maps its input data to an uncached RDD before passing it to an optimizer. (In other words, if the warning were in GradientDescent or another optimizer, it would be printed for every GeneralizedLinearAlgorithm run, regardless of whether its input is cached.) I assume that use of an uncached RDD by GeneralizedLinearAlgorithm is intentional, and that the mapping there (adding label, intercept, and scaling) is a lightweight operation. Arguably a user calling an optimizer such as GradientDescent directly will be knowledgeable enough to cache their data without needing a log warning, so the lack of a warning in the optimizers may be OK.
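The placement argument above can be sketched with stand-in classes (plain Python, not Spark's API): the algorithm maps the user's possibly-cached input to a fresh, uncached RDD before calling the optimizer, so a check inside the optimizer would fire on every run, while a check at the algorithm level reflects what the user actually passed in.

```python
class StubRDD:
    """Stand-in RDD tracking only whether it is cached."""
    def __init__(self, cached=False):
        self.cached = cached

    def map(self, f):
        # A mapped RDD starts out uncached, regardless of its parent.
        return StubRDD(cached=False)

def is_uncached(rdd):
    return not rdd.cached

user_input = StubRDD(cached=True)        # the user cached their data
prepared = user_input.map(lambda p: p)   # stand-in for GLA's label/intercept/scaling map

# Check at the GeneralizedLinearAlgorithm level sees the user's input:
assert is_uncached(user_input) is False
# Check inside the optimizer would always see an uncached mapped RDD:
assert is_uncached(prepared) is True
```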
Some of the documentation examples making use of these iterative algorithms did not cache their training RDDs (while others did). I updated the examples to always cache. I also fixed some (unrelated) minor errors in the documentation examples.