[SPARK-3550][MLLIB] Disable automatic rdd caching for relevant learners. #2412

staple · 2014-09-16T15:36:01Z

The NaiveBayes, ALS, and DecisionTree learners do not require external caching to prevent repeated RDD re-evaluation during learning iterations. NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs.
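To make the distinction above concrete, here is a minimal sketch in plain Python. It does not use Spark: `LazyDataset`, `single_pass_learner`, and `iterative_learner` are hypothetical names invented for illustration, and the toy class only mimics the way an uncached RDD is recomputed on every pass.

```python
class LazyDataset:
    """Toy stand-in for an RDD: recomputed on every pass unless cached."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._use_cache = False
        self._cached = None
        self.evaluations = 0  # how many times the source was recomputed

    def cache(self):
        # Request caching; rows are stored the first time they are computed.
        self._use_cache = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = list(self._rows)
        if self._use_cache:
            self._cached = result
        return result


def single_pass_learner(data):
    # One aggregation over the input, e.g. counting label frequencies.
    counts = {}
    for label in data.collect():
        counts[label] = counts.get(label, 0) + 1
    return counts


def iterative_learner(data, iterations):
    # Reads the input once per iteration, as an iterative optimizer would.
    total = 0
    for _ in range(iterations):
        total += len(data.collect())
    return total


single = LazyDataset(["a", "b", "a"])
single_pass_learner(single)

uncached = LazyDataset(["a", "b", "a"])
iterative_learner(uncached, iterations=5)

cached = LazyDataset(["a", "b", "a"]).cache()
iterative_learner(cached, iterations=5)

print(single.evaluations, uncached.evaluations, cached.evaluations)
```

In this toy model, external caching only pays off for the iterative learner; the single-pass learner evaluates its input once either way.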

SparkQA · 2014-09-16T15:37:09Z

Can one of the admins verify this patch?

mengxr · 2014-09-16T17:05:00Z

add to whitelist

mengxr · 2014-09-16T17:05:06Z

this is ok to test

SparkQA · 2014-09-16T17:09:31Z

QA tests have started for PR 2412 at commit c8ff120.

This patch merges cleanly.

SparkQA · 2014-09-16T18:18:44Z

QA tests have finished for PR 2412 at commit c8ff120.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NonASCIICharacterChecker extends ScalariformChecker

davies · 2014-09-17T17:29:29Z

@staple I also addressed this in #2378. Could you help review this part?

davies · 2014-09-24T00:30:48Z

@staple could you rebase this PR?

staple · 2014-09-25T18:50:20Z

@davies It looks like in your #2378 you already disabled caching for NaiveBayes and DecisionTree. The only difference from this patch is that I disabled caching for ALS as well.

We discussed this a bit here: #2378 (comment). I filed SPARK-3550 as a follow-up to the work on uncached input warnings (#2347). The warnings are only supposed to be printed if the input data is accessed repeatedly over many iterations during learning. That's not the case with ALS, so a warning shouldn't be printed there. But I can see there's a case for caching, because the input data is accessed not once but twice when constructing an intermediate representation of the data. I don't have a strong preference on whether or not we should cache in Python for the ALS learner.

If you are fine with continuing to cache in Python for ALS, then there's no more work to be done for this ticket, SPARK-3550.
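The ALS access pattern described above can be sketched roughly as follows. This is plain Python, not Spark or the actual ALS implementation: `CountingSource` and `als_like_fit` are hypothetical names, and grouping ratings by user and by item is only an assumed example of an "intermediate representation" that takes two passes to build.

```python
class CountingSource:
    """Toy uncached input that counts how many full passes are made over it."""

    def __init__(self, ratings):
        self._ratings = list(ratings)
        self.passes = 0

    def scan(self):
        self.passes += 1
        return iter(self._ratings)


def als_like_fit(source, iterations):
    # Pass 1 over the raw input: group ratings by user.
    by_user = {}
    for user, item, rating in source.scan():
        by_user.setdefault(user, []).append((item, rating))

    # Pass 2 over the raw input: group ratings by item.
    by_item = {}
    for user, item, rating in source.scan():
        by_item.setdefault(item, []).append((user, rating))

    # The learning iterations touch only the intermediate structures,
    # never the raw input again.
    for _ in range(iterations):
        sum(len(v) for v in by_user.values())
        sum(len(v) for v in by_item.values())
    return by_user, by_item


source = CountingSource([(1, 10, 4.0), (1, 11, 3.0), (2, 10, 5.0)])
by_user, by_item = als_like_fit(source, iterations=10)
print(source.passes)
```

Under these assumptions the raw input is read twice regardless of the iteration count, which is why a per-iteration warning would be misleading here while caching could still save one recomputation.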

davies · 2014-09-25T21:29:09Z

@staple thanks, I'd like to keep it as before for ALS. Could you close this PR (maybe also the issue)?

staple · 2014-09-25T21:34:24Z

@davies, sure will do

[SPARK-3550][MLLIB] Disable automatic rdd caching for relevant learners.

c8ff120

staple mentioned this pull request Sep 16, 2014

[SPARK-3488][MLLIB] Cache python RDDs after deserialization for relevant iterative learners. #2362

Closed

staple closed this Sep 25, 2014
