[SPARK-30390][MLLIB] Avoid double caching in mllib.KMeans#runWithWeights. #27052
Conversation
if (data.getStorageLevel == StorageLevel.NONE) {
  zippedData.persist(StorageLevel.MEMORY_AND_DISK)
}
I guess we can remove the two warnings in this method? it's not a big deal now if the source data is uncached.
Done
what about caching norms if data is already cached? like this:
val handlePersistence = data.getStorageLevel == StorageLevel.NONE
val norms = ...
val zippedData = if (handlePersistence) {
  data.zip(norms).map { case ((v, w), norm) =>
    (new VectorWithNorm(v, norm), w)
  }.persist(StorageLevel.MEMORY_AND_DISK)
} else {
  norms.persist(StorageLevel.MEMORY_AND_DISK)
  data.zip(norms).map { case ((v, w), norm) =>
    (new VectorWithNorm(v, norm), w)
  }
}
...
if (handlePersistence) {
  zippedData.unpersist()
} else {
  norms.unpersist()
}
what about caching norms if data is already cached?
Won't this lead to double caching problem which we are trying to avoid?
Yeah, I thought that was your point. If zippedData were expensive, I'd agree that caching the intermediate values too is worthwhile, and we do that in some places. Here it's not, and the original behavior was to always cache internally, so this is less of a change. This at least skips caching where the data can be inexpensively recomputed.
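To make the trade-off concrete, here is a minimal, self-contained sketch of the handlePersistence ownership pattern the thread is discussing. The StubLevel and StubRDD types below are hypothetical plain-Scala stand-ins, not Spark's API; the point is only the control flow: take ownership of caching when the caller has not cached the input, and release exactly what was persisted.

```scala
// Hypothetical stand-ins for Spark's caching types (not the real API).
sealed trait StubLevel
case object StubNone extends StubLevel
case object StubMemoryAndDisk extends StubLevel

final class StubRDD(var level: StubLevel = StubNone) {
  def persist(l: StubLevel): this.type = { level = l; this }
  def unpersist(): this.type = { level = StubNone; this }
}

// Returns true if this run took ownership of caching the intermediate RDD.
def runSketch(data: StubRDD): Boolean = {
  val handlePersistence = data.level == StubNone
  val zippedData = new StubRDD()
  if (handlePersistence) zippedData.persist(StubMemoryAndDisk)
  // ... iterations would read zippedData here ...
  if (handlePersistence) zippedData.unpersist()
  handlePersistence
}
```

With an uncached input, the sketch persists and later releases the intermediate RDD; with a cached input it never touches caching, which is the double caching this PR avoids.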
Jenkins, test this please
Test build #115963 has finished for PR 27052 at commit
We can change this further, but this is an improvement and less of a change than anything else we'd do. I'll merge it.
Thanks @srowen |
zippedData.unpersist()

// Warn at the end of the run as well, for increased visibility.
if (data.getStorageLevel == StorageLevel.NONE) {
Hi, I was testing Spark KMeans. There seems to be an issue: no matter whether we persist the parent RDD, data.getStorageLevel here will always be NONE because of the following operation, and this causes double caching.
def run(data: RDD[Vector]): KMeansModel = {
  val instances: RDD[(Vector, Double)] = data.map {
    case (point) => (point, 1.0)
  }
  runWithWeight(instances, None)
}
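The point above can be illustrated without Spark: a transformation such as map produces a new RDD whose storage level starts at NONE regardless of the parent's level, so a check on the derived RDD never sees the caller's caching. A minimal sketch with hypothetical plain-Scala stubs (not Spark's API):

```scala
sealed trait StubLevel
case object StubNone extends StubLevel
case object StubMemoryAndDisk extends StubLevel

// Immutable stub RDD: map yields a fresh, uncached RDD, mirroring Spark's
// behavior that derived RDDs never inherit the parent's persistence.
final case class StubRDD[T](level: StubLevel) {
  def map[U](f: T => U): StubRDD[U] = StubRDD[U](StubNone)
  def persist(l: StubLevel): StubRDD[T] = copy(level = l)
}

val data = StubRDD[Double](StubNone).persist(StubMemoryAndDisk) // caller cached the input
val instances = data.map(v => (v, 1.0))                         // derived RDD is uncached
```

So a storage-level check on the mapped RDD sees NONE even when the caller cached the original input, which is why the guard has to consider what the caller already persisted.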
What changes were proposed in this pull request?
Check before caching zippedData (as suggested in #26483 (comment)).
Why are the changes needed?
If the data is already cached before calling the run method of KMeans, then zippedData.persist() will hurt performance. Hence, persisting it conditionally.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manually.