Skip to content

Conversation

@amanomer
Copy link
Contributor

What changes were proposed in this pull request?

Check before caching zippedData (as suggested in #26483 (comment)).

Why are the changes needed?

If the data is already cached before calling run method of KMeans then zippedData.persist() will hurt the performance. Hence, persisting it conditionally.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually.

@amanomer
Copy link
Contributor Author

cc @srowen @zhengruifeng


if (data.getStorageLevel == StorageLevel.NONE) {
zippedData.persist(StorageLevel.MEMORY_AND_DISK)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can remove the two warnings in this method? it's not a big deal now if the source data is uncached.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about caching norms if data is already cached? like this:

val handlePersistence = data.getStorageLevel == StorageLevel.NONE
val norms = ...
val zippedData = if (handlePersistence) {
   data.zip(norms).map { case ((v, w), norm) =>
      (new VectorWithNorm(v, norm), w)
    }.persist(StorageLevel.MEMORY_AND_DISK)
} else {
     norms.persist(StorageLevel.MEMORY_AND_DISK)
     data.zip(norms).map { case ((v, w), norm) =>
        (new VectorWithNorm(v, norm), w)
     }
}

...


if (handlePersistence) {
   zippedData.unpersist()
} else {
   norms.unpersist()
}

Copy link
Contributor Author

@amanomer amanomer Dec 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about caching norms if data is already cached?

Won't this lead to double caching problem which we are trying to avoid?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought that was your point. If zippedData were expensive, I'd agree that caching the intermediate values too is worthwhile, and we do that in some places. Here it's not, and the original behavior was to always cache internally, so I guess this is less of a change. This at least skips it where it can be inexpensively computed

@amanomer amanomer requested a review from srowen December 30, 2019 15:56
@srowen
Copy link
Member

srowen commented Dec 30, 2019

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Dec 30, 2019

Test build #115963 has finished for PR 27052 at commit 7af2785.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@amanomer amanomer requested a review from zhengruifeng January 3, 2020 02:16
@amanomer
Copy link
Contributor Author

amanomer commented Jan 3, 2020

cc @srowen @zhengruifeng

@srowen
Copy link
Member

srowen commented Jan 4, 2020

We can change this further, but this is an improvement and less of a change than anything else we'd do. I'll merge it.

@srowen srowen closed this in 4a234dd Jan 4, 2020
@amanomer
Copy link
Contributor Author

amanomer commented Jan 4, 2020

Thanks @srowen

huaxingao added a commit to huaxingao/spark that referenced this pull request Jan 7, 2020
zippedData.unpersist()

// Warn at the end of the run as well, for increased visibility.
if (data.getStorageLevel == StorageLevel.NONE) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I was testing spark kmeans. There should be an issue that no matter we persist the parent RDD, here the data.getStorageLevel will always be NONE due to the following operation, this will cause double caching.

def run(data: RDD[Vector]): KMeansModel = {
    val instances: RDD[(Vector, Double)] = data.map {
      case (point) => (point, 1.0)
    }
    runWithWeight(instances, None)
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants