[SPARK-30390][MLLIB] Avoid double caching in mllib.KMeans#runWithWeights. #27052
Changes from all commits
```diff
@@ -219,11 +219,6 @@ class KMeans private (
       data: RDD[(Vector, Double)],
       instr: Option[Instrumentation]): KMeansModel = {

-    if (data.getStorageLevel == StorageLevel.NONE) {
-      logWarning("The input data is not directly cached, which may hurt performance if its"
-        + " parent RDDs are also uncached.")
-    }
-
     // Compute squared norms and cache them.
     val norms = data.map { case (v, _) =>
       Vectors.norm(v, 2.0)
@@ -232,15 +227,13 @@ class KMeans private (
     val zippedData = data.zip(norms).map { case ((v, w), norm) =>
       (new VectorWithNorm(v, norm), w)
     }
-    zippedData.persist(StorageLevel.MEMORY_AND_DISK)
-    val model = runAlgorithmWithWeight(zippedData, instr)
-    zippedData.unpersist()
-
-    // Warn at the end of the run as well, for increased visibility.
     if (data.getStorageLevel == StorageLevel.NONE) {
-      logWarning("The input data was not directly cached, which may hurt performance if its"
-        + " parent RDDs are also uncached.")
+      zippedData.persist(StorageLevel.MEMORY_AND_DISK)
     }
```
Member

I guess we can remove the two warnings in this method? It's not a big deal now if the source data is uncached.
Contributor (Author)

Done
Contributor

What about caching `norms` if `data` is already cached? Like this:

```scala
val handlePersistence = data.getStorageLevel == StorageLevel.NONE
val norms = ...
val zippedData = if (handlePersistence) {
  data.zip(norms).map { case ((v, w), norm) =>
    (new VectorWithNorm(v, norm), w)
  }.persist(StorageLevel.MEMORY_AND_DISK)
} else {
  norms.persist(StorageLevel.MEMORY_AND_DISK)
  data.zip(norms).map { case ((v, w), norm) =>
    (new VectorWithNorm(v, norm), w)
  }
}
...
if (handlePersistence) {
  zippedData.unpersist()
} else {
  norms.unpersist()
}
```
Contributor (Author)

Won't this lead to the double-caching problem we are trying to avoid?
Member

Yeah, I thought that was your point. If `zippedData` were expensive to compute, I'd agree that caching the intermediate values too is worthwhile, and we do that in some places. Here it's not, and the original behavior was to always cache internally, so this is less of a change. This at least skips the internal caching where the data can be inexpensively recomputed.
```diff
+    val model = runAlgorithmWithWeight(zippedData, instr)
+    zippedData.unpersist()

     model
   }
```
Hi, I was testing Spark KMeans. There seems to be an issue here: no matter whether we persist the parent RDD, `data.getStorageLevel` will always be `NONE` because of the following operation, and this causes double caching.
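For context, an RDD's storage level is a per-RDD property and is not inherited by RDDs derived from it, which is why a `getStorageLevel` check on a transformed RDD always sees `NONE`. A minimal sketch in local mode (the object name `StorageLevelDemo` is illustrative, not from the PR):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[1]").setAppName("storage-level-demo"))
    try {
      // Explicitly cache the parent RDD.
      val parent = sc.parallelize(1 to 10).persist(StorageLevel.MEMORY_AND_DISK)

      // A transformation (like the zip/map in runWithWeight) yields a *new*
      // RDD whose storage level is NONE regardless of the parent's level,
      // so a StorageLevel.NONE check on it would always trigger an internal
      // persist and double-cache the underlying data.
      val child = parent.map(x => x * 2)

      assert(parent.getStorageLevel == StorageLevel.MEMORY_AND_DISK)
      assert(child.getStorageLevel == StorageLevel.NONE)
    } finally {
      sc.stop()
    }
  }
}
```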