[SPARK-29818][MLLIB] Missing persist on RDD #26454
Changes from all commits
4acb797
fa920f7
daad006
cec29ff
4aa39fc
31c0fe7
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -22,6 +22,7 @@ import org.apache.spark.internal.Logging | |
| import org.apache.spark.mllib.evaluation.binary._ | ||
| import org.apache.spark.rdd.{RDD, UnionRDD} | ||
| import org.apache.spark.sql.{DataFrame, Row} | ||
| + import org.apache.spark.storage.StorageLevel | ||
|
|
||
| /** | ||
| * Evaluator for binary classification. | ||
|
|
@@ -165,13 +166,17 @@ class BinaryClassificationMetrics @Since("3.0.0") ( | |
| confusions: RDD[(Double, BinaryConfusionMatrix)]) = { | ||
| // Create a bin for each distinct score value, count weighted positives and | ||
| // negatives within each bin, and then sort by score values in descending order. | ||
| - val counts = scoreLabelsWeight.combineByKey( | ||
| + val binnedWeights = scoreLabelsWeight.combineByKey( | ||
| createCombiner = (labelAndWeight: (Double, Double)) => | ||
| new BinaryLabelCounter(0.0, 0.0) += (labelAndWeight._1, labelAndWeight._2), | ||
| mergeValue = (c: BinaryLabelCounter, labelAndWeight: (Double, Double)) => | ||
| c += (labelAndWeight._1, labelAndWeight._2), | ||
| mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2 | ||
| - ).sortByKey(ascending = false) | ||
| + ) | ||
| + if (scoreLabelsWeight.getStorageLevel != StorageLevel.NONE) { | ||
| + binnedWeights.persist() | ||
| + } | ||
| + val counts = binnedWeights.sortByKey(ascending = false) | ||
|
Member
Wait, hm, I don't understand this. You persist binnedWeights, but it is now only used once. Why? If anything it's binnedCounts that needs persisting. I'm still not clear if it makes enough difference to matter.
Contributor
Author
Member
Yes, but why bother persisting binnedWeights? You still recompute everything between it and binnedCounts twice, and avoiding that recomputation would be the point, I think.
Contributor
Author
I think
Contributor
Author
I might be wrong here. Kindly correct me @srowen
Member
Caching helps where more than one action is performed on the same RDD. That's not the case here. Each of the first two has one thing executed on it. sortByKey is not an action, anyway.
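To make the point about actions concrete, here is a minimal sketch that is not taken from the PR. It assumes a local Spark session; the `PersistSketch` object, the `evaluationsForTwoActions` helper, and the accumulator name are hypothetical and exist only for illustration. It runs two actions on one RDD and counts how often the `map` function is evaluated, with and without `persist()`.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object PersistSketch {
  // Hypothetical helper: runs two actions on the same RDD and returns how many
  // times the map function was evaluated across both of them.
  def evaluationsForTwoActions(sc: SparkContext, persist: Boolean): Long = {
    // Illustrative instrumentation only; accumulator updates made inside
    // transformations can be repeated if tasks are re-executed.
    val evaluations = sc.longAccumulator("evaluations")

    // map is a lazy transformation: nothing runs until an action is called.
    val scored = sc.parallelize(1 to 1000, numSlices = 4).map { i =>
      evaluations.add(1)
      (i % 10).toDouble
    }

    if (persist) scored.persist() // MEMORY_ONLY by default

    scored.count() // first action: evaluates the lineage (and fills the cache if persisted)
    scored.sum()   // second action: re-evaluates the lineage unless cached blocks are available

    if (persist) scored.unpersist()
    evaluations.value
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    println(s"evaluations without persist: ${evaluationsForTwoActions(sc, persist = false)}")
    println(s"evaluations with persist:    ${evaluationsForTwoActions(sc, persist = true)}")

    spark.stop()
  }
}
```

In a local run with no task retries, the unpersisted variant should report roughly twice as many evaluations as the persisted one. When an RDD feeds only a single downstream computation, `persist()` adds storage overhead without saving any work, which is the concern raised in this thread.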
Contributor
Author
Oh, okay. One question here, will it be worth persisting
Member
Doesn't seem so. But that is the question I'd put to you in these cases: are you sure it makes a difference meaningful enough to overcome the overhead? I could imagine so here; I'm just wondering whether these changes are based on investigation or benchmarking, versus just trying to persist lots of things.
Contributor
Author
I think, no. Persisting
Contributor
Author
TYSM @srowen. Looking forward to more learning opportunities.
||
|
|
||
| val binnedCounts = | ||
| // Only down-sample if bins is > 0 | ||
|
|
@@ -215,6 +220,7 @@ class BinaryClassificationMetrics @Since("3.0.0") ( | |
| val partitionwiseCumulativeCounts = | ||
| agg.scanLeft(new BinaryLabelCounter())((agg, c) => agg.clone() += c) | ||
| val totalCount = partitionwiseCumulativeCounts.last | ||
| + binnedWeights.unpersist() | ||
| logInfo(s"Total counts: $totalCount") | ||
| val cumulativeCounts = binnedCounts.mapPartitionsWithIndex( | ||
| (index: Int, iter: Iterator[(Double, BinaryLabelCounter)]) => { | ||
|
|
||
This still isn't unpersisted
Do you mean it should be unpersisted after use?
Yes, otherwise the caller has no way to unpersist it until it's GCed.
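One way to keep an internal `persist()` from leaking to the caller is to scope it, as in the sketch below. This is an illustration under assumptions, not the code in this PR: `ScopedPersist` and `withPersisted` are hypothetical names, and the choice of `MEMORY_AND_DISK` is arbitrary. The helper persists only when the RDD is not already cached, and unpersists only what it persisted itself.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ScopedPersist {
  // Hypothetical helper, not part of Spark or of this PR.
  // The body must run all of its actions before returning; an RDD returned
  // lazily from the body would be evaluated after the cache is released.
  def withPersisted[T, A](rdd: RDD[T])(body: RDD[T] => A): A = {
    // Take ownership of caching only if nobody has persisted the RDD already.
    val owned = rdd.getStorageLevel == StorageLevel.NONE
    if (owned) rdd.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      body(rdd)
    } finally {
      // Release only what this helper persisted, so a caller's cache is left alone.
      if (owned) rdd.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("scoped-persist-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy (score, weight) pairs binned by score, loosely echoing the diff above.
    val binned = sc.parallelize(Seq(0.9 -> 1.0, 0.9 -> 0.0, 0.4 -> 1.0)).reduceByKey(_ + _)

    // Two actions share the cached RDD inside the scope.
    val (bins, totalWeight) = withPersisted(binned) { cached =>
      (cached.count(), cached.values.sum())
    }

    println(s"bins=$bins totalWeight=$totalWeight")
    println(s"storage level after use: ${binned.getStorageLevel}")

    spark.stop()
  }
}
```

The `try`/`finally` releases the cache even if an action inside the body throws, so the RDD does not stay cached until it is garbage collected.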