[SPARK-25765][ML] Add training cost to BisectingKMeans summary #22764
Conversation
Test build #97529 has finished for PR 22764 at commit

does the example need to be updated with this new API?

For KMeans we used the ClusteringEvaluator in the examples. Actually, the training cost is not a good way to evaluate a dataset (the evaluation should be done on a dataset different from the training one). Maybe we can also add this API in the example to show that it exists, but I have seen no example with a summary shown so far...
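To make the quantity under discussion concrete, here is a hypothetical, self-contained sketch in plain Scala (no Spark; all names — `TrainingCostSketch`, `Point`, etc. — are invented for illustration) of what the training cost measures: the sum of squared distances from each training point to its nearest centroid.

```scala
object TrainingCostSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum of squared distances from each point to its nearest centroid --
  // conceptually what computeCost reports on the training data.
  def trainingCost(points: Seq[Point], centroids: Seq[Point]): Double =
    points.map(p => centroids.map(c => squaredDistance(p, c)).min).sum
}
```

Because this is computed on the training points themselves, it says nothing about generalization — which is why the comment above recommends ClusteringEvaluator on a separate dataset for actual evaluation.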
```diff
   k: Int,
-  numIter: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k, numIter)
+  numIter: Int,
+  @Since("2.4.0") val trainingCost: Double)
```
2.4.0? or 3.0.0?
this PR targets 2.4, see more context at #22756
oh wait. If the final goal is to have a consistent ML API in 3.0, do we have to put this new API in 2.4?
I wouldn't consider this as mandatory. I think what is mandatory to target for 2.4 is deprecating the computeCost method. But I think this is nice to have, since it is the non-deprecated way for users to access this information. In the PR related to KMeans, there was quite a discussion about it and it was considered to be part of the deprecation change.
SGTM
Ok. I see. Then looks like it is nice to have this in 2.4.
Since #22756 is reverted, are we going to change the Since version for this, too?
```scala
assert(formatVersion == thisFormatVersion)
val rootId = (metadata \ "rootId").extract[Int]
val distanceMeasure = (metadata \ "distanceMeasure").extract[String]
val trainingCost = (metadata \ "trainingCost").extract[Double]
```
hmm, can this read an old model from a previous version?
- Could you avoid modifying the model-loading code in the "mllib" package, and instead modify the code in the "ml" package, i.e., the class `ml.clustering.BisectingKMeansModel.BisectingKMeansModelReader`? You can reference the `KMeans` code: `ml.clustering.KMeansModel.KMeansModelReader`. (Don't let `ml.clustering.BisectingKMeansModel.BisectingKMeansModelReader` call `mllib.clustering.BisectingKMeansModel.load`.)
- And, +1 with what @viirya mentioned: we should keep model-loading compatibility, i.e., add a version check (when >= 2.4) and only then load "training cost". Note that these should be added in `ml.clustering.BisectingKMeansModel.BisectingKMeansModelReader`.
- And, could you also add a version check (when >= 2.4) before loading "training cost" into `ml.clustering.KMeansModel.KMeansModelReader`?
Do other models have this problem? I was told that this change just follows what we did for other models before.
Thank you all for the comments and sorry for the late answer. Just a couple of notes on your comments @WeichenXu123 (I may be missing something, so please correct me if I am wrong):
- I checked `ml.clustering.KMeansModel.KMeansModelReader` and it doesn't store anything related to the summary. The summary is not recovered after loading the model, so I don't see any reason why we should modify the read/load of `ml.clustering.BisectingKMeansModel.BisectingKMeansModelReader`;
- this model can read from previous versions, since this is version "2.0", which was introduced for Spark 2.4; for previous versions, we read/write version "1.0"; the version check method for versioning is used only for the `ml` package, not in `mllib`, where we have this versioning approach;
I was told that this change just follows what we did for other models before.
@cloud-fan Yes, let me link the PR for KMeans doing the same, which is: #20629.
Just a final comment which I hope clarifies the source of the confusion here and the reason for the above comments by @viirya and @WeichenXu123: trainingCost here is a member of the summary, not a parameter of the model, for ml.clustering.BisectingKMeansModel. Instead, it is a member of the model for mllib.clustering.BisectingKMeansModel (we have no summary notion there...). So storing it for mllib is needed in order for the model read back after persisting to be the same as the original one (I think it doesn't pass UTs otherwise). Storing it for ml, instead, is not needed, because the summary is not persisted. If we want to persist the summary for the ml package too, I think we had best create a separate JIRA and PR for it.
Hope this clarifies (sorry for being so verbose).
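The structural distinction drawn above can be sketched in a few lines of plain Scala (all class names here are invented stand-ins, not the actual Spark classes): in the ml-style design the cost hangs off a transient summary that is lost on save/load, while in the mllib-style design it is part of the model's persisted state.

```scala
// ml-style: the training cost lives on a transient summary that is not persisted,
// so save/load never has to carry it.
class SummarySketch(val trainingCost: Double)
class MlModelSketch {
  @transient var summary: Option[SummarySketch] = None
}

// mllib-style: the training cost is a field of the model itself, so it must be
// written out (and read back) for the loaded model to equal the original.
case class MllibModelSketch(trainingCost: Double)
```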
this model can read from previous versions, since this is version "2.0", which was introduced for Spark 2.4; for previous versions, we read/write version "1.0"; the version check method for versioning is used only for the ml package, not in mllib where we have this versioning approach;
I meant: can it read an old model saved by a previous version, not whether this model can read from previous versions.
In other words, when reading a previous model without "trainingCost" in metadata, can this line work well?
```scala
val trainingCost = (metadata \ "trainingCost").extract[Double]
```
@mgaido91
(I haven't tested this, so correct me if I am wrong.)
I suspect this change breaks backwards compatibility for mllib.
I am suspicious of this line in the load method:
```scala
val trainingCost = (metadata \ "trainingCost").extract[Double]
```
When loading a BisectingKMeansModel saved by an old Spark version (e.g. Spark 2.3.1), because it does not contain the "trainingCost" info, I guess this line will throw an error. (Otherwise, what would it return?)
@WeichenXu123 I have explained it in #22764 (comment). If you don't agree with or believe what I said, you can try it.
A model saved in 2.3.1 will have "1.0" as version. So this code is not run. Every model from 2.4.0 on, will be saved with "2.0" as version, so it will have this stored. As mentioned, please notice that SaveLoadV2_0 was introduced for 2.4.0. Of course, if this commit won't go in 2.4, then I'll have to create a SaveLoadV3_0 in order to support it (or, if we agree that this doesn't need to be restored after model persistence, we can just ignore it).
Hope this clarifies. Thanks.
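The version-dispatch behavior described above can be sketched as follows — a hypothetical, self-contained Scala stub (not the actual Spark loader; `Metadata`, `load`, and the fallback value are invented for illustration). The point is that the loader matches on the (class name, format version) pair from the metadata, so a model written by an older Spark as format "1.0" never reaches the branch that extracts "trainingCost".

```scala
object VersionedLoadSketch {
  // Illustrative stand-in for the JSON metadata the real loader parses.
  case class Metadata(className: String, formatVersion: String,
                      trainingCost: Option[Double])

  val thisClassName = "org.apache.spark.mllib.clustering.BisectingKMeansModel"

  // Dispatch on (class name, format version): only the "2.0" branch
  // ever looks for "trainingCost", so old models load fine.
  def load(metadata: Metadata): Double =
    (metadata.className, metadata.formatVersion) match {
      case (`thisClassName`, "1.0") =>
        0.0 // old models carry no training cost; fall back to a default
      case (`thisClassName`, "2.0") =>
        metadata.trainingCost.getOrElse(sys.error("missing trainingCost"))
      case (name, version) =>
        sys.error(s"Cannot load model: $name, $version")
    }
}
```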
Sorry, now I am more confused...
- I think this line must be an already-existing mistake. It is too weird. I think it should be:
  `case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) => val model = SaveLoadV2_0.load(sc, path)`
- Supposing you're right, then in which place does your code call `SaveLoadV2_0`? I can't find it ... ?
yes @WeichenXu123 , you're right, that line is a bug. Thanks for noticing it. Anyway, that is going to be addressed in another PR and it is not (strictly) related to this one. The other option, as I mentioned, is that if we agree that this doesn't need to be restored after model persistence, we can just ignore it in save/load.
OK. After #22790 merged, I think this PR can work.
WeichenXu123 left a comment:
Several suggestions. Thanks!
Since this PR is a little more complicated than we expected, we decided not to have it in 2.4.0. I'm not sure if we can treat it as a special case and put it in 2.4.1, cc @mengxr. Anyway, the other 2 related PRs (deprecating the API and updating the example) are reverted. We need to think about what we should do if we can only do this in 3.0.
as @WeichenXu123 mentioned in #22764 (comment), I don't see other problems with the current PR. The only thing is: do we want to target it for 2.4 or for 3.0? If the latter, I'll update the PR with the proper deprecation messages. Thanks.
I think it can target 3.0, since 2.4 will be released soon and this PR looks a little complex and needs some time to check carefully.
thanks @WeichenXu123, I updated this PR in order to target 3.0.

Test build #98169 has finished for PR 22764 at commit

any more comments on this? Thanks

cc @dbtsai

@dbtsai any comments on this? thanks.

@mgaido91 I'm on Thanksgiving vacation, will be back to the community to help with code review on Nov 21st. Sorry for the delay.

@dbtsai sure, thanks. Sorry for bothering you. Have a nice vacation!

@dbtsai any luck with this? Thanks.

kindly ping @dbtsai
srowen left a comment:
My only real concern is that old models have a training cost of 0, when it's unknown really. I don't think it's worth making the new value an Option[Double] because it's not really that optional. If we can compute it in more cases, that's great, would be fine. If not, probably still OK, just less ideal.
```scala
 * @param featuresCol Name for column of features in `predictions`.
 * @param k Number of clusters.
 * @param numIter Number of iterations.
 * @param trainingCost Sum of squared distances to the nearest centroid for all points in the
```
Would the cost ever be something besides sum of squares? maybe not, just wondering if we should say here what the cost function is
yes, you're right, let me update it with a more generic "cost", thanks.
```diff
   k: Int,
-  numIter: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k, numIter)
+  numIter: Int,
+  @Since("3.0.0") val trainingCost: Double)
```
I don't think it's a big deal for 3.0, but we lose the constructor without the new param. That's probably OK as the summary kind of needs this value.
this constructor is private so I don't think it is a problem to avoid having the previous one.
```diff
   @Since("1.6.0")
-  def this(root: ClusteringTreeNode) = this(root, DistanceMeasure.EUCLIDEAN)
+  def this(root: ClusteringTreeNode) = this(root, DistanceMeasure.EUCLIDEAN, 0.0)
```
On the other hand, we did preserve this old constructor, and that's fine to keep. The other issue I see here is that the cost is 0, when the cost is really unknown.
yes, because this is public, so users may rely on it. The idea is that this is indeed a "new feature" (previously it was not accessible) and we are not guaranteeing new features in the MLLib API. I just followed the same approach which was used for KMeans.
```diff
   val nodes = data.rdd.map(Data.apply).collect().map(d => (d.index, d)).toMap
   val rootNode = buildTree(rootId, nodes)
-  new BisectingKMeansModel(rootNode, DistanceMeasure.EUCLIDEAN)
+  new BisectingKMeansModel(rootNode, DistanceMeasure.EUCLIDEAN, 0.0)
```
Is it possible to compute the cost here after load rather than setting it to 0?
Although it would be possible, I don't think it is a good idea, as this is only for the old MLLib API and it would introduce a significant performance overhead (a pass over the whole dataset) for information which may not be useful at all...
Would it not just be the same? `rootNode.leafNodes.map(_.cost).sum`? If that cost info is present in the nodes (?) it doesn't need a pass over the data (which indeed doesn't exist at this point). If it's valuable enough to include at all, should this info not be correct where it is in fact available?
yes, you're right, thanks. It is indeed available. I am doing it, thanks.
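The fix agreed on above — summing the costs already stored in the tree's leaves instead of re-scanning the data — can be sketched with a hypothetical, self-contained Scala stub (the tree and node types below are invented stand-ins, not Spark's ClusteringTreeNode):

```scala
object TreeCostSketch {
  sealed trait TreeNode {
    def cost: Double
    def children: Seq[TreeNode]
    // Collect the leaves of the clustering tree.
    def leafNodes: Seq[TreeNode] =
      if (children.isEmpty) Seq(this) else children.flatMap(_.leafNodes)
  }
  final case class Node(cost: Double, children: Seq[TreeNode]) extends TreeNode

  // rootNode.leafNodes.map(_.cost).sum -- the total training cost is
  // recoverable from the tree alone, with no extra pass over the data.
  def totalTrainingCost(root: TreeNode): Double =
    root.leafNodes.map(_.cost).sum
}
```

Only the leaf costs contribute: internal nodes' costs are ignored, since every training point belongs to exactly one leaf cluster.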
I think it looks OK except for this comment?
yes, right, sorry, I missed it, I did it only for the version 2.0 and missed this one. I am updating it, thanks.
Test build #100420 has finished for PR 22764 at commit

Test build #100427 has finished for PR 22764 at commit

retest this please

Test build #100433 has finished for PR 22764 at commit

Test build #100597 has finished for PR 22764 at commit

Merged to master
## What changes were proposed in this pull request?

The PR adds the `trainingCost` value to the `BisectingKMeansSummary`, in order to expose the information retrievable by running `computeCost` on the training dataset. This fills the gap with the `KMeans` implementation.

## How was this patch tested?

Improved UTs.

Closes apache#22764 from mgaido91/SPARK-25765.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>