[SPARK-23451][ML] Deprecate KMeans.computeCost #20629
Conversation
|
Test build #87510 has finished for PR 20629 at commit
|
|
Just want to check - does |
|
thanks for taking a look at this @MLnick. No, it doesn't, in the sense that it returns a different result: this is the sum of the squared Euclidean distances between each point and the centroid of the cluster it is assigned to, while the silhouette metric is the average of the silhouette coefficients. So they are completely different formulas. The semantics are a bit different too: silhouette measures both cohesion and separation of the clusters, while computeCost as it is measures only cohesion. Nonetheless, of course both of them can be used to evaluate the result of a clustering algorithm, even though the silhouette is much better for this purpose. |
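To make the contrast concrete, here is a small plain-Python sketch (toy data, not Spark code; `wssse` and `mean_silhouette` are hypothetical helper names) computing both formulas on the same clustering:

```python
import math

def wssse(points, labels, centers):
    # within-cluster sum of squared Euclidean distances (what computeCost returns)
    return sum(sum((x - c) ** 2 for x, c in zip(p, centers[k]))
               for p, k in zip(points, labels))

def mean_silhouette(points, labels):
    # average of the per-point silhouette coefficient s = (b - a) / max(a, b)
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
    scores = []
    for i, (p, k) in enumerate(zip(points, labels)):
        same = [q for j, (q, m) in enumerate(zip(points, labels)) if m == k and j != i]
        a = sum(dist(p, q) for q in same) / len(same)          # cohesion
        b = min(                                               # separation
            sum(dist(p, q) for q, m in zip(points, labels) if m == other)
            / labels.count(other)
            for other in set(labels) if other != k)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
centers = {0: (0.0, 0.5), 1: (10.0, 0.5)}
print(wssse(points, labels, centers))    # 1.0 (cohesion only)
print(mean_silhouette(points, labels))   # ~0.90 (cohesion and separation)
```

The two numbers are on entirely different scales and reward different properties, which is why one cannot replace the other directly.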
|
Right - so while it’s perhaps a lower quality metric it is different. So I
wonder if deprecation is the right approach (vs say putting the within
cluster sum squares into ClusteringEvaluator).
|
|
Sorry, I meant putting the metric in the evaluator and then also deprecating computeCost.
|
|
yes, I agree with you @MLnick. I'd like to ping also @jkbradley and @hhbyyh, who drove the PR which introduced ClusteringEvaluator, in order to move the same metric there. If you agree, may you please then help review that PR? Thanks. |
|
sorry @MLnick, what do you think about my previous comment? Any thoughts? Thanks. |
|
@MLnick I checked, and adding the metric would require a second pass over the data. Since this evaluation looks not very useful in practice, do you think it is worth adding it nonetheless? Thanks. |
|
Test build #88307 has finished for PR 20629 at commit
|
|
kindly ping @MLnick @jkbradley @hhbyyh |
|
So when you say "second pass over the data": from looking at this, it seems like we could do this with just a second map to look up the predictions in the already computed cluster centers, not a stage boundary, so that probably wouldn't be all that expensive given how Spark does pipelining, unless I'm missing something. This would mean that we'd have to have people set the cluster centers from their model when they wanted to do that evaluation type, but given that the evaluator wouldn't be able to recover the cluster centers from a test set that differed from the training set, I think that would be reasonable. That being said, it's been a while since I've looked at the evaluator code, so I could be coming out of left field. |
|
@holdenk I am not sure I got 100% what you meant, so I'll try to answer, but let me know if I missed something. The problem of doing 2 passes is related to the cluster centers: the evaluator API doesn't know them, so they would either have to be recomputed (a first pass) or be provided by the user. I understand that you are suggesting to add a param to set the cluster centers on the evaluator, so that the extra pass can be avoided. What do you think? |
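The trade-off being discussed can be sketched in plain Python (hypothetical helper names; `rows` stands in for the `(features, prediction)` pairs a transformed DataFrame would contain):

```python
def cost_one_pass(rows, centers):
    # with the centers known, the cost is a single map + sum, no extra stage
    return sum(sum((x - c) ** 2 for x, c in zip(p, centers[k])) for p, k in rows)

def cost_two_passes(rows):
    # without the centers, a first pass must rebuild them from the predictions
    sums, counts = {}, {}
    for p, k in rows:
        acc = sums.setdefault(k, [0.0] * len(p))
        for i, x in enumerate(p):
            acc[i] += x
        counts[k] = counts.get(k, 0) + 1
    centers = {k: tuple(x / counts[k] for x in acc) for k, acc in sums.items()}
    return cost_one_pass(rows, centers)   # second pass over the data

rows = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((10.0, 0.0), 1), ((10.0, 1.0), 1)]
print(cost_two_passes(rows))  # 1.0
```

Note that the recomputed centers only match the model's centers when the predictions are the converged assignments; a param carrying the model's centers avoids both the extra pass and that caveat.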
|
Test build #89207 has finished for PR 20629 at commit
|
|
Test build #89205 has finished for PR 20629 at commit
|
holdenk
left a comment
Left some draft comments; let's see what @MLnick thinks of this. If he doesn't have time that's ok, but I'd like his input.
I think after seeing this with the new param, it maybe makes sense to not require the param (sorry) and do the double computation when the param isn't set, so that the distance measure behaves more like the others but we also don't introduce a slowdown for folks moving from the deprecated code path. What do you think?
  /**
   * param for metric name in evaluation
-  * (supports `"silhouette"` (default))
+  * (supports `"silhouette"` (default), `"kmeansCost"`)
If we want to consider kmeansCost a legacy function, let's call it out as such so new people don't start adding a hard dependency on it.
but does it make sense to introduce something which is already considered legacy when introduced? I think this brings up again the question: shall we maintain a metric which was introduced only temporarily as a fallback due to the lack of better metrics?
Generally speaking, I think it would make sense to maintain the fall-back metric until at least Spark 3.0, at which point I think it would make sense to ask on the user and dev lists and see if anyone has hard dependencies on it or if it is safe to remove.
ok, I agree. Let's go this way then, thanks.
  /**
   * param for distance measure to be used in evaluation
-  * (supports `"squaredEuclidean"` (default), `"cosine"`)
+  * (supports `"squaredEuclidean"` (default), `"cosine"`, `"euclidean"`)
If some models only support some distance measures, we should make that clear in the docs.
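For reference, the three measures named in the doc comment above differ as follows; this is a plain-Python sketch of the standard formulas, not the Spark implementation:

```python
import math

def squared_euclidean(p, q):
    # default measure: sum of squared coordinate differences
    return sum((x - y) ** 2 for x, y in zip(p, q))

def euclidean(p, q):
    # square root of the above
    return math.sqrt(squared_euclidean(p, q))

def cosine_distance(p, q):
    # 1 minus the cosine of the angle between the vectors
    dot = sum(x * y for x, y in zip(p, q))
    norms = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return 1.0 - dot / norms

print(squared_euclidean((0.0, 0.0), (3.0, 4.0)))  # 25.0
print(euclidean((0.0, 0.0), (3.0, 4.0)))          # 5.0
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))    # 1.0 (orthogonal vectors)
```

Cosine distance ignores vector magnitude entirely, which is one reason a measure supported by one model may not be meaningful for another.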
python/pyspark/ml/clustering.py (outdated)
    .. note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
    """
    warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator"
If we do go down this path, we need to file a follow-up JIRA to update the Python ClusteringEvaluator.
yes, or I can also update it here once we establish for sure what the new API has to look like, as you prefer.
|
@holdenk I am not sure about requiring the cluster centers for this metric or not.
Honestly, the more we go on, the more my feeling is that we don't really need to move that metric here. We can just deprecate it, saying that there are better metrics for evaluating a clustering available in the ClusteringEvaluator.
Moreover, sklearn - which is one of the most widespread tools - doesn't offer the ability of computing such a cost (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). The only thing sklearn offers is what it calls inertia_, i.e. the cost computed on the training data and exposed as an attribute of the fitted model.
So, I think the best option would be to follow what sklearn does:
1 - introducing in the model the training cost (what sklearn calls inertia_);
2 - deprecating computeCost.
What do you think? |
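A toy sketch of that design, assuming the sklearn-style approach described above (`ToyKMeansModel` and `fit_toy` are hypothetical stand-ins, not the Spark classes):

```python
from collections import defaultdict

class ToyKMeansModel:
    # the fitted model carries the training cost, so no later pass
    # over the data is needed to retrieve it (like sklearn's inertia_)
    def __init__(self, centers, training_cost):
        self.centers = centers
        self.trainingCost = training_cost

def fit_toy(points, labels):
    # derive centroids from the assignments and record the cost at fit time
    groups = defaultdict(list)
    for p, k in zip(points, labels):
        groups[k].append(p)
    centers = {k: tuple(sum(xs) / len(xs) for xs in zip(*ps))
               for k, ps in groups.items()}
    cost = sum(sum((x - c) ** 2 for x, c in zip(p, centers[k]))
               for p, k in zip(points, labels))  # computed once, during fit
    return ToyKMeansModel(centers, cost)

model = fit_toy([(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)], [0, 0, 1, 1])
print(model.trainingCost)  # 1.0
```

This keeps the cost available for users who rely on it today while leaving evaluation of new data to ClusteringEvaluator.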
|
cc @sethah |
|
ping @sethah :) |
|
ping @sethah? |
|
+1 for @mgaido91's plan |
This reverts commit ca8c2ec.
|
Test build #92475 has finished for PR 20629 at commit
|
|
Test build #92476 has finished for PR 20629 at commit
|
|
Test build #92477 has finished for PR 20629 at commit
|
holdenk
left a comment
Love this different approach. I agree we should get this in before 2.4 closes so we can remove it in the 3.x line. One comment though, for API compatibility with the constructor. Really hope we can merge this soon :)
  @Since("0.8.0")
  class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector],
-    @Since("2.4.0") val distanceMeasure: String)
+    @Since("2.4.0") val distanceMeasure: String, @Since("2.4.0") val trainingCost: Double)
Since we changed the constructor here, and since it is not private, we should provide a similar (and deprecated) constructor without trainingCost which calls this one with a default value.
This constructor was introduced by a previous PR for 2.4, so it was never released (therefore I think we don't need to keep the old one).
That's awesome, let's try and get that fixed up before 2.4 goes out so we don't have to add any workarounds :)
|
After I merged your other PR this has some minor conflicts, so it needs an update, but I'd be happy to try and get this in at the end of next week during my next review session :) |
|
Test build #93016 has finished for PR 20629 at commit
|
holdenk
left a comment
LGTM, thanks for getting this ready for removal in 3.x :)
|
Merged to master :) Thank you :) |
|
Thank you for reviewing @holdenk ☺ |
What changes were proposed in this pull request?
Deprecate KMeans.computeCost, which was introduced as a temporary fix and is not needed anymore, since we introduced ClusteringEvaluator.
How was this patch tested?
Manual test (deprecation warning displayed) in Scala and Python.
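For readers wanting to verify a deprecation warning like this programmatically rather than manually, a minimal sketch using the standard-library warnings machinery (`compute_cost_deprecated` is a hypothetical stand-in, not the actual Spark method):

```python
import warnings

def compute_cost_deprecated():
    # stand-in for the deprecated method: emit the warning, then do the work
    warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. "
                  "Use ClusteringEvaluator instead.", DeprecationWarning)
    return 0.0

# capture warnings instead of printing them, so they can be asserted on
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    compute_cost_deprecated()

assert any(issubclass(w.category, DeprecationWarning) for w in caught)
print(caught[0].message)
```

The same pattern (pytest's `pytest.warns` is an equivalent) is how the Python side of such a deprecation is typically covered by automated tests.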