
Conversation

@mgaido91
Contributor

What changes were proposed in this pull request?

Deprecate KMeans.computeCost, which was introduced as a temporary fix and is no longer needed now that ClusteringEvaluator has been introduced.

How was this patch tested?

manual test (deprecation warning displayed)
Scala

...
scala> model.computeCost(dataset)
warning: there was one deprecation warning; re-run with -deprecation for details
res1: Double = 0.0

Python

>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
...
>>> model.computeCost(df)
/Users/mgaido/apache/spark/python/pyspark/ml/clustering.py:330: DeprecationWarning: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
  " instead.", DeprecationWarning)

@SparkQA

SparkQA commented Feb 16, 2018

Test build #87510 has finished for PR 20629 at commit 2f79bb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Feb 18, 2018

Just want to check - does computeCost do the same thing as the silhouette metric?

@mgaido91
Contributor Author

thanks for taking a look at this @MLnick. No, it doesn't, in the sense that it returns a different result: computeCost is the sum of the squared Euclidean distances between each point and the centroid of the cluster it is assigned to, while the silhouette metric is the average of the silhouette coefficients. So they are completely different formulas.

The semantics are a bit different too. Silhouette measures both cohesion and separation of the clusters, while computeCost as it is measures only cohesion.

Nonetheless, of course both of them can be used to evaluate the result of a clustering algorithm, even though the silhouette is much better suited for this purpose.
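To make the difference concrete, here is a toy plain-Python sketch (not Spark code; the points and cluster assignment are made up) computing both the computeCost-style sum of squared distances and the mean silhouette coefficient on the same data:

```python
import math

# Toy data: two well-separated 2-D clusters, assignment fixed by hand.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]

def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

clusters = {k: [p for p, l in zip(points, labels) if l == k] for k in set(labels)}
centroids = {
    k: (sum(p[0] for p in v) / len(v), sum(p[1] for p in v) / len(v))
    for k, v in clusters.items()
}

# computeCost-style metric: sum of squared Euclidean distances from each
# point to its assigned centroid. Measures cohesion only; lower is better.
cost = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

# Silhouette: per point, a = mean distance to the other points of its own
# cluster, b = mean distance to the nearest other cluster,
# s = (b - a) / max(a, b). The metric is the mean of s over all points,
# so it captures cohesion AND separation, and lies in [-1, 1].
def mean_silhouette():
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q != p]
        a = sum(math.sqrt(sq_dist(p, q)) for q in own) / len(own)
        b = min(
            sum(math.sqrt(sq_dist(p, q)) for q in grp) / len(grp)
            for k, grp in clusters.items() if k != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

print(cost)               # small: tight clusters
print(mean_silhouette())  # near 1: well-separated clusters
```

On the same data the two numbers live on entirely different scales and answer different questions, which is the point being made above.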

@MLnick
Contributor

MLnick commented Feb 18, 2018 via email

@MLnick
Contributor

MLnick commented Feb 18, 2018 via email

@mgaido91
Contributor Author

yes, I agree with you @MLnick. I'd like to ping also @jkbradley and @hhbyyh, who drove the PR which introduced computeCost.

In order to move the same metric to ClusteringEvaluator, I think the work done in the ongoing PR #20600 is very useful: there DistanceMeasure is moved to its own file and a cost method is added to it. So maybe the metric can be added there after that PR goes in. What do you think?

PS: if you agree, could you please also help review that PR?

Thanks.

@mgaido91
Contributor Author

mgaido91 commented Mar 1, 2018

sorry @MLnick, what do you think about my previous comment? Any thoughts? Thanks.

@mgaido91
Contributor Author

@MLnick I checked, and adding computeCost to ClusteringEvaluator has a small drawback: we have to compute the centers for each cluster before we can compute the cost, which involves two passes over the dataset.

Given this, and that this evaluation metric doesn't look very useful in practice, do you think it is worth adding nonetheless?

Thanks.

@SparkQA

SparkQA commented Mar 16, 2018

Test build #88307 has finished for PR 20629 at commit 05680ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

kindly ping @MLnick @jkbradley @hhbyyh

@mgaido91
Contributor Author

mgaido91 commented Apr 6, 2018

kindly ping @MLnick @jkbradley @hhbyyh

@holdenk
Contributor

holdenk commented Apr 6, 2018

So when you say "second pass over the data" - from looking at this it seems like we could do this with just a second map that looks up the predictions against the already computed cluster centers, not a stage boundary, so it probably wouldn't be all that expensive given how Spark does pipelining, unless I'm missing something.

This would mean people would have to set the cluster centers from their model when they wanted to do that type of evaluation, but given that the evaluator wouldn't be able to recover the cluster centers from a test set that differed from the training set, I think that would be reasonable.

That being said, it's been a while since I've looked at the evaluator code, so I could be coming out of left field.
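The single-extra-map idea can be illustrated with a toy plain-Python sketch (hypothetical code, not Spark's API): once the centers are fixed, the cost is one map over the points followed by a sum, with no extra aggregation needed to recompute the centers.

```python
# With the cluster centers already known, computing the cost is a single
# narrow map over the data (assign each point to its nearest center, take
# the squared distance) plus a sum -- no second aggregation pass is needed,
# which is why it would pipeline rather than add a stage boundary in Spark.
centers = [(0.0, 0.5), (10.0, 10.5)]
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def point_cost(p):
    # the "lookup" step: nearest precomputed center
    return min(sq_dist(p, c) for c in centers)

cost = sum(map(point_cost, data))
print(cost)  # 1.0
```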

@mgaido91
Contributor Author

mgaido91 commented Apr 6, 2018

@holdenk I am not sure I got 100% of what you meant, so I'll try to answer, but let me know if I missed something.

The problem of doing 2 passes is related to the cluster centers. The API of ClusteringEvaluator (as of any Evaluator) is very simple: it has a method which takes a Dataset and returns a value. So, unlike the method here - which is part of KMeansModel and can get the cluster centers from it - there is no way to know the cluster centers.

I understand that you are suggesting to add a setClusterCenters method on the ClusteringEvaluator, but I am not sure whether this is worth it, since the centers are needed only for this metric, while for the others so far (the silhouette measure) they are useless. Moreover, this metric was introduced explicitly as a temporary fix because we were missing any other (better) evaluation metric, and it was supposed to be removed once a better evaluation metric was introduced (please see the related JIRA and PR). So I am not sure that introducing a new method specifically for this metric is a good idea. Anyway, I added it, so you can check it out and we can decide whether it is worth maintaining or not.
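A rough sketch of what the suggested setClusterCenters addition might look like (plain Python with hypothetical names - this is not Spark's actual Evaluator API):

```python
# Hypothetical evaluator shape: evaluate(dataset) alone carries no model
# information, so the centers have to be supplied separately beforehand.
class CostEvaluator:
    def __init__(self):
        self.centers = None

    def setClusterCenters(self, centers):  # assumed name, mirroring the proposal
        self.centers = centers
        return self

    def evaluate(self, dataset):  # dataset: iterable of feature tuples
        if self.centers is None:
            raise ValueError("cluster centers must be set for this metric")
        return sum(
            min(sum((x - y) ** 2 for x, y in zip(p, c)) for c in self.centers)
            for p in dataset
        )

cost = (CostEvaluator()
        .setClusterCenters([(0.0, 0.0), (10.0, 10.0)])
        .evaluate([(0.0, 1.0), (10.0, 11.0)]))
print(cost)  # 2.0
```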

What do you think?

@SparkQA

SparkQA commented Apr 11, 2018

Test build #89207 has finished for PR 20629 at commit 9c8cc67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 11, 2018

Test build #89205 has finished for PR 20629 at commit ca8c2ec.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Contributor

@holdenk holdenk left a comment


Left some draft comments, let's see what @MLnick thinks of this. If he doesn't have time that's ok, but I'd like his input.

I think after seeing this with the new param it may make sense to not require the param (sorry) and do the double computation when the param isn't set, so that the distance measure behaves more like the others but we also don't introduce a slowdown for folks moving from the deprecated code path. What do you think?

/**
* param for metric name in evaluation
* (supports `"silhouette"` (default))
* (supports `"silhouette"` (default), `"kmeansCost"`)
Contributor


If we want to consider kmeansCost a legacy function, let's call it out as such so new people don't start taking a hard dependency on it.

Contributor Author


but does it make sense to introduce something which is already considered legacy when introduced?

I think this brings up the question again: shall we maintain a metric which was introduced only temporarily as a fallback due to the lack of better metrics?

Contributor


Generally speaking, I think it would make sense to maintain the fallback metric until at least Spark 3.0, at which point it would make sense to ask on the user and dev lists and see if anyone has hard dependencies on it or if it is safe to remove.

Contributor Author


ok, I agree. Let's go this way then, thanks.

/**
* param for distance measure to be used in evaluation
* (supports `"squaredEuclidean"` (default), `"cosine"`)
* (supports `"squaredEuclidean"` (default), `"cosine"`, `"euclidean"`)
Contributor


If some models only support some distance measures, we should make that clear in the docs.

..note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
"""
warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator"
Contributor


If we do go this path we need to file a follow up JIRA to update Python ClusteringEvaluator.

Contributor Author


yes, or I can also update it here once we establish for sure what the new API has to look like, whichever you prefer.

@mgaido91
Contributor Author

@holdenk I am not sure about requiring the cluster centers for this metric or not. On one hand, since ClusteringEvaluator should be a general interface for all clustering models and some of them don't provide cluster centers, it may be a good idea to compute them if necessary. On the other, does this metric make sense for any model other than KMeans? And computing the centers on the test dataset would lead to different results than the old API we are replacing. So I am not sure it is the right thing to do.

Honestly, the more we go on, the more my feeling is that we don't really need to move that metric here. We can just deprecate it, saying that better metrics for evaluating a clustering are available in the ClusteringEvaluator (namely the silhouette). This way people can move away from using this metric.

Moreover, sklearn - which is one of the most widespread tools - doesn't offer the ability to compute such a cost (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). The only thing sklearn offers is what it calls inertia (https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/cluster/k_means_.py#L265), i.e. the cost computed on the training set.

So, I think the best option would be to follow what sklearn does:

1 - Introduce in KMeansSummary (or KMeansModel if you prefer) a cost attribute computed on the training set
2 - Deprecate this method, redirecting to ClusteringEvaluator for better metrics and/or to the newly introduced cost attribute
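The first step could be sketched like this in plain Python (toy illustration; the class and attribute names here are made up to mirror the proposal, analogous to sklearn's inertia_):

```python
# Step 1 of the proposal: compute the cost once at fit time on the training
# data and expose it as a model attribute, so no extra pass over the data
# is ever needed to retrieve it later.
class ToyKMeansModel:
    def __init__(self, cluster_centers, training_cost):
        self.cluster_centers = cluster_centers
        self.training_cost = training_cost  # cost on the training set only

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(train, centers):
    # (actual center computation elided; here the centers are given)
    cost = sum(min(sq_dist(p, c) for c in centers) for p in train)
    return ToyKMeansModel(centers, cost)

model = fit([(0.0, 0.0), (0.0, 1.0)], [(0.0, 0.5)])
print(model.training_cost)  # 0.5
```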

What do you think?

@mgaido91
Contributor Author

kindly ping @holdenk @MLnick


@holdenk
Contributor

holdenk commented Jun 13, 2018

cc @sethah

@holdenk
Contributor

holdenk commented Jun 22, 2018

ping @sethah :)

@holdenk
Contributor

holdenk commented Jun 28, 2018

ping @sethah?

@sethah
Contributor

sethah commented Jun 28, 2018

+1 for @mgaido91's plan

@mgaido91
Contributor Author

Thanks @holdenk @sethah, then I'll update this PR accordingly. Thanks.

@SparkQA

SparkQA commented Jun 29, 2018

Test build #92475 has finished for PR 20629 at commit be954c6.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 29, 2018

Test build #92476 has finished for PR 20629 at commit 5a5c31f.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 29, 2018

Test build #92477 has finished for PR 20629 at commit 701b98a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

@holdenk @sethah any more comments on this? As the 2.4 release is approaching, I think it would be great to have this in, so we can remove the deprecated APIs in 3.0... Thanks.

Contributor

@holdenk holdenk left a comment


Love this different approach, I agree we should get this in before 2.4 closes so we can remove it in the 3.x line. One comment though about API compatibility with the constructor. Really hope we can merge this soon :)

@Since("0.8.0")
class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector],
@Since("2.4.0") val distanceMeasure: String)
@Since("2.4.0") val distanceMeasure: String, @Since("2.4.0") val trainingCost: Double)
Contributor


Since we changed the constructor here, and since it is not private, we should provide a similar (and deprecated) constructor without the training cost which calls this one with a default value.

Contributor Author


This constructor was introduced by a previous PR for 2.4, so it was never released (therefore I think we don't need to keep it).

Contributor


That's awesome, let's try and get that fixed up before 2.4 goes out so we don't have to add any workarounds :)

@holdenk
Contributor

holdenk commented Jul 13, 2018

After I merged your other PR this has some minor conflicts, so it needs an update, but I'd be happy to try and get this in at the end of next week during my next review session :)

@SparkQA

SparkQA commented Jul 15, 2018

Test build #93016 has finished for PR 20629 at commit 926c353.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@holdenk holdenk left a comment


LGTM thanks for getting this ready for removal in 3 :)

@holdenk
Contributor

holdenk commented Jul 20, 2018

Merged to master :) Thank you :)

@asfgit asfgit closed this in cc4d64b Jul 20, 2018
@mgaido91
Contributor Author

Thank you for reviewing @holdenk
