
Conversation

@mgaido91
Contributor

What changes were proposed in this pull request?

Deprecate KMeans.computeCost, which was introduced as a temporary fix and is no longer needed now that ClusteringEvaluator has been introduced.

How was this patch tested?

manual test (deprecation warning displayed)
Scala

...
scala> model.computeCost(dataset)
warning: there was one deprecation warning; re-run with -deprecation for details
res1: Double = 0.0

Python

>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
...
>>> model.computeCost(df)
/Users/mgaido/apache/spark/python/pyspark/ml/clustering.py:330: DeprecationWarning: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
  " instead.", DeprecationWarning)

@SparkQA

SparkQA commented Feb 16, 2018

Test build #87510 has finished for PR 20629 at commit 2f79bb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Feb 18, 2018

Just want to check - does computeCost do the same thing as the silhouette metric?

@mgaido91
Contributor Author

thanks for taking a look at this @MLnick. No, it doesn't, in the sense that it returns a different result: computeCost is the sum of the squared Euclidean distances between each point and the centroid of the cluster it is assigned to, while the silhouette metric is the average of the silhouette coefficients. So they are completely different formulas.

The semantics are a bit different too. Silhouette measures both cohesion and separation of the clusters, while computeCost as it is measures only cohesion.

Nonetheless, of course both of them can be used to evaluate the result of a clustering algorithm, even though the silhouette is much better suited for this purpose.
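To make the difference concrete, here is a toy plain-Python sketch (not Spark code; the points and cluster assignment are made up) computing both the computeCost-style sum of squared distances and the mean silhouette coefficient on the same data:

```python
import math

# Toy data: two well-separated 2-D clusters, assignment fixed by hand.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]

def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

clusters = {k: [p for p, l in zip(points, labels) if l == k] for k in set(labels)}
centroids = {
    k: (sum(p[0] for p in v) / len(v), sum(p[1] for p in v) / len(v))
    for k, v in clusters.items()
}

# computeCost-style metric: sum of squared Euclidean distances from each
# point to its assigned centroid. Measures cohesion only; lower is better.
cost = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

# Silhouette: per point, a = mean distance to the other points of its own
# cluster, b = mean distance to the nearest other cluster,
# s = (b - a) / max(a, b). The metric is the mean of s over all points,
# so it captures cohesion AND separation, and lies in [-1, 1].
def mean_silhouette():
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q != p]
        a = sum(math.sqrt(sq_dist(p, q)) for q in own) / len(own)
        b = min(
            sum(math.sqrt(sq_dist(p, q)) for q in grp) / len(grp)
            for k, grp in clusters.items() if k != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

print(cost)               # small: tight clusters
print(mean_silhouette())  # near 1: well-separated clusters
```

On the same data the two numbers live on entirely different scales and answer different questions, which is the point being made above.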

@MLnick
Contributor

MLnick commented Feb 18, 2018 via email

@MLnick
Contributor

MLnick commented Feb 18, 2018 via email

@mgaido91
Contributor Author

yes, I agree with you @MLnick. I'd like to ping also @jkbradley and @hhbyyh, who drove the PR which introduced computeCost.

In order to move the same metric to ClusteringEvaluator, I think the work done in the ongoing PR #20600 is very useful: there DistanceMeasure is moved to its own file and a cost method is added to it. So maybe the metric can be added there after that PR goes in. What do you think?

PS: if you agree, could you please also help review that PR?

Thanks.

@mgaido91
Contributor Author

mgaido91 commented Mar 1, 2018

sorry @MLnick, what do you think about my previous comment? Any thoughts? Thanks.

@mgaido91
Contributor Author

@MLnick I checked, and adding computeCost to ClusteringEvaluator has a small drawback: we have to compute the centers for each cluster before we can compute the cost, which involves two passes over the dataset.

Given this, and that this evaluation metric doesn't look very useful in practice, do you think it is worth adding nonetheless?

Thanks.

@SparkQA

SparkQA commented Mar 16, 2018

Test build #88307 has finished for PR 20629 at commit 05680ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

kindly ping @MLnick @jkbradley @hhbyyh

@mgaido91
Contributor Author

mgaido91 commented Apr 6, 2018

kindly ping @MLnick @jkbradley @hhbyyh

@holdenk
Contributor

holdenk commented Apr 6, 2018

So when you say "second pass over the data" - from looking at this it seems like we could do this with just a second map that looks up the predictions against the already computed cluster centers, not a stage boundary, so it probably wouldn't be all that expensive given how Spark does pipelining, unless I'm missing something.

This would mean people would have to set the cluster centers from their model when they wanted to do that type of evaluation, but given that the evaluator wouldn't be able to recover the cluster centers from a test set that differed from the training set, I think that would be reasonable.

That being said, it's been a while since I've looked at the evaluator code, so I could be coming out of left field.
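The single-extra-map idea can be illustrated with a toy plain-Python sketch (hypothetical code, not Spark's API): once the centers are fixed, the cost is one map over the points followed by a sum, with no extra aggregation needed to recompute the centers.

```python
# With the cluster centers already known, computing the cost is a single
# narrow map over the data (assign each point to its nearest center, take
# the squared distance) plus a sum -- no second aggregation pass is needed,
# which is why it would pipeline rather than add a stage boundary in Spark.
centers = [(0.0, 0.5), (10.0, 10.5)]
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def point_cost(p):
    # the "lookup" step: nearest precomputed center
    return min(sq_dist(p, c) for c in centers)

cost = sum(map(point_cost, data))
print(cost)  # 1.0
```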

@mgaido91
Contributor Author

mgaido91 commented Apr 6, 2018

@holdenk I am not sure I got 100% of what you meant, so I'll try to answer, but let me know if I missed something.

The problem of doing 2 passes is related to the cluster centers. The API of ClusteringEvaluator (as of any Evaluator) is very simple: it has a method which takes a Dataset and returns a value. So, unlike the method here - which is part of KMeansModel and can get the cluster centers from it - there is no way to know the cluster centers.

I understand that you are suggesting to add a setClusterCenters method on the ClusteringEvaluator, but I am not sure whether this is worth it, since the centers are needed only for this metric, while for the others so far (the silhouette measure) they are useless. Moreover, this metric was introduced explicitly as a temporary fix because we were missing any other (better) evaluation metric, and it was supposed to be removed once a better evaluation metric was introduced (please see the related JIRA and PR). So I am not sure that introducing a new method specifically for this metric is a good idea. Anyway, I added it, so you can check it out and we can decide whether it is worth maintaining or not.
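A rough sketch of what the suggested setClusterCenters addition might look like (plain Python with hypothetical names - this is not Spark's actual Evaluator API):

```python
# Hypothetical evaluator shape: evaluate(dataset) alone carries no model
# information, so the centers have to be supplied separately beforehand.
class CostEvaluator:
    def __init__(self):
        self.centers = None

    def setClusterCenters(self, centers):  # assumed name, mirroring the proposal
        self.centers = centers
        return self

    def evaluate(self, dataset):  # dataset: iterable of feature tuples
        if self.centers is None:
            raise ValueError("cluster centers must be set for this metric")
        return sum(
            min(sum((x - y) ** 2 for x, y in zip(p, c)) for c in self.centers)
            for p in dataset
        )

cost = (CostEvaluator()
        .setClusterCenters([(0.0, 0.0), (10.0, 10.0)])
        .evaluate([(0.0, 1.0), (10.0, 11.0)]))
print(cost)  # 2.0
```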

What do you think?

@SparkQA

SparkQA commented Apr 11, 2018

Test build #89207 has finished for PR 20629 at commit 9c8cc67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 11, 2018

Test build #89205 has finished for PR 20629 at commit ca8c2ec.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Contributor

@holdenk holdenk left a comment


Left some draft comments, let's see what @MLnick thinks of this. If he doesn't have time that's ok, but I'd like his input.

I think after seeing this with the new param it may make sense to not require the param (sorry) and do the double computation when the param isn't set, so that the distance measure behaves more like the others but we also don't introduce a slowdown for folks moving from the deprecated code path. What do you think?

/**
* param for metric name in evaluation
* (supports `"silhouette"` (default))
* (supports `"silhouette"` (default), `"kmeansCost"`)
Contributor


If we want to consider kmeansCost a legacy function, let's call it out as such so new people don't start taking a hard dependency on it.

Contributor Author


but does it make sense to introduce something which is already considered legacy when introduced?

I think this brings up the question again: shall we maintain a metric which was introduced only temporarily as a fallback due to the lack of better metrics?

Contributor


Generally speaking, I think it would make sense to maintain the fallback metric until at least Spark 3.0, at which point it would make sense to ask on the user and dev lists and see if anyone has hard dependencies on it or if it is safe to remove.

Contributor Author


ok, I agree. Let's go this way then, thanks.

/**
* param for distance measure to be used in evaluation
* (supports `"squaredEuclidean"` (default), `"cosine"`)
* (supports `"squaredEuclidean"` (default), `"cosine"`, `"euclidean"`)
Contributor


If some models only support some distance measures, we should make that clear in the docs.

..note:: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
"""
warnings.warn("Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator"
Contributor


If we do go this path we need to file a follow up JIRA to update Python ClusteringEvaluator.

Contributor Author


yes, or I can also update it here once we establish for sure what the new API has to look like, whichever you prefer.

@mgaido91
Contributor Author

@holdenk I am not sure about requiring the cluster centers for this metric or not. On one hand, since ClusteringEvaluator should be a general interface for all clustering models and some of them don't provide cluster centers, it may be a good idea to compute them if necessary. On the other, does this metric make sense for any model other than KMeans? And computing the centers on the test dataset would lead to different results than the old API we are replacing. So I am not sure it is the right thing to do.

Honestly, the more we go on, the more my feeling is that we don't really need to move that metric here. We can just deprecate it, saying that better metrics for evaluating a clustering are available in the ClusteringEvaluator (namely the silhouette). This way people can move away from using this metric.

Moreover, sklearn - which is one of the most widespread tools - doesn't offer the ability to compute such a cost (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). The only thing sklearn offers is what it calls inertia (https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/cluster/k_means_.py#L265), i.e. the cost computed on the training set.

So, I think the best option would be to follow what sklearn does:

1 - Introduce in KMeansSummary (or KMeansModel if you prefer) a cost attribute computed on the training set
2 - Deprecate this method, redirecting to ClusteringEvaluator for better metrics and/or to the newly introduced cost attribute
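The first step could be sketched like this in plain Python (toy illustration; the class and attribute names here are made up to mirror the proposal, analogous to sklearn's inertia_):

```python
# Step 1 of the proposal: compute the cost once at fit time on the training
# data and expose it as a model attribute, so no extra pass over the data
# is ever needed to retrieve it later.
class ToyKMeansModel:
    def __init__(self, cluster_centers, training_cost):
        self.cluster_centers = cluster_centers
        self.training_cost = training_cost  # cost on the training set only

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(train, centers):
    # (actual center computation elided; here the centers are given)
    cost = sum(min(sq_dist(p, c) for c in centers) for p in train)
    return ToyKMeansModel(centers, cost)

model = fit([(0.0, 0.0), (0.0, 1.0)], [(0.0, 0.5)])
print(model.training_cost)  # 0.5
```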

What do you think?

@mgaido91
Contributor Author

kindly ping @holdenk @MLnick


@holdenk
Contributor

holdenk commented Jun 13, 2018

cc @sethah

@holdenk
Contributor

holdenk commented Jun 22, 2018

ping @sethah :)

@holdenk
Contributor

holdenk commented Jun 28, 2018

ping @sethah?

@sethah
Contributor

sethah commented Jun 28, 2018

+1 for @mgaido91's plan

@mgaido91
Contributor Author

Thanks @holdenk @sethah, then I'll update this PR accordingly. Thanks.

@SparkQA

SparkQA commented Jun 29, 2018

Test build #92475 has finished for PR 20629 at commit be954c6.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 29, 2018

Test build #92476 has finished for PR 20629 at commit 5a5c31f.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 29, 2018

Test build #92477 has finished for PR 20629 at commit 701b98a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

@holdenk @sethah any more comments on this? As the 2.4 release is approaching, I think it would be great to have this in, so we can remove the deprecated APIs in 3.0... Thanks.

Contributor

@holdenk holdenk left a comment


Love this different approach, I agree we should get this in before 2.4 closes so we can remove it in the 3.x line. One comment though about API compatibility with the constructor. Really hope we can merge this soon :)

@Since("0.8.0")
class KMeansModel @Since("2.4.0") (@Since("1.0.0") val clusterCenters: Array[Vector],
@Since("2.4.0") val distanceMeasure: String)
@Since("2.4.0") val distanceMeasure: String, @Since("2.4.0") val trainingCost: Double)
Contributor


Since we changed the constructor here, and since it is not private, we should provide a similar (and deprecated) constructor without the training cost which calls this one with a default value.

Contributor Author


This constructor was introduced by a previous PR for 2.4, so it was never released (therefore I think we don't need to keep it).

Contributor


That's awesome, let's try and get that fixed up before 2.4 goes out so we don't have to add any workarounds :)

@holdenk
Contributor

holdenk commented Jul 13, 2018

After I merged your other PR this has some minor conflicts, so it needs an update, but I'd be happy to try and get this in at the end of next week during my next review session :)

@SparkQA

SparkQA commented Jul 15, 2018

Test build #93016 has finished for PR 20629 at commit 926c353.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@holdenk holdenk left a comment


LGTM thanks for getting this ready for removal in 3 :)

@holdenk
Contributor

holdenk commented Jul 20, 2018

Merged to master :) Thank you :)

@asfgit asfgit closed this in cc4d64b Jul 20, 2018
@mgaido91
Contributor Author

Thank you for reviewing @holdenk
