[SPARK-25765][ML] Add training cost to BisectingKMeans summary #22764
Conversation
Test build #97529 has finished for PR 22764 at commit

does the example need to be updated with this new API?

For KMeans we used the ClusteringEvaluator in the examples. Actually, the training cost is not a good way to evaluate a dataset (the evaluation should be done on a dataset different from the training one). Maybe we can also add this API in the example to show that it exists, but I have seen no example with a summary shown so far...
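To make the quantity under discussion concrete, here is a hypothetical, self-contained sketch in plain Scala (no Spark; all names — `TrainingCostSketch`, `Point`, etc. — are invented for illustration) of what the training cost measures: the sum of squared distances from each training point to its nearest centroid.

```scala
object TrainingCostSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum of squared distances from each point to its nearest centroid --
  // conceptually what computeCost reports on the training data.
  def trainingCost(points: Seq[Point], centroids: Seq[Point]): Double =
    points.map(p => centroids.map(c => squaredDistance(p, c)).min).sum
}
```

Because this is computed on the training points themselves, it says nothing about generalization — which is why the comment above recommends ClusteringEvaluator on a separate dataset for actual evaluation.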
```diff
   k: Int,
-  numIter: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k, numIter)
+  numIter: Int,
+  @Since("2.4.0") val trainingCost: Double)
```
2.4.0? or 3.0.0?
this PR targets 2.4, see more context at #22756
oh wait. If the final goal is to have a consistent ML API in 3.0, do we have to put this new API in 2.4?
I wouldn't consider this as mandatory. I think what is mandatory to target for 2.4 is deprecating the computeCost method. But I think this is nice to have, since it is the non-deprecated way for users to access this information. In the PR related to KMeans, there was quite a discussion about it and it was considered to be part of the deprecation change.
SGTM
Ok. I see. Then looks like it is nice to have this in 2.4.
Since #22756 is reverted, are we going to change the Since version for this, too?
```scala
assert(formatVersion == thisFormatVersion)
val rootId = (metadata \ "rootId").extract[Int]
val distanceMeasure = (metadata \ "distanceMeasure").extract[String]
val trainingCost = (metadata \ "trainingCost").extract[Double]
```
hmm, can this read an old model from a previous version?
- Could you avoid modifying the model-loading code in the "mllib" package, and instead modify the code in the "ml" package, i.e., the class `ml.clustering.BisectingKMeansModel.BisectingKMeansModelReader`? You can reference the `KMeans` code: `ml.clustering.KMeansModel.KMeansModelReader`. (Don't let `ml.clustering.BisectingKMeansModel.BisectingKMeansModelReader` call `mllib.clustering.BisectingKMeansModel.load`.)
- And, +1 with what @viirya mentioned: we should keep model-loading compatibility, i.e., add a version check (when >= 2.4) and only then load "training cost". Note that these should be added in `ml.clustering.BisectingKMeansModel.BisectingKMeansModelReader`.
- And, could you also add a version check (when >= 2.4) before loading "training cost" into `ml.clustering.KMeansModel.KMeansModelReader`?
Do other models have this problem? I was told that this change just follows what we did for other models before.
Thank you all for the comments and sorry for the late answer. Just a couple of notes on your comments @WeichenXu123 (I may be missing something, so please correct me if I am wrong):
- I checked `ml.clustering.KMeansModel.KMeansModelReader` and it doesn't store anything related to the summary. The summary is not recovered after loading the model, so I don't see any reason why we should modify the read/load of `ml.clustering.BisectingKMeansModel.BisectingKMeansModelReader`;
- this model can read from previous versions, since this is version "2.0", which was introduced for Spark 2.4; for previous versions, we read/write version "1.0"; the version check method for versioning is used only for the `ml` package, not in `mllib`, where we have this versioning approach;
I was told that this change just follows what we did for other models before.
@cloud-fan Yes, let me link the PR for KMeans doing the same, which is: #20629.
Just a final comment which I hope clarifies the source of the confusion here and the reason for the above comments by @viirya and @WeichenXu123: trainingCost here is a member of the summary, not a parameter of the model, for ml.clustering.BisectingKMeansModel. Instead, it is a member of the model for mllib.clustering.BisectingKMeansModel (we have no summary notion there...). So storing it for mllib is needed in order for the model read back after persisting to be the same as the original one (I think it doesn't pass UTs otherwise). Storing it for ml, instead, is not needed, because the summary is not persisted. If we want to persist the summary for the ml package too, I think we had best create a separate JIRA and PR for it.
Hope this clarifies (sorry for being so verbose).
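The structural distinction drawn above can be sketched in a few lines of plain Scala (all class names here are invented stand-ins, not the actual Spark classes): in the ml-style design the cost hangs off a transient summary that is lost on save/load, while in the mllib-style design it is part of the model's persisted state.

```scala
// ml-style: the training cost lives on a transient summary that is not persisted,
// so save/load never has to carry it.
class SummarySketch(val trainingCost: Double)
class MlModelSketch {
  @transient var summary: Option[SummarySketch] = None
}

// mllib-style: the training cost is a field of the model itself, so it must be
// written out (and read back) for the loaded model to equal the original.
case class MllibModelSketch(trainingCost: Double)
```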
this model can read from previous versions, since this is version "2.0", which was introduced for Spark 2.4; for previous versions, we read/write version "1.0"; the version check method for versioning is used only for the ml package, not in mllib where we have this versioning approach;
I meant: can it read an old model saved by a previous version, not whether this model can read from previous versions.
In other words, when reading a previous model without "trainingCost" in metadata, can this line work well?
```scala
val trainingCost = (metadata \ "trainingCost").extract[Double]
```
@mgaido91
(I haven't tested this, so correct me if I am wrong.)
I suspect this change breaks backwards compatibility for mllib.
I am suspicious of this line in the load method:
```scala
val trainingCost = (metadata \ "trainingCost").extract[Double]
```
When loading a BisectingKMeansModel saved by an old Spark version (e.g. Spark 2.3.1), because it does not contain the "trainingCost" info, I guess this line will throw an error. (Otherwise, what would it return?)
@WeichenXu123 I have explained it in #22764 (comment). If you don't agree with or believe what I said, you can try it.
A model saved in 2.3.1 will have "1.0" as version. So this code is not run. Every model from 2.4.0 on, will be saved with "2.0" as version, so it will have this stored. As mentioned, please notice that SaveLoadV2_0 was introduced for 2.4.0. Of course, if this commit won't go in 2.4, then I'll have to create a SaveLoadV3_0 in order to support it (or, if we agree that this doesn't need to be restored after model persistence, we can just ignore it).
Hope this clarifies. Thanks.
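The version-dispatch behavior described above can be sketched as follows — a hypothetical, self-contained Scala stub (not the actual Spark loader; `Metadata`, `load`, and the fallback value are invented for illustration). The point is that the loader matches on the (class name, format version) pair from the metadata, so a model written by an older Spark as format "1.0" never reaches the branch that extracts "trainingCost".

```scala
object VersionedLoadSketch {
  // Illustrative stand-in for the JSON metadata the real loader parses.
  case class Metadata(className: String, formatVersion: String,
                      trainingCost: Option[Double])

  val thisClassName = "org.apache.spark.mllib.clustering.BisectingKMeansModel"

  // Dispatch on (class name, format version): only the "2.0" branch
  // ever looks for "trainingCost", so old models load fine.
  def load(metadata: Metadata): Double =
    (metadata.className, metadata.formatVersion) match {
      case (`thisClassName`, "1.0") =>
        0.0 // old models carry no training cost; fall back to a default
      case (`thisClassName`, "2.0") =>
        metadata.trainingCost.getOrElse(sys.error("missing trainingCost"))
      case (name, version) =>
        sys.error(s"Cannot load model: $name, $version")
    }
}
```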
Sorry, now I am more confused...
- I think this line must be an already-existing mistake. It is too weird. I think it should be:
  `case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) => val model = SaveLoadV2_0.load(sc, path)`
- Supposing you're right, then in which place does your code call `SaveLoadV2_0`? I can't find it ... ?
yes @WeichenXu123 , you're right, that line is a bug. Thanks for noticing it. Anyway, that is going to be addressed in another PR and it is not (strictly) related to this one. The other option, as I mentioned, is that if we agree that this doesn't need to be restored after model persistence, we can just ignore it in save/load.
OK. After #22790 merged, I think this PR can work.
WeichenXu123 left a comment:
Several suggestions. Thanks!
Since this PR is a little more complicated than we expected, we decided not to have it in 2.4.0. I'm not sure if we can treat it as a special case and put it in 2.4.1, cc @mengxr. Anyway, the other 2 related PRs (deprecating the API and updating the example) are reverted. We need to think about what we should do if we can only do this in 3.0.
as @WeichenXu123 mentioned in #22764 (comment), I don't see other problems with the current PR. The only thing is: do we want to target it for 2.4 or for 3.0? If the latter, I'll update the PR with the proper deprecation messages. Thanks.
I think it can target 3.0, since 2.4 will be released soon and this PR looks a little complex and needs some time to check carefully.
thanks @WeichenXu123, I updated this PR in order to target 3.0.

Test build #98169 has finished for PR 22764 at commit

any more comments on this? Thanks

cc @dbtsai

@dbtsai any comments on this? thanks.

@mgaido91 I'm on Thanksgiving vacation, will be back to the community to help with code review on Nov 21st. Sorry for the delay.

@dbtsai sure, thanks. Sorry for bothering you. Have a nice vacation!

@dbtsai any luck with this? Thanks.

kindly ping @dbtsai
srowen left a comment:
My only real concern is that old models have a training cost of 0, when it's unknown really. I don't think it's worth making the new value an Option[Double] because it's not really that optional. If we can compute it in more cases, that's great, would be fine. If not, probably still OK, just less ideal.
```scala
 * @param featuresCol Name for column of features in `predictions`.
 * @param k Number of clusters.
 * @param numIter Number of iterations.
 * @param trainingCost Sum of squared distances to the nearest centroid for all points in the
```
Would the cost ever be something besides sum of squares? maybe not, just wondering if we should say here what the cost function is
yes, you're right, let me update it with a more generic "cost", thanks.
```diff
   k: Int,
-  numIter: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k, numIter)
+  numIter: Int,
+  @Since("3.0.0") val trainingCost: Double)
```
I don't think it's a big deal for 3.0, but we lose the constructor without the new param. That's probably OK as the summary kind of needs this value.
this constructor is private so I don't think it is a problem to avoid having the previous one.
```diff
   @Since("1.6.0")
-  def this(root: ClusteringTreeNode) = this(root, DistanceMeasure.EUCLIDEAN)
+  def this(root: ClusteringTreeNode) = this(root, DistanceMeasure.EUCLIDEAN, 0.0)
```
On the other hand, we did preserve this old constructor, and that's fine to keep. The other issue I see here is that the cost is 0, when the cost is really unknown.
yes, because this is public, so users may rely on it. The idea is that this is indeed a "new feature" (previously it was not accessible) and we are not guaranteeing new features in the MLLib API. I just followed the same approach which was used for KMeans.
```diff
   val nodes = data.rdd.map(Data.apply).collect().map(d => (d.index, d)).toMap
   val rootNode = buildTree(rootId, nodes)
-  new BisectingKMeansModel(rootNode, DistanceMeasure.EUCLIDEAN)
+  new BisectingKMeansModel(rootNode, DistanceMeasure.EUCLIDEAN, 0.0)
```
Is it possible to compute the cost here after load rather than setting it to 0?
Although it would be possible, I don't think it is a good idea, as this is only for the old MLLib API and it would introduce a significant performance overhead (a pass over the whole dataset) for information which may not be useful at all...
Would it not just be the same? `rootNode.leafNodes.map(_.cost).sum`? If that cost info is present in the nodes (?) it doesn't need a pass over the data (which indeed doesn't exist at this point). If it's valuable enough to include at all, should this info not be correct where it is in fact available?
yes, you're right, thanks. It is indeed available. I am doing it, thanks.
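The fix agreed on above — summing the costs already stored in the tree's leaves instead of re-scanning the data — can be sketched with a hypothetical, self-contained Scala stub (the tree and node types below are invented stand-ins, not Spark's ClusteringTreeNode):

```scala
object TreeCostSketch {
  sealed trait TreeNode {
    def cost: Double
    def children: Seq[TreeNode]
    // Collect the leaves of the clustering tree.
    def leafNodes: Seq[TreeNode] =
      if (children.isEmpty) Seq(this) else children.flatMap(_.leafNodes)
  }
  final case class Node(cost: Double, children: Seq[TreeNode]) extends TreeNode

  // rootNode.leafNodes.map(_.cost).sum -- the total training cost is
  // recoverable from the tree alone, with no extra pass over the data.
  def totalTrainingCost(root: TreeNode): Double =
    root.leafNodes.map(_.cost).sum
}
```

Only the leaf costs contribute: internal nodes' costs are ignored, since every training point belongs to exactly one leaf cluster.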
I think it looks OK except for this comment?
yes, right, sorry, I missed it, I did it only for the version 2.0 and missed this one. I am updating it, thanks.
Test build #100420 has finished for PR 22764 at commit

Test build #100427 has finished for PR 22764 at commit

retest this please

Test build #100433 has finished for PR 22764 at commit

Test build #100597 has finished for PR 22764 at commit

Merged to master
## What changes were proposed in this pull request?

The PR adds the `trainingCost` value to the `BisectingKMeansSummary`, in order to expose the information retrievable by running `computeCost` on the training dataset. This fills the gap with the `KMeans` implementation.

## How was this patch tested?

Improved UTs.

Closes apache#22764 from mgaido91/SPARK-25765.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>