Skip to content

Conversation

@mgaido91
Copy link
Contributor

What changes were proposed in this pull request?

The PR adds the option to specify a distance measure in BisectingKMeans. Moreover, it introduces the ability to use the cosine distance measure in it.

How was this patch tested?

added UTs + existing UTs

@SparkQA
Copy link

SparkQA commented Feb 13, 2018

Test build #87396 has finished for PR 20600 at commit 343ab72.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait HasDistanceMeasure extends Params

@SparkQA
Copy link

SparkQA commented Feb 13, 2018

Test build #87401 has finished for PR 20600 at commit ed9b55f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 13, 2018

Test build #87400 has finished for PR 20600 at commit 8625bff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 13, 2018

Test build #87399 has finished for PR 20600 at commit 87c3ebd.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Feb 13, 2018

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Feb 14, 2018

Test build #87424 has finished for PR 20600 at commit ed9b55f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

cc @srowen @viirya @zhengruifeng

@mgaido91
Copy link
Contributor Author

mgaido91 commented Mar 1, 2018

@srowen @viirya @zhengruifeng sorry, did you have time to take a look at this? Any thoughts? Thanks.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the API changes, I'd still want @MLnick or @jkbradley or @sethah to see if they see any other issues here. It looks like straightforward work here, but it does refactor a lot and does actually change an API.


private[clustering] object SaveLoadV1_0 {
private val thisFormatVersion = "1.0"
private[clustering] val thisFormatVersion = "1.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this need to be more visible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is used about 30 lines before, in the load method. Previously it was hard-coded the value there

}

def load(sc: SparkContext, path: String, rootId: Int): BisectingKMeansModel = {
def load(sc: SparkContext, path: String): BisectingKMeansModel = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changed the signature of load, as MiMa notes. I'm not sure you can do this? though I'm not so clear on the semantics of this serialization method. Can it be avoided? I'd have thought the older serialization could be left as-is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but this is the load method of the object SaveLoadV1_0 which is marked as private[clustering]. The real load method has no change in the signature, so I don't think this is a problem.

I think that this change can't be avoided. If we don't update this and the user happens to use the mllib implementation, instead of ml, what happens is that he/she can set the distance measure successfully, but if he/she saves the model and loads it, this information is lost and it will default to the euclidean distance measure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, that's OK then. It's a false positive as it's public in the byte code but not really public.

ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasDistanceMeasure.org$apache$spark$ml$param$shared$HasDistanceMeasure$_setter_$distanceMeasure_="),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasDistanceMeasure.getDistanceMeasure"),
ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasDistanceMeasure.distanceMeasure"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.mllib.clustering.BisectingKMeansModel#SaveLoadV1_0.load"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above; this one I think needs to be avoided. The others are issues too but probably acceptable in a minor release. I don't know if people extend these implementations. That said, any way to work around this by providing a dummy implementation of the new methods? I haven't thought through the Scala here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add the methods directly to the BisectingKMeansModel, but then we have the same code duplicated. So I think this solution is cleaner. I saw the same was done for HasOutputCols in 2.3.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Mar 10, 2018

any more comments @srowen ? any comment @MLnick, @jkbradley, @sethah?

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK but let me ping people like @jkbradley @sethah one more time. I think your logic about the MiMa changes is sound but also feel I don't know the subtleties of this part well.

@SparkQA
Copy link

SparkQA commented Mar 12, 2018

Test build #4139 has finished for PR 20600 at commit ed9b55f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Mar 12, 2018

Merged to master

@asfgit asfgit closed this in 567bd31 Mar 12, 2018
mstewart141 pushed a commit to mstewart141/spark that referenced this pull request Mar 24, 2018
## What changes were proposed in this pull request?

The PR adds the option to specify a distance measure in BisectingKMeans. Moreover, it introduces the ability to use the cosine distance measure in it.

## How was this patch tested?

added UTs + existing UTs

Author: Marco Gaido <marcogaido91@gmail.com>

Closes apache#20600 from mgaido91/SPARK-23412.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants