-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-23412][ML] Add cosine distance to BisectingKMeans #20600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #87396 has finished for PR 20600 at commit
|
|
Test build #87401 has finished for PR 20600 at commit
|
|
Test build #87400 has finished for PR 20600 at commit
|
|
Test build #87399 has finished for PR 20600 at commit
|
|
Jenkins, retest this please |
|
Test build #87424 has finished for PR 20600 at commit
|
|
@srowen @viirya @zhengruifeng sorry, did you have time to take a look at this? Any thoughts? Thanks. |
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the API changes, I'd still want @MLnick or @jkbradley or @sethah to see if they see any other issues here. It looks like straightforward work here, but it does refactor a lot and does actually change an API.
|
|
||
| private[clustering] object SaveLoadV1_0 { | ||
| private val thisFormatVersion = "1.0" | ||
| private[clustering] val thisFormatVersion = "1.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this need to be more visible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is used about 30 lines before, in the load method. Previously it was hard-coded the value there
| } | ||
|
|
||
| def load(sc: SparkContext, path: String, rootId: Int): BisectingKMeansModel = { | ||
| def load(sc: SparkContext, path: String): BisectingKMeansModel = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changed the signature of load, as MiMa notes. I'm not sure you can do this? though I'm not so clear on the semantics of this serialization method. Can it be avoided? I'd have thought the older serialization could be left as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but this is the load method of the object SaveLoadV1_0 which is marked as private[clustering]. The real load method has no change in the signature, so I don't think this is a problem.
I think that this change can't be avoided. If we don't update this and the user happens to use the mllib implementation, instead of ml, what happens is that he/she can set the distance measure successfully, but if he/she saves the model and loads it, this information is lost and it will default to the euclidean distance measure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, that's OK then. It's a false positive as it's public in the byte code but not really public.
| ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasDistanceMeasure.org$apache$spark$ml$param$shared$HasDistanceMeasure$_setter_$distanceMeasure_="), | ||
| ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasDistanceMeasure.getDistanceMeasure"), | ||
| ProblemFilters.exclude[InheritedNewAbstractMethodProblem]("org.apache.spark.ml.param.shared.HasDistanceMeasure.distanceMeasure"), | ||
| ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.mllib.clustering.BisectingKMeansModel#SaveLoadV1_0.load"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above; this one I think needs to be avoided. The others are issues too but probably acceptable in a minor release. I don't know if people extend these implementations. That said, any way to work around this by providing a dummy implementation of the new methods? I haven't thought through the Scala here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add the methods directly to the BisectingKMeansModel, but then we have the same code duplicated. So I think this solution is cleaner. I saw the same was done for HasOutputCols in 2.3.
|
any more comments @srowen ? any comment @MLnick, @jkbradley, @sethah? |
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK but let me ping people like @jkbradley @sethah one more time. I think your logic about the MiMa changes is sound but also feel I don't know the subtleties of this part well.
|
Test build #4139 has finished for PR 20600 at commit
|
|
Merged to master |
## What changes were proposed in this pull request? The PR adds the option to specify a distance measure in BisectingKMeans. Moreover, it introduces the ability to use the cosine distance measure in it. ## How was this patch tested? added UTs + existing UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes apache#20600 from mgaido91/SPARK-23412.
What changes were proposed in this pull request?
The PR adds the option to specify a distance measure in BisectingKMeans. Moreover, it introduces the ability to use the cosine distance measure in it.
How was this patch tested?
added UTs + existing UTs