@wangmiao1981
Contributor

What changes were proposed in this pull request?

Add R wrapper for bisecting Kmeans.

As JIRA is down, I will update title to link with corresponding JIRA later.

How was this patch tested?

Add new unit tests.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71280 has finished for PR 16566 at commit 4f88cce.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class BisectingKMeansWrapperWriter(instance: BisectingKMeansWrapper) extends MLWriter
  • class BisectingKMeansWrapperReader extends MLReader[BisectingKMeansWrapper]

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71282 has finished for PR 16566 at commit e7ea299.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

* checking Rd \usage sections ... WARNING
Duplicated \argument entries in documentation object 'fitted':
  'object' 'method' '...'

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71298 has started for PR 16566 at commit 2ad596e.

@wangmiao1981
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71337 has finished for PR 16566 at commit 2ad596e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981 changed the title from "[SparkR]: add bisecting kmeans R wrapper" to "[SPARK-18821][SparkR]: Bisecting k-means wrapper in SparkR" on Jan 13, 2017

Member

don't need data(iris)

Member

I think this should go to @rdname spark.bisectingKmeans

Member

how much is returned from fitted? should this be a list (like in summary) instead of DataFrame?

Contributor Author

fitted in bisectingKmeans is quite similar to fitted in Kmeans. I followed that style to return a dataframe.

Member

instead of 1, find last?

Member

clusterCenters is already an Array?

Contributor Author

It is Array[Vector]. I need flatMap to transform it into Array[Double], which is similar to what Kmeans does.
In addition, the serialization bug about not supporting the Vector type is still open.
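
For readers following the thread, the flattening step described in this reply can be sketched in plain Scala. `Vec` is a hypothetical stand-in for Spark's `Vector` (only `toArray` is modeled), and `CenterFlattening`, `flatten`, and `unflatten` are illustrative names, not code from this PR.

```scala
// Stand-in for Spark's Vector type; only toArray is modeled (assumption).
final case class Vec(values: Array[Double]) {
  def toArray: Array[Double] = values
}

object CenterFlattening {
  // Array[Vec] -> Array[Double]: concatenate the centers one after another.
  def flatten(centers: Array[Vec]): Array[Double] =
    centers.flatMap(_.toArray)

  // Inverse step: split the flat array back into rows of `dim` values each.
  def unflatten(flat: Array[Double], dim: Int): Array[Array[Double]] =
    flat.grouped(dim).toArray
}
```

A flat `Array[Double]` crosses the JVM-to-R boundary without needing Vector serialization; the receiving side only has to know the dimension (or the number of centers) to rebuild the matrix.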

Member

I'd move minDivisibleClusterSize to the end since it's an expert parameter, and add a note in the param doc above (there should be examples in mllib-tree.R)

Contributor Author

I will address comments soon. Now, debugging. Thanks!

Member

it seems method parameter is not optional (there is no default value) - so the example would need to show that as well?

Member

we should probably get some feedback on this - none of the current ML models has a fitted method - should we have this now? or should this be an option/parameter of the summary method?

Contributor Author

spark.kmeans has the fitted method. As these two are similar, I added it to bisecting kmeans.

Member

ah, I didn't recall that. I think that's ok then

Contributor Author

@wangmiao1981 left a comment

Comments addressed.

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71675 has finished for PR 16566 at commit e77cbaf.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71683 has finished for PR 16566 at commit 83b2d6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' @param seed the random seed.
#' @param minDivisibleClusterSize The minimum number of points (if greater than or equal to 1.0)
#' or the minimum proportion of points (if less than 1.0) of a divisible cluster.
#' Note that it is an advanced. The default value should be enough

Member

"Note that it is an advanced."
Do you mean to say "Note that it is an advanced option."?

Member

as far as I recall the term used in spark.ml doc is "expert parameter" - you might want to check how it is explained there.

Contributor Author

In Scala, it uses @group expertParam in the doc comment, and the API documentation shows it under "(expert-only) Parameters". I will change the wording to "it is an expert parameter".
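
The two readings of `minDivisibleClusterSize` in the param doc under discussion (an absolute count when the value is at least 1.0, a proportion otherwise) can be illustrated with a small sketch. `DivisibleClusterSize` and `minPoints` are hypothetical names, and rounding up with `ceil` is an assumption of this sketch rather than something stated in the thread.

```scala
object DivisibleClusterSize {
  // Smallest cluster size (in points) that is still eligible for splitting.
  // Rounding up with ceil is an assumption of this sketch.
  def minPoints(minDivisibleClusterSize: Double, totalPoints: Long): Long = {
    require(minDivisibleClusterSize > 0.0, "minDivisibleClusterSize must be positive")
    if (minDivisibleClusterSize >= 1.0) {
      // Values >= 1.0 are read as an absolute number of points.
      math.ceil(minDivisibleClusterSize).toLong
    } else {
      // Values < 1.0 are read as a proportion of all points.
      math.ceil(minDivisibleClusterSize * totalPoints).toLong
    }
  }
}
```

For example, with 200 points a value of 10.0 would mean clusters need at least 10 points to be split, while 0.25 would mean at least 50.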

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71706 has started for PR 16566 at commit b25fc83.

@wangmiao1981
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71731 has finished for PR 16566 at commit b25fc83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Contributor Author

Close to trigger windows test

@wangmiao1981
Contributor Author

open to trigger windows test

@wangmiao1981 reopened this Jan 20, 2017

#' \dontrun{
#' model <- spark.bisectingKmeans(trainingData, ~ ., 2)
#' fitted.model <- fitted(model, "centers")
#' showDF(fitted.model)

Member

nit: if you end up doing another iteration, I'd suggest moving the example to before setMethod("spark.bisectingKmeans" - our general guideline is to keep examples (and params) in the same place if they have the same rdname (i.e. they go to the same page)

#' The list includes the model's \code{k} (number of cluster centers),
#' \code{coefficients} (model cluster centers),
#' \code{size} (number of data points in each cluster), and \code{cluster}
#' (cluster centers of the transformed data).

Member

let's add is.loaded here

Member

also clarify cluster is NULL if is.loaded = T


lazy val k: Int = bisectingKmeansModel.getK

lazy val cluster: DataFrame = bisectingKmeansModel.summary.cluster

Member

does this have valid values when the model is loaded?

Member

ah this is checked on the R side. could you add a comment here
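
The concern in these two comments can be sketched with stand-in types: a `lazy val` is only evaluated on first access, so a wrapper around a loaded model is safe as long as callers check the loaded flag before reading `cluster`. `FakeModel`, `FakeWrapper`, and `LoadedModelGuard` are hypothetical, and the assumption that a loaded model has no training summary (so the accessor throws) is an assumption of this sketch.

```scala
// Hypothetical stand-ins; none of these types come from Spark or from this PR.
final class FakeModel(summaryOpt: Option[Seq[Int]]) {
  // Mimics a summary accessor that throws when no training summary exists.
  def summary: Seq[Int] =
    summaryOpt.getOrElse(throw new RuntimeException("No training summary available"))
}

final class FakeWrapper(model: FakeModel, val isLoaded: Boolean) {
  // lazy: nothing is evaluated until a caller actually reads `cluster`.
  lazy val cluster: Seq[Int] = model.summary
}

object LoadedModelGuard {
  // Caller-side guard, analogous to returning NULL for `cluster` on the R side
  // when is.loaded is TRUE.
  def clusterOrNone(w: FakeWrapper): Option[Seq[Int]] =
    if (w.isLoaded) None else Some(w.cluster)
}
```

Because the guard lives on the caller's side, a short comment next to the lazy field helps future readers see why the unguarded access is not a bug.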

.fit(data)

val bisectingKmeansModel: BisectingKMeansModel =
pipeline.stages(1).asInstanceOf[BisectingKMeansModel]

Member

let's be consistent here with L38 - either (1) or last

@felixcheung
Member

couple of last comments.
@yanboliang do you have any comment?

@SparkQA

SparkQA commented Jan 23, 2017

Test build #71828 has started for PR 16566 at commit d36c23a.

@wangmiao1981
Contributor Author

Jenkins, retest this please.

@felixcheung
Member

LGTM

@SparkQA

SparkQA commented Jan 23, 2017

Test build #71865 has finished for PR 16566 at commit d36c23a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

merged to master. Let's follow up with programming guide, example and vignettes - would you be able to pick these up too @wangmiao1981 ?

@asfgit closed this in c0ba284 Jan 27, 2017

@wangmiao1981
Contributor Author

@felixcheung I will take care of it very soon. Now I am working on the PR of vector serialization. Also, I started working on the SparkR serialization performance. Thanks!

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Add R wrapper for bisecting Kmeans.

As JIRA is down, I will update title to link with corresponding JIRA later.

## How was this patch tested?

Add new unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes apache#16566 from wangmiao1981/bk.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017