
Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Mar 19, 2016

What changes were proposed in this pull request?

1. Add BisectingKMeans to ml-clustering.md.
2. Add the missing Scala BisectingKMeansExample.
3. Create a new data file data/mllib/sample_kmeans_data.txt.

How was this patch tested?

manual tests
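
For reference, a minimal sketch of the Scala example this PR adds, reconstructed from the description above (the merged code may differ in details; setK(2) and setSeed(1) are illustrative parameter choices):

import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.sql.SparkSession

object BisectingKMeansExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("BisectingKMeansExample").getOrCreate()

    // Loads the new data file in LIBSVM format: a "label" column and a "features" vector column.
    val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

    // Trains a bisecting k-means model.
    val bkm = new BisectingKMeans().setK(2).setSeed(1)
    val model = bkm.fit(dataset)

    // Evaluates clustering by computing the within-set sum of squared errors.
    val cost = model.computeCost(dataset)
    println(s"Within Set Sum of Squared Errors = $cost")

    // Shows the resulting cluster centers.
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}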

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53615 has finished for PR 11844 at commit 80fa565.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng changed the title [Minor][DOC] Add BisectingKMeans to ml-clustering.md [Minor][DOC] Add Scala Example and Description for ml.BisectingKMeans Mar 31, 2016
Contributor

Swap the two lines above; it's better to give the overview of bisecting k-means first.

Contributor Author

ok, I will fix this.

@zhengruifeng zhengruifeng changed the title [Minor][DOC] Add Scala Example and Description for ml.BisectingKMeans [SPARK-14340][DOC] Add Scala Example and Description for ml.BisectingKMeans Apr 2, 2016
@SparkQA

SparkQA commented Apr 2, 2016

Test build #54755 has finished for PR 11844 at commit 3bf4137.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

cc @jkbradley

@zhengruifeng zhengruifeng changed the title [SPARK-14340][DOC] Add Scala Example and Description for ml.BisectingKMeans [SPARK-14340][DOC] Add Scala Example and User Guide for ml.BisectingKMeans May 5, 2016
Contributor

use SparkSession builder pattern as per #12281 (comment)

@MLnick
Contributor

MLnick commented May 5, 2016

@zhengruifeng I would like to update the JavaKMeansExample to be in line with this one and the Scala KMeansExample. Also, I prefer to just read the example data data/mllib/kmeans_data.txt; it makes the examples more succinct.

@MLnick
Contributor

MLnick commented May 5, 2016

Same goes for the JavaBisectingKMeansExample: read the example data.
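
For context, "just reading the example data" in the ml examples at this point looks roughly like the sketch below (assuming a SparkSession named spark and the space-separated kmeans_data.txt format; variable names are illustrative):

import org.apache.spark.mllib.linalg.{VectorUDT, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType}

// Parse each whitespace-separated line into a dense vector wrapped in a Row.
val rowRDD = spark.read.text("data/mllib/kmeans_data.txt").rdd
  .filter(_.getString(0).nonEmpty)
  .map(_.getString(0).split(' ').map(_.toDouble))
  .map(values => Row(Vectors.dense(values)))

// Build a single-column "features" DataFrame for the estimator.
val schema = StructType(Array(StructField("features", new VectorUDT, false)))
val dataset = spark.createDataFrame(rowRDD, schema)

This RDD round-trip is the pattern sethah calls "a bit ugly" below.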

@SparkQA

SparkQA commented May 5, 2016

Test build #57880 has finished for PR 11844 at commit 22ccc7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

@MLnick OK. I will update the BisectingKMeans examples (Python/Scala/Java) in this PR to directly read the data file.

@SparkQA

SparkQA commented May 5, 2016

Test build #57881 has finished for PR 11844 at commit ddd90b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 5, 2016

Test build #57890 has finished for PR 11844 at commit e2aaabd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor

sethah commented May 5, 2016

@MLnick @zhengruifeng I am working on updating the KMeans examples and adding a Python example. I will submit the PR soon.

Contributor

So, in updating the KMeans examples, I hit the same issues that make this code a bit ugly. I came up with:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

val vecAssembler = new VectorAssembler()
  .setInputCols(Array("x", "y", "z"))
  .setOutputCol("features")

val schema = StructType(Array(
  StructField("x", DataTypes.DoubleType),
  StructField("y", DataTypes.DoubleType),
  StructField("z", DataTypes.DoubleType)))

val dataset = vecAssembler.transform(
  spark.read
    .format("csv")
    .option("sep", " ")
    .schema(schema)
    .load("data/mllib/kmeans_data.txt"))

I think it's a little bit better since we don't convert to an RDD in what we claim is the "DataFrame API", but I am not certain what is best. Thoughts? @MLnick

Contributor

@yanboliang yanboliang May 6, 2016

@sethah's proposal is feasible. I'd like to use a dataset in LIBSVM format, which we can then load with spark.read.format("libsvm"). We get features with Vector type and can feed them into model training directly. Although the dataset has a label column, we don't actually use it. This will make the example more succinct.
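
A rough sketch of that proposal (assuming a SparkSession named spark; the LIBSVM data file path is illustrative):

import org.apache.spark.ml.clustering.BisectingKMeans

// LIBSVM loading yields "label" and "features" columns; k-means reads only "features".
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
val model = new BisectingKMeans().setK(2).fit(dataset)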

Contributor

Yeah, I propose we just use an existing LIBSVM example data file, or we can create a new one from kmeans_example_data.


Contributor Author

Good idea. I will create a LIBSVM file containing the data in data/mllib/kmeans_data and use it in the KMeans and BisectingKMeans examples.
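
A hypothetical one-off conversion from the space-separated kmeans_data.txt to LIBSVM format could look like this (file names as discussed above; the label is just the row index, since clustering ignores it):

import java.io.PrintWriter
import scala.io.Source

val out = new PrintWriter("data/mllib/sample_kmeans_data.txt")
Source.fromFile("data/mllib/kmeans_data.txt").getLines()
  .filter(_.trim.nonEmpty)
  .zipWithIndex
  .foreach { case (line, i) =>
    // LIBSVM rows look like: label index1:value1 index2:value2 ... (1-based indices)
    val features = line.trim.split("\\s+").zipWithIndex
      .map { case (v, j) => s"${j + 1}:$v" }
      .mkString(" ")
    out.println(s"$i $features")
  }
out.close()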

@zhengruifeng
Contributor Author

@MLnick @sethah @yanboliang

There is something wrong:

val dataset = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val kmeans = new KMeans().setK(2).setFeaturesCol("features")
val model = kmeans.fit(dataset)
val WSSSE = model.computeCost(dataset)

The WSSSE is always 0, and model.clusterCenters contains only empty zero-length vectors: Array[org.apache.spark.mllib.linalg.Vector] = Array([], []).

And if I run dataset.select("features").show, it fails with:

java.lang.RuntimeException: Error while decoding: scala.MatchError: 31 (of class java.lang.Byte)
createexternalrow(if (isnull(input[0, vector])) null else newInstance(class org.apache.spark.mllib.linalg.VectorUDT).deserialize, StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true))
+- if (isnull(input[0, vector])) null else newInstance(class org.apache.spark.mllib.linalg.VectorUDT).deserialize
   :- isnull(input[0, vector])
   :  +- input[0, vector]
   :- null
   +- newInstance(class org.apache.spark.mllib.linalg.VectorUDT).deserialize
      :- newInstance(class org.apache.spark.mllib.linalg.VectorUDT)
      +- input[0, vector]

@zhengruifeng
Contributor Author

The features column type after spark.read.format("libsvm").load(..) is mllib.SparseVector.
Is it Dataset that cannot handle mllib.SparseVector, or KMeans?

@MLnick
Contributor

MLnick commented May 7, 2016

Seems like a potential issue with the LIBSVM relation - cc @viirya, this seems different from the other bug you fixed!

This works:

scala> val df = Seq((0.0, Vectors.sparse(10, Seq((1, 1.0))))).toDF("label", "features")
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> df.select("features").show
+--------------+
|      features|
+--------------+
|(10,[1],[1.0])|
+--------------+

This throws error:

scala> val df2 = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
df2: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> df2.select("features").show
java.lang.RuntimeException: Error while decoding: scala.MatchError: 19 (of class java.lang.Byte)
createexternalrow(if (isnull(input[0, vector])) null else newInstance(class org.apache.spark.mllib.linalg.VectorUDT).deserialize, StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true))
...

But selecting label and features works:

scala> df2.select("label", "features").show
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
...

@viirya
Member

viirya commented May 7, 2016

@MLnick I will take a look at this issue in the next few days.

@SparkQA

SparkQA commented May 8, 2016

Test build #58081 has finished for PR 11844 at commit d3a0be1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

zhengruifeng commented May 8, 2016

@MLnick I updated the KMeans and BisectingKMeans examples to directly load the data file data/mllib/sample_kmeans_data.txt.
However, the output results are now wrong.

@zhengruifeng zhengruifeng changed the title [SPARK-14340][DOC] Add Scala Example and User Guide for ml.BisectingKMeans [SPARK-14340][DOC] Update Examples and User Guide for ml.BisectingKMeans May 8, 2016
@zhengruifeng zhengruifeng changed the title [SPARK-14340][DOC] Update Examples and User Guide for ml.BisectingKMeans [SPARK-14340][EXAMPLE][DOC] Update Examples and User Guide for ml.BisectingKMeans May 8, 2016
@viirya
Member

viirya commented May 8, 2016

@MLnick I opened a PR #12986 for that.

Contributor

Since we intend to make the ml docs "complete" (as in, as complete as mllib), could we detail the parameters as is done in the doc for the mllib algorithm?

This will probably need to be done across the board (but the doc parity work will be covered as part of SPARK-14815).

@zhengruifeng
Contributor Author

@MLnick Sorry for pulling other people's commits into this; I had to recreate this PR.
I have updated it according to your comments. Thanks!

@SparkQA

SparkQA commented May 9, 2016

Test build #58143 has finished for PR 11844 at commit e6bef11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 9, 2016

Test build #58144 has finished for PR 11844 at commit 6c0a8ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

We don't seem to be listing the params in the ML user guides as was previously done in mllib. I also think this is hard to maintain: what if the default values change or new params are added?

Contributor Author

@MLnick Should the params be listed like this?

Contributor

OK, perhaps leave out the params. We should be consistent with the rest of the ml docs. But they themselves seem inconsistent - in some cases we list e.g. input/output columns, in many other cases we don't.

But we can discuss ml doc consistency on JIRA.

Contributor

@MLnick I hadn't seen your comment suggesting to add params. I'm not super opposed to listing params, but I was leaning in favor of consistency between docs. I agree we can discuss this as another issue.

Contributor

Your point is valid - this would be a bit out of place in the ml docs. I also agree that it does add the burden of keeping params and defaults in sync with the code. There's a good argument that the param doc lives in the API docs (as it does now for ml). Still, there's also a decent argument for having more detailed docs on params in the user guide, though perhaps only for very important ones (like an initialization scheme, or algorithm type, etc.).

Indeed, scikit-learn user guide and API docs seem to follow this style (as an example).

Contributor Author

OK, I will remove it

@wangmiao1981
Contributor

@zhengruifeng Can you make this data file shared with GMM? Once your PR is merged, I can change mine to use your data. Thanks!

@zhengruifeng
Contributor Author

@wangmiao1981 Once this PR is merged, you can directly load the datafile in your PR.

Contributor

@MLnick MLnick May 10, 2016

I'd like to add the "run-with" instruction to the main doc string, e.g.

"""
A simple example demonstrating bisecting k-means clustering.
Run with:
  bin/spark-submit examples/src/main/python/ml/bisecting_k_means_example.py
"""

@zhengruifeng
Contributor Author

@MLnick Thanks. Updated

@SparkQA

SparkQA commented May 11, 2016

Test build #58314 has finished for PR 11844 at commit 77e73c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Could we make this consistent with KMeans? E.g. System.out.println("Within Set Sum of Squared Errors = ...") as per https://github.com/apache/spark/pull/12925/files#diff-a805bb5f394ef27cbb213325676c2007R56

All 3 examples can be updated - see the sketch below.
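
In the Scala example that would be something like (a sketch, assuming variables named model and dataset as in the earlier snippets):

// Keep the evaluation output identical across the KMeans and BisectingKMeans examples.
val cost = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $cost")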

Contributor Author

Good idea. I will do it.

@MLnick
Contributor

MLnick commented May 11, 2016

@zhengruifeng just a couple final comments to make these examples consistent with the KMeans examples. Then I think this is ready.

@MLnick
Contributor

MLnick commented May 11, 2016

LGTM pending jenkins

@SparkQA

SparkQA commented May 11, 2016

Test build #58342 has finished for PR 11844 at commit 8cd45d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented May 11, 2016

Merged to master and branch-2.0. Thanks!

asfgit pushed a commit that referenced this pull request May 11, 2016
…ectingKMeans

## What changes were proposed in this pull request?

1, add BisectingKMeans to ml-clustering.md
2, add the missing Scala BisectingKMeansExample
3, create a new datafile `data/mllib/sample_kmeans_data.txt`

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11844 from zhengruifeng/doc_bkm.

(cherry picked from commit cef73b5)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
@asfgit asfgit closed this in cef73b5 May 11, 2016
@zhengruifeng zhengruifeng deleted the doc_bkm branch May 11, 2016 08:06