
Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Mar 19, 2016

What changes were proposed in this pull request?

1. Add BisectingKMeans to ml-clustering.md.
2. Add the missing Scala BisectingKMeansExample.
3. Create a new data file data/mllib/sample_kmeans_data.txt.

How was this patch tested?

manual tests
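
For reference, a minimal sketch of the Scala example this PR adds, reconstructed from the description above (the merged code may differ in details; setK(2) and setSeed(1) are illustrative parameter choices):

import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.sql.SparkSession

object BisectingKMeansExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("BisectingKMeansExample").getOrCreate()

    // Loads the new data file in LIBSVM format: a "label" column and a "features" vector column.
    val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

    // Trains a bisecting k-means model.
    val bkm = new BisectingKMeans().setK(2).setSeed(1)
    val model = bkm.fit(dataset)

    // Evaluates clustering by computing the within-set sum of squared errors.
    val cost = model.computeCost(dataset)
    println(s"Within Set Sum of Squared Errors = $cost")

    // Shows the resulting cluster centers.
    println("Cluster Centers: ")
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}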

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53615 has finished for PR 11844 at commit 80fa565.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng changed the title [Minor][DOC] Add BisectingKMeans to ml-clustering.md [Minor][DOC] Add Scala Example and Description for ml.BisectingKMeans Mar 31, 2016
Contributor

Swap the two lines above; it's better to give the overview of bisecting k-means first.

Contributor Author

ok, I will fix this.

@zhengruifeng zhengruifeng changed the title [Minor][DOC] Add Scala Example and Description for ml.BisectingKMeans [SPARK-14340][DOC] Add Scala Example and Description for ml.BisectingKMeans Apr 2, 2016
@SparkQA

SparkQA commented Apr 2, 2016

Test build #54755 has finished for PR 11844 at commit 3bf4137.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

cc @jkbradley

@zhengruifeng zhengruifeng changed the title [SPARK-14340][DOC] Add Scala Example and Description for ml.BisectingKMeans [SPARK-14340][DOC] Add Scala Example and User Guide for ml.BisectingKMeans May 5, 2016
Contributor

use SparkSession builder pattern as per #12281 (comment)

@MLnick
Contributor

MLnick commented May 5, 2016

@zhengruifeng I would like to update the JavaKMeansExample to be in line with this one and the Scala KMeansExample. Also, I prefer to just read the example data data/mllib/kmeans_data.txt; it makes the examples more succinct.

@MLnick
Contributor

MLnick commented May 5, 2016

Same goes for the JavaBisectingKMeansExample: read the example data.
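
For context, "just reading the example data" in the ml examples at this point looks roughly like the sketch below (assuming a SparkSession named spark and the space-separated kmeans_data.txt format; variable names are illustrative):

import org.apache.spark.mllib.linalg.{VectorUDT, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType}

// Parse each whitespace-separated line into a dense vector wrapped in a Row.
val rowRDD = spark.read.text("data/mllib/kmeans_data.txt").rdd
  .filter(_.getString(0).nonEmpty)
  .map(_.getString(0).split(' ').map(_.toDouble))
  .map(values => Row(Vectors.dense(values)))

// Build a single-column "features" DataFrame for the estimator.
val schema = StructType(Array(StructField("features", new VectorUDT, false)))
val dataset = spark.createDataFrame(rowRDD, schema)

This RDD round-trip is the pattern sethah calls "a bit ugly" below.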

@SparkQA

SparkQA commented May 5, 2016

Test build #57880 has finished for PR 11844 at commit 22ccc7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

@MLnick OK. I will update the BisectingKMeans examples (Python/Scala/Java) in this PR to directly read the data file.

@SparkQA

SparkQA commented May 5, 2016

Test build #57881 has finished for PR 11844 at commit ddd90b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 5, 2016

Test build #57890 has finished for PR 11844 at commit e2aaabd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor

sethah commented May 5, 2016

@MLnick @zhengruifeng I am working on updating the KMeans examples and adding a Python example. I will submit the PR soon.

Contributor

So, in updating the KMeans examples, I hit the same issues that make this code a bit ugly. I came up with:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

val vecAssembler = new VectorAssembler()
  .setInputCols(Array("x", "y", "z"))
  .setOutputCol("features")

val schema = StructType(Array(
  StructField("x", DataTypes.DoubleType),
  StructField("y", DataTypes.DoubleType),
  StructField("z", DataTypes.DoubleType)))

val dataset = vecAssembler.transform(
  spark.read
    .format("csv")
    .option("sep", " ")
    .schema(schema)
    .load("data/mllib/kmeans_data.txt"))

I think it's a little bit better since we don't convert to an RDD in what we claim is the "DataFrame API", but I am not certain what is best. Thoughts? @MLnick

Contributor

@yanboliang yanboliang May 6, 2016

@sethah's proposal is feasible. I'd like to use a dataset in LIBSVM format, which we can then load with spark.read.format("libsvm"). We get features with Vector type and can feed them into model training directly. Although the dataset has a label column, we don't actually use it. This will make the example more succinct.
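
A rough sketch of that proposal (assuming a SparkSession named spark; the LIBSVM data file path is illustrative):

import org.apache.spark.ml.clustering.BisectingKMeans

// LIBSVM loading yields "label" and "features" columns; k-means reads only "features".
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
val model = new BisectingKMeans().setK(2).fit(dataset)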

Contributor

Yeah, I propose we just use an existing LIBSVM example data file, or we can create a new one from kmeans_example_data.


Contributor Author

Good idea. I will create a LIBSVM file containing the data in data/mllib/kmeans_data and use it in the KMeans and BisectingKMeans examples.
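
A hypothetical one-off conversion from the space-separated kmeans_data.txt to LIBSVM format could look like this (file names as discussed above; the label is just the row index, since clustering ignores it):

import java.io.PrintWriter
import scala.io.Source

val out = new PrintWriter("data/mllib/sample_kmeans_data.txt")
Source.fromFile("data/mllib/kmeans_data.txt").getLines()
  .filter(_.trim.nonEmpty)
  .zipWithIndex
  .foreach { case (line, i) =>
    // LIBSVM rows look like: label index1:value1 index2:value2 ... (1-based indices)
    val features = line.trim.split("\\s+").zipWithIndex
      .map { case (v, j) => s"${j + 1}:$v" }
      .mkString(" ")
    out.println(s"$i $features")
  }
out.close()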

@zhengruifeng
Contributor Author

@MLnick @sethah @yanboliang

There is something wrong:

val dataset = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val kmeans = new KMeans().setK(2).setFeaturesCol("features")
val model = kmeans.fit(dataset)
val WSSSE = model.computeCost(dataset)

The WSSSE is always 0, and model.clusterCenters contains only empty zero-length vectors: Array[org.apache.spark.mllib.linalg.Vector] = Array([], []).

And if I run dataset.select("features").show, it fails with:

java.lang.RuntimeException: Error while decoding: scala.MatchError: 31 (of class java.lang.Byte)
createexternalrow(if (isnull(input[0, vector])) null else newInstance(class org.apache.spark.mllib.linalg.VectorUDT).deserialize, StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true))
+- if (isnull(input[0, vector])) null else newInstance(class org.apache.spark.mllib.linalg.VectorUDT).deserialize
   :- isnull(input[0, vector])
   :  +- input[0, vector]
   :- null
   +- newInstance(class org.apache.spark.mllib.linalg.VectorUDT).deserialize
      :- newInstance(class org.apache.spark.mllib.linalg.VectorUDT)
      +- input[0, vector]

@zhengruifeng
Contributor Author

The features column type after spark.read.format("libsvm").load(..) is mllib.SparseVector.
Is it Dataset that cannot handle mllib.SparseVector, or KMeans?

@MLnick
Contributor

MLnick commented May 7, 2016

Seems like a potential issue with the LIBSVM relation - cc @viirya, this seems different from the other bug you fixed!

This works:

scala> val df = Seq((0.0, Vectors.sparse(10, Seq((1, 1.0))))).toDF("label", "features")
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> df.select("features").show
+--------------+
|      features|
+--------------+
|(10,[1],[1.0])|
+--------------+

This throws error:

scala> val df2 = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
df2: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> df2.select("features").show
java.lang.RuntimeException: Error while decoding: scala.MatchError: 19 (of class java.lang.Byte)
createexternalrow(if (isnull(input[0, vector])) null else newInstance(class org.apache.spark.mllib.linalg.VectorUDT).deserialize, StructField(features,org.apache.spark.mllib.linalg.VectorUDT@f71b0bce,true))
...

But selecting label and features works:

scala> df2.select("label", "features").show
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
...

@viirya
Member

viirya commented May 7, 2016

@MLnick I will take a look at this issue in the next few days.

@SparkQA

SparkQA commented May 8, 2016

Test build #58081 has finished for PR 11844 at commit d3a0be1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

zhengruifeng commented May 8, 2016

@MLnick I updated the KMeans and BisectingKMeans examples to directly load the data file data/mllib/sample_kmeans_data.txt.
However, the output results are now wrong.

@zhengruifeng zhengruifeng changed the title [SPARK-14340][DOC] Add Scala Example and User Guide for ml.BisectingKMeans [SPARK-14340][DOC] Update Examples and User Guide for ml.BisectingKMeans May 8, 2016
@zhengruifeng zhengruifeng changed the title [SPARK-14340][DOC] Update Examples and User Guide for ml.BisectingKMeans [SPARK-14340][EXAMPLE][DOC] Update Examples and User Guide for ml.BisectingKMeans May 8, 2016
@viirya
Member

viirya commented May 8, 2016

@MLnick I opened a PR #12986 for that.

Contributor

Since we intend to make the ml docs "complete" (as in, as complete as mllib), could we detail the parameters as is done in the doc for the mllib algorithm?

This will probably need to be done across the board (but the doc parity work will be covered as part of SPARK-14815).

@zhengruifeng
Contributor Author

@MLnick Sorry for pulling other people's commits into this; I had to recreate this PR.
I have updated it according to your comments. Thanks!

@SparkQA

SparkQA commented May 9, 2016

Test build #58143 has finished for PR 11844 at commit e6bef11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 9, 2016

Test build #58144 has finished for PR 11844 at commit 6c0a8ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

We don't seem to be listing the params in the ML user guides as was previously done in mllib. I also think this is hard to maintain: what if the default values change or new params are added?

Contributor Author

@MLnick Should the params be listed like this?

Contributor

OK, perhaps leave out the params. We should be consistent with the rest of the ml docs. But they themselves seem inconsistent - in some cases we list e.g. input/output columns, in many other cases we don't.

But we can discuss ml doc consistency on JIRA.

Contributor

@MLnick I hadn't seen your comment suggesting to add params. I'm not super opposed to listing params, but I was leaning in favor of consistency between docs. I agree we can discuss this as another issue.

Contributor

Your point is valid - this would be a bit out of place in the ml docs. I also agree that it does add the burden of keeping params and defaults in sync with the code. There's a good argument that the param doc lives in the API docs (as it does now for ml). Still, there's also a decent argument for having more detailed docs on params in the user guide, though perhaps only for very important ones (like an initialization scheme, or algorithm type, etc.).

Indeed, scikit-learn user guide and API docs seem to follow this style (as an example).

Contributor Author

OK, I will remove it

@wangmiao1981
Contributor

@zhengruifeng Can you make this data file shared with GMM? Once your PR is merged, I can change mine to use your data. Thanks!

@zhengruifeng
Contributor Author

@wangmiao1981 Once this PR is merged, you can directly load the datafile in your PR.

Contributor

@MLnick MLnick May 10, 2016

I'd like to add the "run-with" instruction to the main doc string, e.g.

"""
A simple example demonstrating bisecting k-means clustering.
Run with:
  bin/spark-submit examples/src/main/python/ml/bisecting_k_means_example.py
"""

@zhengruifeng
Contributor Author

@MLnick Thanks. Updated

@SparkQA

SparkQA commented May 11, 2016

Test build #58314 has finished for PR 11844 at commit 77e73c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Could we make this consistent with KMeans? E.g. System.out.println("Within Set Sum of Squared Errors = ...") as per https://github.com/apache/spark/pull/12925/files#diff-a805bb5f394ef27cbb213325676c2007R56

All 3 examples can be updated - see the sketch below.
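
In the Scala example that would be something like (a sketch, assuming variables named model and dataset as in the earlier snippets):

// Keep the evaluation output identical across the KMeans and BisectingKMeans examples.
val cost = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $cost")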

Contributor Author

Good idea. I will do it.

@MLnick
Contributor

MLnick commented May 11, 2016

@zhengruifeng just a couple final comments to make these examples consistent with the KMeans examples. Then I think this is ready.

@MLnick
Contributor

MLnick commented May 11, 2016

LGTM pending jenkins

@SparkQA

SparkQA commented May 11, 2016

Test build #58342 has finished for PR 11844 at commit 8cd45d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented May 11, 2016

Merged to master and branch-2.0. Thanks!

asfgit pushed a commit that referenced this pull request May 11, 2016
…ectingKMeans

## What changes were proposed in this pull request?

1, add BisectingKMeans to ml-clustering.md
2, add the missing Scala BisectingKMeansExample
3, create a new datafile `data/mllib/sample_kmeans_data.txt`

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11844 from zhengruifeng/doc_bkm.

(cherry picked from commit cef73b5)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
@asfgit asfgit closed this in cef73b5 May 11, 2016
@zhengruifeng zhengruifeng deleted the doc_bkm branch May 11, 2016 08:06