[SPARK-14340][EXAMPLE][DOC] Update Examples and User Guide for ml.BisectingKMeans #11844
Conversation
Test build #53615 has finished for PR 11844 at commit
docs/ml-clustering.md
Swap the two lines above; it's better to give the overview of bisecting k-means first.
OK, I will fix this.
Test build #54755 has finished for PR 11844 at commit
cc @jkbradley
use SparkSession builder pattern as per #12281 (comment)
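For context, a minimal sketch of that builder pattern (the appName value is just illustrative):

import org.apache.spark.sql.SparkSession

// Create (or reuse) a SparkSession instead of constructing SparkConf/SparkContext directly.
val spark = SparkSession
  .builder
  .appName("BisectingKMeansExample")
  .getOrCreate()

// ... example body ...

spark.stop()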
@zhengruifeng I would like to update the
Same goes for the
Test build #57880 has finished for PR 11844 at commit
@MLnick OK. I will update the BisectingKMeans examples (py/scala/java) in this PR to directly read the data file.
Test build #57881 has finished for PR 11844 at commit
Test build #57890 has finished for PR 11844 at commit
@MLnick @zhengruifeng I am working on updating the KMeans examples and adding Python. I will submit the PR soon.
So, in updating the KMeans examples, I had the same issues that make this code a bit ugly. I came up with:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

// Assumes an existing SparkSession named `spark`.
val vecAssembler = new VectorAssembler()
  .setInputCols(Array("x", "y", "z"))
  .setOutputCol("features")
val schema = StructType(Array(
  StructField("x", DataTypes.DoubleType),
  StructField("y", DataTypes.DoubleType),
  StructField("z", DataTypes.DoubleType)))
val dataset = vecAssembler.transform(
  spark.read
    .format("csv")
    .option("sep", " ")
    .schema(schema)
    .load("data/mllib/kmeans_data.txt"))

I think it's a little bit better, since we don't convert to an RDD in what we claim is the "DataFrame API", but I am not certain what is best. Thoughts? @MLnick
@sethah 's proposal is feasible. I'd like to use a dataset in LIBSVM format; then we can load it with spark.read.format("libsvm"). We get features as a Vector column and can feed them into model training directly. Although the dataset has a label column, we don't actually use it. This will make the example more succinct.
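For context, a minimal sketch of that approach (assuming an existing SparkSession named spark; the file name matches the sample_kmeans_data.txt file this PR eventually adds, and k=2 and the seed are illustrative values):

import org.apache.spark.ml.clustering.BisectingKMeans

// Loading LIBSVM data gives a DataFrame with "label" and "features" (Vector) columns;
// the label column is simply ignored by the clustering estimator.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Train a bisecting k-means model on the "features" column.
val bkm = new BisectingKMeans().setK(2).setSeed(1L)
val model = bkm.fit(dataset)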
Yeah, I propose we just use an existing LIBSVM example data file, or we can
create a new one from kmeans_example_data.
Good idea. I will create a LIBSVM file containing the data in data/mllib/kmeans_data and use it in the KMeans and BisectingKMeans examples.
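A sketch of one way such a LIBSVM file could be generated from the dense data (not necessarily how the file in this PR was produced; the dummy label, the output directory name, and the existing SparkSession named spark are assumptions):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Parse the space-separated dense vectors and attach a dummy label,
// since saveAsLibSVMFile expects labeled points; clustering ignores the label anyway.
val labeled = spark.sparkContext
  .textFile("data/mllib/kmeans_data.txt")
  .map(line => LabeledPoint(0.0, Vectors.dense(line.split(' ').map(_.toDouble))))

// Writes part files in LIBSVM format under the given directory.
MLUtils.saveAsLibSVMFile(labeled, "kmeans_data_libsvm")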
There is something wrong: the WSSSE is always and if I run
The features type after
Seems like a potential issue with the libsvm relation - cc @viirya, this seems different from the other bug you fixed! This works: This throws an error: But selecting label and features works:
@MLnick I will take a look at this issue in the next few days.
Test build #58081 has finished for PR 11844 at commit
@MLnick I updated the KMeans and BisectingKMeans PRs to directly load the data file.
docs/ml-clustering.md
Since we intend to make the ml docs "complete" (as in, as complete as mllib), could we detail the parameters as is done in the doc for the mllib algorithm?
This will probably need to be done across the board (but the doc parity work will be covered as part of SPARK-14815).
@MLnick Sorry to involve other people's commits in this. I had to recreate this PR.
Test build #58143 has finished for PR 11844 at commit
Test build #58144 has finished for PR 11844 at commit
docs/ml-clustering.md
We don't seem to be listing the params in the ML user guides as was previously done in mllib. I also think this is hard to maintain: what if the default values change or new params are added?
@MLnick Should the params be listed like this?
OK, perhaps leave out the params. We should be consistent with the rest of the ml docs. But they themselves seem inconsistent - in some cases we list e.g. input/output columns, and in many other cases we don't.
But we can discuss ml doc consistency on JIRA.
@MLnick I hadn't seen your comment suggesting to add params. I'm not super opposed to listing params, but I was leaning in favor of consistency between docs. I agree we can discuss this as another issue.
Your point is valid - this would be a bit out of place in the ml docs. I also agree that it does add a burden of keeping params and defaults in sync with the code. There's a good argument that the param doc lives in the API docs (as it does now for ml). Still, there's also a decent argument for having more detailed docs on params in the user guide, though perhaps only for very important ones (like an initialization scheme, or algorithm type, etc.).
Indeed, scikit-learn user guide and API docs seem to follow this style (as an example).
OK, I will remove it
@zhengruifeng Can you make it shared with GMM? Once your PR is merged, I can change mine to use your data. Thanks!
@wangmiao1981 Once this PR is merged, you can directly load the data file in your PR.
I'd like to add the "run-with" instruction to the main doc string, e.g.
"""
A simple example demonstrating bisecting k-means clustering.
Run with:
bin/spark-submit examples/src/main/python/ml/bisecting_k_means_example.py
"""
@MLnick Thanks. Updated
Test build #58314 has finished for PR 11844 at commit
Could we make this consistent with KMeans? e.g. the System.out.println("Within Set Sum of Squared Errors = output, as per https://github.com/apache/spark/pull/12925/files#diff-a805bb5f394ef27cbb213325676c2007R56
All 3 examples can be updated.
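For reference, a sketch of the consistent output in the Scala example, assuming model and dataset are defined as in the example (the wording follows the KMeans example linked above):

// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $WSSSE")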
Good idea. I will do it.
@zhengruifeng just a couple final comments to make these examples consistent with the KMeans examples. Then I think this is ready.
LGTM pending jenkins
Test build #58342 has finished for PR 11844 at commit
Merged to master and branch-2.0. Thanks!
[SPARK-14340][EXAMPLE][DOC] Update Examples and User Guide for ml.BisectingKMeans

## What changes were proposed in this pull request?
1. Add BisectingKMeans to ml-clustering.md
2. Add the missing Scala BisectingKMeansExample
3. Create a new data file `data/mllib/sample_kmeans_data.txt`

## How was this patch tested?
Manual tests.

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11844 from zhengruifeng/doc_bkm.

(cherry picked from commit cef73b5)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>