From e476b0a6a8270d937255b0334879aa065cbb22ec Mon Sep 17 00:00:00 2001
From: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Date: Wed, 25 Nov 2015 17:50:58 +0900
Subject: [PATCH] [SPARK-6518][MLlib][DOC] Add example code and user guide for
 bisecting k-means

---
 data/mllib/sample_bisecting_kmeans_data.txt | 20 +++++++++
 docs/mllib-clustering.md                    | 47 +++++++++++++++++++++
 docs/mllib-guide.md                         |  1 +
 3 files changed, 68 insertions(+)
 create mode 100644 data/mllib/sample_bisecting_kmeans_data.txt
diff --git a/data/mllib/sample_bisecting_kmeans_data.txt b/data/mllib/sample_bisecting_kmeans_data.txt
new file mode 100644
index 000000000000..ffc5de0d37f4
--- /dev/null
+++ b/data/mllib/sample_bisecting_kmeans_data.txt
@@ -0,0 +1,20 @@
+6.4,2.7,5.3,1.9
+5.8,2.6,4,1.2
+4.5,2.3,1.3,0.3
+5.7,2.8,4.1,1.3
+4.4,3,1.3,0.2
+4.4,2.9,1.4,0.2
+5.2,3.5,1.5,0.2
+7.1,3,5.9,2.1
+6,2.2,4,1
+5.1,3.7,1.5,0.4
+5.6,2.5,3.9,1.1
+5.1,3.5,1.4,0.3
+5.7,3,4.2,1.2
+5,3.6,1.4,0.2
+4.6,3.6,1,0.2
+5,3.5,1.3,0.3
+6.7,2.5,5.8,1.8
+5,2.3,3.3,1
+6.9,3.2,5.7,2.3
+6.8,3.2,5.9,2.3
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 8fbced6c87d9..fb1a6cfc8288 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -718,6 +718,53 @@ sameModel = LDAModel.load(sc, "myModelPath")
 
 </div>
 
+## Bisecting k-means
+
+Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering).
+Hierarchical clustering is one of the most commonly used methods of cluster analysis; it seeks to build a hierarchy of clusters.
+Strategies for hierarchical clustering generally fall into two types:
+
+- Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
+- Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
+
+Bisecting k-means is a divisive algorithm.
+MLlib takes the divisive approach because an agglomerative algorithm is difficult to implement as a distributed algorithm on Spark.
+The implementation in MLlib has the following parameters:
+
+* *k*: the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.
+* *maxIterations*: the maximum number of k-means iterations used to split clusters (default: 20)
+* *minDivisibleClusterSize*: the minimum number of points (if >= 1.0) or the minimum proportion of points (if < 1.0) of a divisible cluster (default: 1)
+* *seed*: a random seed (default: hash value of the class name)
+
+**Examples**
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+The following code snippets can be executed in `spark-shell`.
+
+Refer to the [`BisectingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans) and [`BisectingKMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel) for details on the API.
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.{BisectingKMeans, BisectingKMeansModel}
+import org.apache.spark.mllib.linalg.Vectors
+
+// Load and parse the data
+val data = sc.textFile("data/mllib/sample_bisecting_kmeans_data.txt")
+val parsedData = data.map(s => Vectors.dense(s.trim.split(',').map(_.toDouble)))
+
+// Cluster the data into three clusters using BisectingKMeans
+val model = new BisectingKMeans().setK(3).run(parsedData)
+
+// Output the compute cost and the cluster centers
+println(s"Compute Cost: ${model.computeCost(parsedData)}")
+model.clusterCenters.zipWithIndex.foreach { case (center, idx) =>
+  println(s"Cluster Center ${idx}: ${center}")
+}
+{% endhighlight %}
+</div>
+
+</div>
+
 ## Streaming k-means
 
 When data arrive in a stream, we may want to estimate clusters dynamically,
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 91e50ccfecec..390d96aaa624 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -48,6 +48,7 @@ We list major functionality from both below, with links to detailed guides.
   * [Gaussian mixture](mllib-clustering.html#gaussian-mixture)
   * [power iteration clustering (PIC)](mllib-clustering.html#power-iteration-clustering-pic)
   * [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda)
+  * [bisecting k-means](mllib-clustering.html#bisecting-k-means)
   * [streaming k-means](mllib-clustering.html#streaming-k-means)
 * [Dimensionality reduction](mllib-dimensionality-reduction.html)
   * [singular value decomposition (SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd)