20 changes: 20 additions & 0 deletions data/mllib/sample_bisecting_kmeans_data.txt
@@ -0,0 +1,20 @@
6.4,2.7,5.3,1.9
Member:
Does it work to use the existing kmeans_data instead?

Contributor Author:
Yes, we can. The reason I added a new data file was that it felt a little strange to me to reuse the k-means data here. So I'll remove the file and use the sample k-means data instead.

5.8,2.6,4,1.2
4.5,2.3,1.3,0.3
5.7,2.8,4.1,1.3
4.4,3,1.3,0.2
4.4,2.9,1.4,0.2
5.2,3.5,1.5,0.2
7.1,3,5.9,2.1
6,2.2,4,1
5.1,3.7,1.5,0.4
5.6,2.5,3.9,1.1
5.1,3.5,1.4,0.3
5.7,3,4.2,1.2
5,3.6,1.4,0.2
4.6,3.6,1,0.2
5,3.5,1.3,0.3
6.7,2.5,5.8,1.8
5,2.3,3.3,1
6.9,3.2,5.7,2.3
6.8,3.2,5.9,2.3
47 changes: 47 additions & 0 deletions docs/mllib-clustering.md
@@ -718,6 +718,53 @@ sameModel = LDAModel.load(sc, "myModelPath")

</div>

## Bisecting k-means

Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering).
Hierarchical clustering is one of the most commonly used methods of cluster analysis, and it seeks to build a hierarchy of clusters.
Strategies for hierarchical clustering generally fall into two types:

- Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy; a small illustrative sketch of this strategy follows below.
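
To make the divisive ("bisecting") idea concrete, here is a small, purely local Scala sketch of one common variant of the top-down strategy: start with every point in a single cluster, then repeatedly split the largest remaining leaf with a two-way k-means until the desired number of leaves is reached. This is only an illustration of the general idea, not MLlib's distributed implementation, and all names in it are made up for the example.

{% highlight scala %}
// Illustrative only: a local, non-distributed sketch of the bisecting strategy.
object BisectingSketch {
  type Point = Array[Double]

  private def distSq(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  private def mean(points: Seq[Point]): Point =
    points.map(_.toSeq).transpose.map(col => col.sum / points.size).toArray

  // A tiny k-means with k = 2, used only to split one cluster in two.
  private def split(points: Seq[Point], iters: Int): (Seq[Point], Seq[Point]) = {
    var centers = (points.head, points.last)
    var halves = points.partition(p => distSq(p, centers._1) <= distSq(p, centers._2))
    for (_ <- 1 until iters if halves._1.nonEmpty && halves._2.nonEmpty) {
      centers = (mean(halves._1), mean(halves._2))
      halves = points.partition(p => distSq(p, centers._1) <= distSq(p, centers._2))
    }
    halves
  }

  // Bisecting loop: repeatedly split the largest leaf until k leaves exist.
  def bisect(points: Seq[Point], k: Int, iters: Int = 20): Seq[Seq[Point]] = {
    var leaves: Seq[Seq[Point]] = Seq(points)
    while (leaves.size < k && leaves.exists(_.size > 1)) {
      val target = leaves.maxBy(_.size)
      val (left, right) = split(target, iters)
      if (left.isEmpty || right.isEmpty) return leaves // nothing left to divide
      leaves = leaves.filterNot(_ eq target) :+ left :+ right
    }
    leaves
  }
}
{% endhighlight %}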

Bisecting k-means is a divisive algorithm; a divisive approach was chosen because agglomerative algorithms are too difficult to implement efficiently as distributed algorithms on Spark.
The implementation in MLlib has the following parameters:

* *k* the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.
* *maxIterations* the max number of k-means iterations to split clusters (default: 20)
* *minDivisibleClusterSize* the minimum number of points (if >= 1.0) or the minimum proportion of points (if < 1.0) of a divisible cluster (default: 1)
* *seed* a random seed (default: hash value of the class name)
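
As a rough illustration of how these parameters map onto the API, a minimal configuration sketch is shown below; it assumes the usual builder-style setters (`setK`, `setMaxIterations`, `setMinDivisibleClusterSize`, `setSeed`) from the Scala API and simply spells out the documented default values, plus an explicit seed.

{% highlight scala %}
import org.apache.spark.mllib.clustering.BisectingKMeans

// Sketch only: explicitly setting each documented parameter.
val bkm = new BisectingKMeans()
  .setK(4)                          // desired number of leaf clusters
  .setMaxIterations(20)             // k-means iterations used for each split
  .setMinDivisibleClusterSize(1.0)  // minimum size (count or proportion) of a divisible cluster
  .setSeed(42L)                     // a fixed seed instead of the class-name hash
{% endhighlight %}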

**Examples**

<div class="codetabs">
<div data-lang="scala" markdown="1">
The following code snippets can be executed in `spark-shell`.

Refer to the [`BisectingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans) and [`BisectingKMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel) for details on the API.

{% highlight scala %}
import org.apache.spark.mllib.clustering.{BisectingKMeans, BisectingKMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/sample_bisecting_kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(',').map(_.toDouble)))

// Cluster the data into three clusters using BisectingKMeans
val model = new BisectingKMeans().setK(3).run(parsedData)

// Output the compute cost and the cluster centers
println(s"Compute Cost: ${model.computeCost(parsedData)}")
model.clusterCenters.zipWithIndex.foreach { case (center, idx) =>
  println(s"Cluster Center ${idx}: ${center}")
}
{% endhighlight %}
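
The trained model can also be used to assign points to leaf clusters. A short follow-up sketch is shown below; it assumes `BisectingKMeansModel` exposes `predict` for both single vectors and RDDs of vectors, in the same way as the k-means model.

{% highlight scala %}
// Assign a single point to its nearest leaf cluster
val point = Vectors.dense(5.0, 3.5, 1.3, 0.3)
println(s"Point ${point} belongs to cluster ${model.predict(point)}")

// Assign every point in the training data to a cluster
val assignments = model.predict(parsedData)
assignments.take(5).foreach(println)
{% endhighlight %}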
</div>

</div>

## Streaming k-means

When data arrive in a stream, we may want to estimate clusters dynamically,
1 change: 1 addition & 0 deletions docs/mllib-guide.md
@@ -48,6 +48,7 @@ We list major functionality from both below, with links to detailed guides.
* [Gaussian mixture](mllib-clustering.html#gaussian-mixture)
* [power iteration clustering (PIC)](mllib-clustering.html#power-iteration-clustering-pic)
* [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda)
* [bisecting k-means](mllib-clustering.html#bisecting-k-means)
* [streaming k-means](mllib-clustering.html#streaming-k-means)
* [Dimensionality reduction](mllib-dimensionality-reduction.html)
* [singular value decomposition (SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd)