GitHub - JasonAltschuler/KMeansPlusPlus: Improvements on classical KMeans clustering

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
src		src
README		README

Repository files navigation

/**********************************************************************
* @author Jason Altschuler
*
* Description of KMeans
**********************************************************************/

  ** PURPOSE: Clusters n-dimensional points.

  ** ALGORITHM: KMeans++, an improvement on the classical K-Means.

  ** COST FUNCTION: minimize Within-Cluster-Sum-of-Squares (WCSS)

WCSS := sum(sum(|C_i - X_j|))

   where C_i is the ith cluster {1 ... k},
         X_j is the jth point that is assigned to C_i

  ** OVERVIEW:
     - Step 1: Randomly chooses without replacement k (# of clusters)
      data points to be initial centroids, either by basic random
      sampling or by KMeans++ (described below).
     - Step 2: Cluster multiple times. Clustering consists of two
      repeating alternating steps.
       - Step 2a: Assignment step: assign points  to nearest centroids.
       - Step 2b: Update step: recalculate centroids based on assignments.
         - Steps 2a and 2b are repeated until the marginal improvement
         in WCSS (the cost function) is less than 'epsilon' (provided
         by user, or default = .001).
     - Step 3: Choose the clustering with the lowest WCSS.

  ** DESCRIPTION OF KMEANS++: An improved method for choosing initial
    centroids. Iteratively chooses the next centroid based on a
    weighted distribution, where the probability a data point is chosen
    is directly proportional to the square of the distance between that
    point and the nearest already chosen centroid.

  ** PARAMETERS / DATA STRUCTURES: for more description, see the
    documentation in the code itself.
Default and suggested values are provided directly below.

  ** REQUIRED PARAMETERS:
int k                -- number of clusters. Recommended: fewer is better.
                     See link 3 at bottom for suggestions on choosing k.
double[][] points    -- n-dimensional points.
                     For PhenoRipper, this is pixels by channels, and
                     entries stores pixel intensities.

  ** OPTIONAL PARAMETERS: default and suggested values
int iterations       -- Default: 50. Recommended [50, 1000]
boolean pp           -- Default: true. Recommended: true. KMeans++
                     typically converges faster than basic random sampling.
double epsilon       -- Default: .001. Recommended: [.0001 to .01].
                     Smaller -> more precision. Larger -> faster.
                     Remember to set 'useEpsilon' to true if using this.
boolean useEpsilon   -- Default: true. Recommended unless extremely exact
                     results needed. Otherwise, potentially  much slower.

  ** OTHER DATA STRUCTURES: calculated from dimension of points[][]
int m                -- # of data points. For PhenoRipper: # of pixels.
int n                -- # of dimensions. For PhenoRipper: # of channels.

  ** OUTPUT:
double[][] centroids -- coordinates of cluster centroids.
int[] assignment     -- ith point is assigned to assignment[i] cluster
double WCSS          -- Within-Cluster-Sum-Of-Squares. Cost function.

  ** EXAMPLE RUN:
double[][] points = User.provideData();         // you need to provide
int k = User.provideNumberOfCluster();          // you need to provide

KMeans example = new KMeans.builder(k, points)  // required
                     .iterations(50)            // optional
                     .pp(true)                  // optional
                     .epsilon(.001)             // optional
                     .useEpsilon(true)          // optional
                     .build();                  // required

double[][] centroids = example.getCentroids();  // cluster centroids
int[] assignment     = example.getAssignment(); // cluster assignments
double WCSS          = example.getWCSS();       // cost function

User.do_something_awesome_with_the_data();      // enjoy!


  ** SOURCES:
   1. http://en.wikipedia.org/wiki/K-means_clustering
   2. http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
   3. http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set