In an unsupervised learning setting, we are given a data matrix $X \in \mathbb{R}^{N \times D}$ of $N$ unlabeled points, and the goal is to group them into clusters.
The centroids are initialized either at random or using a smarter technique (e.g. see k-means++ in the next section). Then, as long as the centroids keep changing, the algorithm iterates:
- Compute the distances between all the points and all the centroids.
- Assign each point to the cluster of its closest centroid.
- Update each centroid by averaging the points assigned to its cluster.
This algorithm is not computationally efficient, as it computes the distance between every point and every centroid, costing $O(NkD)$ per iteration.
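As an illustration, here is a minimal NumPy sketch of this loop, assuming a data matrix `X` of shape `(N, D)`; the function name, the iteration cap, and the empty-cluster handling are illustrative choices:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids at random among the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Compute the distances between all points and all centroids: shape (N, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        # Assign each point to the cluster of its closest centroid.
        labels = dists.argmin(axis=1)
        # Update each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```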
K-means++ init is a smart initialization method for unsupervised learning when the number of clusters is known or guessed.
It finds the centroids as follows:
1. Choose one center uniformly at random among the data points.
2. For each data point $x$ not chosen yet, compute $D(x)$, the distance between $x$ and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point $x$ is chosen with probability proportional to $D(x)^2$.
4. Repeat steps 2 and 3 until $k$ centers have been chosen.
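A minimal NumPy sketch of this initialization, under the same assumptions as above (`kmeans_pp_init` is an illustrative name):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose the first center uniformly at random among the data points.
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # Step 2: D(x) = distance from each point to its nearest chosen center.
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                                  axis=-1), axis=1)
        # Step 3: pick a new center with probability proportional to D(x)^2
        # (already-chosen points have D(x) = 0, so they cannot be re-picked).
        centers.append(X[rng.choice(len(X), p=d**2 / np.sum(d**2))])
    return np.array(centers)
```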
K-medoids is very similar to k-means, but the algorithm only uses points of the dataset as centroids. At each step it therefore chooses, as the centroid, the dataset point closest to the mean of the points assigned to the cluster (instead of simply taking this mean, as in k-means).
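A sketch of this medoid update step, assuming the same NumPy setup as in the k-means sketch above:

```python
import numpy as np

def medoid_update(X, labels, k):
    # For each cluster, compute the mean of the assigned points, then return
    # the dataset point closest to that mean (instead of the mean itself).
    medoids = []
    for j in range(k):
        members = X[labels == j]
        mean = members.mean(axis=0)
        medoids.append(members[np.argmin(np.linalg.norm(members - mean, axis=1))])
    return np.array(medoids)
```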
In GMMs the centroids are represented as Gaussian distributions, with mean $\boldsymbol{\mu}_{k}$, covariance $\boldsymbol{\Sigma}_{k}$ and mixture weight $\pi_{k}$ for cluster $k$. The algorithm then uses the EM algorithm to find the best-fitting distribution parameters for the centroids. The EM algorithm simply iterates the following two steps:
- The Expectation step: $$ r_{nk} = \frac{\pi_{k} \mathcal{N}\left(x_{n} \mid \mu_{k}, \sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} \mathcal{N}\left(x_{n} \mid \mu_{j}, \sigma_{j}\right)} $$ $$N_{k} = \sum_{n=1}^{N} r_{n k}$$
- The Maximization step: $$ \boldsymbol{\mu}_{k} =\frac{1}{N_{k}} \sum_{n=1}^{N} r_{n k} \boldsymbol{x}_{n}, \quad \boldsymbol{\Sigma}_{k} =\frac{1}{N_{k}} \sum_{n=1}^{N} r_{n k}\left(\boldsymbol{x}_{n}-\boldsymbol{\mu}_{k}\right)\left(\boldsymbol{x}_{n}-\boldsymbol{\mu}_{k}\right)^{\top}, \quad \pi_{k} =\frac{N_{k}}{N} $$
The algorithm stops when either the maximum number of iterations is reached or the negative log-likelihood (NLL) stops decreasing; the NLL is in this context defined as follows: $$ -\sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_{k} \, \mathcal{N}\left(x_{n} \mid \mu_{k}, \sigma_{k}\right) \right) $$
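Putting the two EM steps and the stopping criterion together, a compact sketch might look as follows, assuming `X` of shape `(N, D)` and using SciPy for the Gaussian density; the small ridge on the covariances is an added numerical safeguard, not part of the algorithm above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialize: means on random data points, identity covariances, uniform weights.
    mu = X[rng.choice(N, size=k, replace=False)]
    sigma = np.stack([np.eye(D) for _ in range(k)])
    pi = np.full(k, 1.0 / k)
    prev_nll = np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r_nk, shape (N, k).
        dens = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                         for j in range(k)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: N_k, then mu_k, Sigma_k, pi_k from the update equations above.
        Nk = r.sum(axis=0)
        mu = (r.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (r[:, j, None] * diff).T @ diff / Nk[j]
            sigma[j] += 1e-6 * np.eye(D)  # ridge to keep covariances invertible
        pi = Nk / N
        # Stop when the negative log-likelihood stops decreasing.
        nll = -np.log(dens.sum(axis=1)).sum()
        if prev_nll - nll < tol:
            break
        prev_nll = nll
    return mu, sigma, pi, r
```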
Fuzzy C-means takes two parameters as input:
- The number of clusters $k$
- The fuzziness $p$
The weight matrix $w$, of size $(n, k)$, is initialized randomly under the condition that each row sums to one: $\sum_{j=1}^{k} w_{i,j} = 1$ for every point $i$.
The algorithm then iterates the following two updates until the centroids stop changing:
- Update the centroids: $$C_j = \dfrac{\sum_{i=1}^{n} w_{i,j}^{p} \, X_i}{\sum_{i=1}^{n} w_{i,j}^{p}}$$
- Update the weights: $$w_{i,j} = \dfrac{\left(\dfrac{1}{\operatorname{dist}(X_i, C_j)}\right)^{\frac{1}{p-1}}}{\sum_{l=1}^{k} \left(\dfrac{1}{\operatorname{dist}(X_i, C_l)}\right)^{\frac{1}{p-1}}}$$
A data point is then assigned to a cluster according to the membership probabilities given by the weight matrix (typically the cluster with the highest weight).
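A minimal NumPy sketch of the full fuzzy C-means loop, following the two update equations above; the `1e-12` floor on distances is an added safeguard against division by zero:

```python
import numpy as np

def fuzzy_cmeans(X, k, p=2.0, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random (n, k) weight matrix whose rows sum to one.
    w = rng.random((n, k))
    w /= w.sum(axis=1, keepdims=True)
    centroids = None
    for _ in range(max_iter):
        # Update the centroids as w^p-weighted means of the points.
        wp = w ** p
        new_centroids = (wp.T @ X) / wp.sum(axis=0)[:, None]
        if centroids is not None and np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
        # Update the weights from the inverse distances to each centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        inv = (1.0 / np.maximum(dist, 1e-12)) ** (1.0 / (p - 1))
        w = inv / inv.sum(axis=1, keepdims=True)
    # Hard assignment: each point goes to its highest-weight cluster.
    return centroids, w, w.argmax(axis=1)
```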