Enhancing K-Means initialization options #286
Draft
We're finding that `SKMeans` can perform quite poorly with the random partition method of initialization, insofar as it is prone to getting stuck in local minima. This can be a problem for k-means more generally (see [1]), depending on the distribution of the data: because k-means uses hard assignment, random partition is more likely to yield a poor choice of initial centroids, and the subsequent iterations are then unlikely to converge well.
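For reference, here is a minimal sketch of what random partition initialization does (Python/NumPy, purely illustrative; the function name and API are assumptions, not this codebase's). Because every point is assigned to a cluster uniformly at random, each initial centroid is the mean of an essentially arbitrary subset, so all `k` centroids start near the global mean of the data, which is exactly the kind of start that hard assignment struggles to recover from.

```python
import numpy as np

def random_partition_init(data, k, rng):
    """Illustrative random-partition initialization (not the project's code)."""
    labels = rng.integers(0, k, size=len(data))  # random hard assignment of every point
    # Each centroid is the mean of an arbitrary subset, so all k centroids
    # tend to land near the global mean of the data. (Empty clusters would
    # need handling in a real implementation.)
    return np.stack([data[labels == j].mean(axis=0) for j in range(k)])
```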
This PR explores improvements, first by adding an option to sk-means for 'Forgy' initialisation, which simply initializes the means by sampling `k` points from the data. This seems to produce improvements for sk-means in some simple cases. However, it is still not the state of the art in initialization procedures, because it depends on how representative those `k` samples are. More sophisticated schemes such as k-means++ build up a more accurate approximation of the data distribution and sample the initial centroids from it. k-means++ is expensive, however. Fortunately, there are now more efficient approximate schemes that use MCMC to approximate that distribution with only a single pass over the data.
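To make the comparison concrete, here is a hedged sketch of the two initialisation schemes discussed above (Python/NumPy, illustrative only; the names and signatures are assumptions, not this repository's API). Forgy draws `k` distinct data points as the initial means; k-means++ draws each successive centre with probability proportional to its squared distance from the nearest centre already chosen, which is what makes it both better behaved and more expensive, and what the single-pass MCMC schemes approximate.

```python
import numpy as np

def forgy_init(data, k, rng):
    """Forgy: use k distinct data points as the initial means (illustrative sketch)."""
    idx = rng.choice(len(data), size=k, replace=False)
    return data[idx].copy()

def kmeanspp_init(data, k, rng):
    """k-means++ seeding via D^2 sampling (illustrative sketch)."""
    centres = [data[rng.integers(len(data))]]          # first centre drawn uniformly
    for _ in range(k - 1):
        # Squared distance from every point to its nearest centre so far.
        diffs = data[:, None, :] - np.asarray(centres)[None, :, :]
        d2 = (diffs ** 2).sum(-1).min(axis=1)
        # Sample the next centre proportionally to that squared distance.
        centres.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.asarray(centres)
```

Usage would be something like `forgy_init(data, k, np.random.default_rng(0))` on an `(n, d)` array; the point is only to show the shape of the options, not to prescribe an implementation.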
This PR will stay WIP for a bit whilst we explore this in a controlled way. This initial commit aims at a minimal set of changes to skmeans to provide the choice of initialisations. I imagine the following steps:
- Forgy initialization can improve matters for sk-means

[1] Greg Hamerly and Charles Elkan. 2002. Alternatives to the k-means algorithm that find better clusterings. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM '02). Association for Computing Machinery, New York, NY, USA, 600–607. https://doi.org/10.1145/584792.584890