Enhancing K-Means initialization options #286

Draft: weefuzzy wants to merge 1 commit into main

Conversation

weefuzzy (Member)

We're finding that SKMeans can perform quite poorly with the random partition method of initialisation, insofar as it is prone to getting stuck in local minima.

This can be a problem for k-means more generally (see [1]), depending on the distribution of the data: k-means's hard assignments mean that random partition is more likely to produce a poor choice of initial centroids, from which the subsequent iterations are unlikely to converge well.
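
To make the failure mode concrete, here is a minimal NumPy sketch of random partition seeding (an illustration, not the flucoma-core implementation): because every initial centroid is the mean of roughly n/k uniformly chosen points, all k centroids start packed around the grand mean of the data, which is exactly the kind of start the hard assignments struggle to recover from.

```python
import numpy as np

def random_partition_init(X, k, rng=np.random.default_rng()):
    """Random partition seeding: give each point a uniformly random
    cluster label, then use each partition's mean as an initial
    centroid. Each centroid averages ~n/k random points, so all k
    of them land near the grand mean of the data."""
    labels = rng.integers(0, k, size=len(X))
    return np.stack([X[labels == j].mean(axis=0) if np.any(labels == j)
                     else X[rng.integers(len(X))]  # guard: empty partition
                     for j in range(k)])
```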

This PR explores improvements, first by adding an option to sk-means for 'Forgy' initialisation, which initialises the means by sampling k points from the data. This seems to produce improvements for sk-means in some simple cases. However, it is still not the state of the art in initialisation procedures, because its quality depends on how representative those k samples are.
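
For contrast, a Forgy sketch in the same vein: the initial centroids are actual data points, so they at least lie where the data is, but k uniform draws can still miss a small cluster entirely.

```python
def forgy_init(X, k, rng=np.random.default_rng()):
    """Forgy seeding: draw k distinct data points and use them
    directly as the initial centroids."""
    return X[rng.choice(len(X), size=k, replace=False)]
```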

More sophisticated schemes such as k-means++ build up a more accurate approximation of the data distribution and sample from that. k-means++ is expensive, however. Fortunately, there are now more efficient approximate schemes that use MCMC to approximate the distribution with only a single pass over the data.
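
For reference, a sketch of k-means++ seeding in the same style. Each new centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen, which spreads the centroids out over the data; the cost is a full pass over the data for every centroid.

```python
def kmeanspp_init(X, k, rng=np.random.default_rng()):
    """k-means++ seeding: first centroid uniform at random; each
    subsequent centroid drawn with probability proportional to its
    squared distance from the nearest centroid chosen so far.
    Needs a full pass over the data per centroid, hence the expense."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.stack(centroids)
```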

This PR will stay WIP for a bit whilst we explore this in a controlled way. This initial commit aims at a minimal set of changes to skmeans to provide the choice of initialisations. I imagine the following steps:

  • verify that Forgy initialization can improve matters for sk-means
  • refactor k-means a bit so that it can offer a choice of init schemes too
  • develop and test MCMC initialisation for both k-means and sk-means (see the sketch after this list)
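
The PR doesn't name the MCMC scheme, but AFK-MC² (Bachem et al., 2016) is a well-known candidate matching the description: a single pass over the data builds a proposal distribution, and each subsequent centroid is then drawn by a short Metropolis-Hastings chain instead of the full re-scan k-means++ needs. A sketch under that assumption (the chain length m is a hypothetical quality/cost knob):

```python
def mcmc_init(X, k, m=100, rng=np.random.default_rng()):
    """MCMC approximation to k-means++ seeding, in the spirit of
    AFK-MC^2 (Bachem et al. 2016) -- an assumed stand-in, since the
    PR doesn't name the scheme. One pass over the data builds the
    proposal q; thereafter each centroid costs only m candidate
    evaluations rather than a scan of all n points."""
    n = len(X)
    centroids = [X[rng.integers(n)]]              # first centroid: uniform
    d2 = ((X - centroids[0]) ** 2).sum(axis=1)
    q = 0.5 * d2 / d2.sum() + 0.5 / n             # proposal, mixed with uniform
    for _ in range(k - 1):
        cand = rng.choice(n, size=m, p=q)         # m proposals for this centroid
        x = cand[0]
        dx = min(((X[x] - c) ** 2).sum() for c in centroids)
        for y in cand[1:]:
            dy = min(((X[y] - c) ** 2).sum() for c in centroids)
            # Metropolis-Hastings acceptance for target d(.,C)^2, proposal q
            if dy * q[x] > dx * q[y] * rng.random():
                x, dx = y, dy
        centroids.append(X[x])
    return np.stack(centroids)
```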

[1] Greg Hamerly and Charles Elkan. 2002. Alternatives to the k-means algorithm that find better clusterings. In Proceedings of the eleventh international conference on Information and knowledge management (CIKM '02). Association for Computing Machinery, New York, NY, USA, 600–607. https://doi.org/10.1145/584792.584890

weefuzzy self-assigned this Oct 31, 2024
tremblap (Member) commented Nov 1, 2024

SC and docs added:

flucoma/flucoma-sc#178
flucoma/flucoma-docs#205

weefuzzy (Member, Author) commented Nov 2, 2024

You were underwhelmed by the improvement from Forgy initialisation on the 'rays' test dataset. Like random partition, just selecting k data points can still be arbitrarily bad, and how bad is partly a function of the data distribution.

The following images plot 10 runs each of random partition, Forgy, k-means++ and the MCMC approximation against the rays data, both centred and un-centred. The coloured points in each plot are the estimated centroids, and the grey points are the normalised data. Hopefully this sheds some light on how these different schemes compare.

The take-home is: k-means++ performs consistently very well, as expected. MCMC does almost as well (for an order of complexity less). Random partition is pretty bad, especially in the non-centred case. Forgy is quite variable: it often does OK, but can occasionally do appallingly; in the centred case it is more prone to missing one of the clusters entirely, leaving much more work for the subsequent iterations to do.

Centred

[Image: rays_centroids_centered]

Not centred

[Image: rays_centroids_uncentered]

tremblap (Member) commented Nov 3, 2024

This is an amazing explanation and demonstration. Thank you for taking the time.

tremblap (Member) commented Nov 3, 2024

Another question: I find initialize ambiguous... a verb like that implies a boolean in our interface. In effect, the full phrase would be initialisation method, so I think initMethod might be more instructive to the musician reader: it gets me thinking a bit about what is under the hood. Then I could write a one-liner for each of the available methods (that we keep) giving the pros and cons.
