Enhancing K-Means initialization options #286
Draft
We're finding that `SKMeans` can perform quite poorly with the random partition method of initialization, insofar as it is prone to getting stuck in local minima. This can be a problem for k-means more generally (see [1]), depending on the distribution of the data: because k-means uses hard assignment, random partition is more likely to yield a poor choice of initial centroids, and the subsequent iterations are then unlikely to converge well.
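For reference, here is a minimal sketch of what random partition initialization does (Python/NumPy, purely illustrative; the function name and API are assumptions, not this codebase's). Because every point is assigned to a cluster uniformly at random, each initial centroid is the mean of an essentially arbitrary subset, so all `k` centroids start near the global mean of the data, which is exactly the kind of start that hard assignment struggles to recover from.

```python
import numpy as np

def random_partition_init(data, k, rng):
    """Illustrative random-partition initialization (not the project's code)."""
    labels = rng.integers(0, k, size=len(data))  # random hard assignment of every point
    # Each centroid is the mean of an arbitrary subset, so all k centroids
    # tend to land near the global mean of the data. (Empty clusters would
    # need handling in a real implementation.)
    return np.stack([data[labels == j].mean(axis=0) for j in range(k)])
```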
This PR explores improvements, first by adding an option to sk-means for 'Forgy' initialisation, which simply initializes the means by sampling `k` points from the data. This seems to produce improvements for sk-means in some simple cases. However, it is still not the state of the art in initialization procedures, because it depends on how representative those `k` samples are. More sophisticated schemes such as k-means++ build up a more accurate approximation of the data distribution and sample the initial centroids from it. k-means++ is expensive, however. Fortunately, there are now more efficient approximate schemes that use MCMC to approximate that distribution with only a single pass over the data.
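To make the comparison concrete, here is a hedged sketch of the two initialisation schemes discussed above (Python/NumPy, illustrative only; the names and signatures are assumptions, not this repository's API). Forgy draws `k` distinct data points as the initial means; k-means++ draws each successive centre with probability proportional to its squared distance from the nearest centre already chosen, which is what makes it both better behaved and more expensive, and what the single-pass MCMC schemes approximate.

```python
import numpy as np

def forgy_init(data, k, rng):
    """Forgy: use k distinct data points as the initial means (illustrative sketch)."""
    idx = rng.choice(len(data), size=k, replace=False)
    return data[idx].copy()

def kmeanspp_init(data, k, rng):
    """k-means++ seeding via D^2 sampling (illustrative sketch)."""
    centres = [data[rng.integers(len(data))]]          # first centre drawn uniformly
    for _ in range(k - 1):
        # Squared distance from every point to its nearest centre so far.
        diffs = data[:, None, :] - np.asarray(centres)[None, :, :]
        d2 = (diffs ** 2).sum(-1).min(axis=1)
        # Sample the next centre proportionally to that squared distance.
        centres.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.asarray(centres)
```

Usage would be something like `forgy_init(data, k, np.random.default_rng(0))` on an `(n, d)` array; the point is only to show the shape of the options, not to prescribe an implementation.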
This PR will stay WIP for a bit whilst we explore this in a controlled way. This initial commit aims at a minimal set of changes to skmeans to provide the choice of initialisations. I imagine the following steps:
- Forgy initialization can improve matters for sk-means

[1] Greg Hamerly and Charles Elkan. 2002. Alternatives to the k-means algorithm that find better clusterings. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM '02). Association for Computing Machinery, New York, NY, USA, 600–607. https://doi.org/10.1145/584792.584890