This is a repository of data sets and metrics which may be used to evaluate clustering algorithms. The data sets include:
- Marek Gagolewski's collection (direct access to data). See `?load_gagolewski` for details on how to load the data sets. Quite a few of these are not very "naturalistic"; among the low-dimensional data sets, the "sipu" and "fcps" batteries arguably have a larger proportion of problems that one might imagine arising from real data.
The package provides two metrics for comparing the agreement between two clusterings of the same data set. Both support only hard clustering; at present there is no metric for soft clusterings. `ami` is recommended because it does not require that both clusterings have the same number of clusters, and it scales reasonably well when there are many clusters.
Clustering.jl contains additional evaluation metrics. Its `mutualinfo(x, y; normed=true)` is closest to this package's `ami`; see the AMI paper for details on the differences.
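For intuition, here is a minimal, dependency-free sketch of *unadjusted* normalized mutual information (NMI) between two hard clusterings. It is not this package's `ami` — `ami` additionally corrects for chance agreement — and libraries differ in which normalization they apply (arithmetic vs. geometric mean of entropies); this sketch uses the geometric mean.

```julia
# Unadjusted NMI between two hard clusterings given as integer label vectors.
# Sketch only: `ami` additionally applies a correction for chance agreement.
function nmi(x::Vector{Int}, y::Vector{Int})
    n = length(x)
    @assert n == length(y)
    px, py = Dict{Int,Float64}(), Dict{Int,Float64}()
    pxy = Dict{Tuple{Int,Int},Float64}()
    for i in 1:n
        px[x[i]] = get(px, x[i], 0.0) + 1 / n
        py[y[i]] = get(py, y[i], 0.0) + 1 / n
        pxy[(x[i], y[i])] = get(pxy, (x[i], y[i]), 0.0) + 1 / n
    end
    H(p) = -sum(v * log(v) for v in values(p))                    # entropy
    I = sum(v * log(v / (px[a] * py[b])) for ((a, b), v) in pxy)  # mutual information
    return I / sqrt(H(px) * H(py))  # normalize by geometric mean of entropies
end

nmi([1, 1, 2, 2], [2, 2, 1, 1])  # invariant to label permutation: returns 1.0
```

Note that this quantity is high for identical partitions regardless of how the cluster labels are numbered, which is the property that makes such metrics usable for comparing clusterings at all.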
`ami` scores were computed for several algorithms in Clustering.jl on a subset of the Gagolewski collection (see `demos/bench_clustering.jl` for details). The results are shown as a violin plot below:

Higher AMI scores are better.
Some algorithms depend on computing the pairwise distance matrix and thus scale poorly to large data sets. For these algorithms, any data set with more than 3000 points was excluded. Here are the numbers of benchmarks that could be run for each algorithm:
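A quick back-of-the-envelope calculation shows why the cutoff is needed: a dense pairwise distance matrix costs O(n²) memory, so 3000 points already means roughly 70 MiB per matrix in `Float64`, and the cost grows quadratically from there.

```julia
# Memory for a dense n×n Float64 distance matrix at the 3000-point cutoff.
n = 3000
bytes = n^2 * sizeof(Float64)  # 72_000_000 bytes
mib = bytes / 2^20             # ≈ 68.7 MiB
```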
```
5×2 DataFrame
 Row │ Algorithm  Count
     │ Any        Int64
─────┼──────────────────
   1 │ kmeans        45
   2 │ kmedoids      27
   3 │ hclust        27
   4 │ affprop       27
   5 │ dbscan        26
```
Algorithms (or variants) marked with a `*` are fully automated and do not require anything beyond the input data. The remaining algorithms received one or more "hints" from the reference clustering (e.g., the number of clusters).
Some choices were made for the fully-automated variants:
- `dbscan`'s hyperparameters were set to have `2d` neighbors in `d` dimensions, and the radius was set as the mean distance to the `2d`th neighbor.
- `hclust*` split at the longest gap between splits in the dendrogram.
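The `dbscan` heuristic above can be sketched as follows — a brute-force illustration, not the benchmark's actual implementation. `X` is assumed to hold one point per column.

```julia
using Statistics: mean

# Sketch of the dbscan automation described above: with d-dimensional data,
# use k = 2d neighbors, and set the radius to the mean distance from each
# point to its k-th nearest neighbor. Brute-force O(n^2); illustration only.
function dbscan_radius(X::Matrix{Float64})
    d, n = size(X)
    k = 2d
    kth = map(1:n) do i
        dists = sort([sqrt(sum(abs2, X[:, i] .- X[:, j])) for j in 1:n if j != i])
        dists[min(k, n - 1)]  # distance to the k-th nearest neighbor
    end
    return mean(kth)  # use as dbscan's radius (eps), with min_neighbors = k
end
```

For large data sets one would replace the brute-force loop with a k-d tree (e.g. NearestNeighbors.jl), but the choice of `k = 2d` and the mean-of-kth-distances radius are the same.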
There may be better automation strategies than these. In particular, any algorithm can be automated by optimizing an evaluation metric, although one still needs to choose, e.g., the range of cluster counts to consider.
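As a concrete (if simplistic) sketch of that strategy, one can scan a range of cluster counts and keep the one with the best internal score. Here `cluster` and `score` are placeholders for an algorithm and an evaluation metric of your choice (e.g. mean silhouette width); neither name comes from this package.

```julia
# Pick the number of clusters k by maximizing an internal evaluation metric.
# `cluster(X, k)` returns a clustering of X into k clusters; `score(X, c)`
# rates it (higher is better). Both are user-supplied placeholders.
function best_k(X, krange, cluster, score)
    bestscore, bestk = -Inf, first(krange)
    for k in krange
        s = score(X, cluster(X, k))
        if s > bestscore
            bestscore, bestk = s, k
        end
    end
    return bestk
end
```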