This is a repository of data sets and metrics which may be used to evaluate clustering algorithms. The data sets include:
- Marek Gagolewski's collection (direct access to data). See `?load_gagolewski` for details on how to load the data sets. Quite a few of these are not very "naturalistic"; among the low-dimensional data sets, the "sipu" and "fcps" batteries arguably have a larger proportion of problems that one might imagine arising from real data.
The package provides two metrics for comparing the agreement between two clusterings of the same data set. Both support only hard clustering; at present there is no metric for soft clusterings. `ami` is recommended because it does not require that both clusterings have the same number of clusters, and it scales reasonably well when there are many clusters.
Clustering.jl contains additional evaluation metrics. Its `mutualinfo(x, y; normed=true)` is closest to this package's `ami`; see the AMI paper for details on the differences.
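For intuition, here is a minimal, dependency-free sketch of *unadjusted* normalized mutual information (NMI) between two hard clusterings. It is not this package's `ami` — `ami` additionally corrects for chance agreement — and libraries differ in which normalization they apply (arithmetic vs. geometric mean of entropies); this sketch uses the geometric mean.

```julia
# Unadjusted NMI between two hard clusterings given as integer label vectors.
# Sketch only: `ami` additionally applies a correction for chance agreement.
function nmi(x::Vector{Int}, y::Vector{Int})
    n = length(x)
    @assert n == length(y)
    px, py = Dict{Int,Float64}(), Dict{Int,Float64}()
    pxy = Dict{Tuple{Int,Int},Float64}()
    for i in 1:n
        px[x[i]] = get(px, x[i], 0.0) + 1 / n
        py[y[i]] = get(py, y[i], 0.0) + 1 / n
        pxy[(x[i], y[i])] = get(pxy, (x[i], y[i]), 0.0) + 1 / n
    end
    H(p) = -sum(v * log(v) for v in values(p))                    # entropy
    I = sum(v * log(v / (px[a] * py[b])) for ((a, b), v) in pxy)  # mutual information
    return I / sqrt(H(px) * H(py))  # normalize by geometric mean of entropies
end

nmi([1, 1, 2, 2], [2, 2, 1, 1])  # invariant to label permutation: returns 1.0
```

Note that this quantity is high for identical partitions regardless of how the cluster labels are numbered, which is the property that makes such metrics usable for comparing clusterings at all.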
`ami` scores were computed for several algorithms in Clustering.jl on a subset of the Gagolewski collection (see `demos/bench_clustering.jl` for details). The results are shown as a violin plot below:

Higher AMI scores are better.
Some algorithms depend on computing the pairwise distance matrix and thus scale poorly to large data sets. For these algorithms, any data set with more than 3000 points was excluded. Here are the numbers of benchmarks that could be run for each algorithm:
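A quick back-of-the-envelope calculation shows why the cutoff is needed: a dense pairwise distance matrix costs O(n²) memory, so 3000 points already means roughly 70 MiB per matrix in `Float64`, and the cost grows quadratically from there.

```julia
# Memory for a dense n×n Float64 distance matrix at the 3000-point cutoff.
n = 3000
bytes = n^2 * sizeof(Float64)  # 72_000_000 bytes
mib = bytes / 2^20             # ≈ 68.7 MiB
```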
```
5×2 DataFrame
 Row │ Algorithm  Count
     │ Any        Int64
─────┼──────────────────
   1 │ kmeans        45
   2 │ kmedoids      27
   3 │ hclust        27
   4 │ affprop       27
   5 │ dbscan        26
```
Algorithms (or variants) marked with a `*` are fully automated and do not require anything beyond the input data. The remaining algorithms received one or more "hints" from the reference clustering (e.g., the number of clusters).
Some choices were made for the fully-automated variants:
- `dbscan`'s hyperparameters were set to have `2d` neighbors in `d` dimensions, and the radius was set as the mean distance to the `2d`th neighbor.
- `hclust*` split at the longest gap between splits in the dendrogram.
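The `dbscan` heuristic above can be sketched as follows — a brute-force illustration, not the benchmark's actual implementation. `X` is assumed to hold one point per column.

```julia
using Statistics: mean

# Sketch of the dbscan automation described above: with d-dimensional data,
# use k = 2d neighbors, and set the radius to the mean distance from each
# point to its k-th nearest neighbor. Brute-force O(n^2); illustration only.
function dbscan_radius(X::Matrix{Float64})
    d, n = size(X)
    k = 2d
    kth = map(1:n) do i
        dists = sort([sqrt(sum(abs2, X[:, i] .- X[:, j])) for j in 1:n if j != i])
        dists[min(k, n - 1)]  # distance to the k-th nearest neighbor
    end
    return mean(kth)  # use as dbscan's radius (eps), with min_neighbors = k
end
```

For large data sets one would replace the brute-force loop with a k-d tree (e.g. NearestNeighbors.jl), but the choice of `k = 2d` and the mean-of-kth-distances radius are the same.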
There may be better automation strategies than these. In particular, any algorithm can be automated by optimizing an evaluation metric, although one still needs to choose, e.g., the range of cluster counts to consider.
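As a concrete (if simplistic) sketch of that strategy, one can scan a range of cluster counts and keep the one with the best internal score. Here `cluster` and `score` are placeholders for an algorithm and an evaluation metric of your choice (e.g. mean silhouette width); neither name comes from this package.

```julia
# Pick the number of clusters k by maximizing an internal evaluation metric.
# `cluster(X, k)` returns a clustering of X into k clusters; `score(X, c)`
# rates it (higher is better). Both are user-supplied placeholders.
function best_k(X, krange, cluster, score)
    bestscore, bestk = -Inf, first(krange)
    for k in krange
        s = score(X, cluster(X, k))
        if s > bestscore
            bestscore, bestk = s, k
        end
    end
    return bestk
end
```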