An R repackaging of datasets useful for evaluating clustering methods. The source for most of them is http://cs.joensuu.fi/sipu/datasets
I would love to include additional clustering datasets. If you have one to contribute, please send it along or open a PR.
This vignette provides a simple overview of the datasets included in the package.
The S-sets are useful for testing how an algorithm handles cluster overlap.
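As a rough illustration of what "handling overlap" means, the sketch below does not use the S-sets themselves. It assumes two synthetic Gaussian clusters generated in base R, moves their centers closer together, and reports how often k-means recovers the true grouping. The helper names `simulate_pair` and `kmeans_accuracy` are hypothetical and are not part of this package.

```r
# Illustrative sketch only: synthetic Gaussian data, not the S-sets.
set.seed(42)

# Draw n points per cluster around two centers separated by `gap` on the x axis.
simulate_pair <- function(gap, n = 500) {
  samples <- rbind(
    cbind(rnorm(n, mean = 0), rnorm(n, mean = 0)),
    cbind(rnorm(n, mean = gap), rnorm(n, mean = 0))
  )
  list(samples = samples, labels = rep(1:2, each = n))
}

# Fraction of points k-means assigns to the correct cluster,
# taking the better of the two possible label matchings.
kmeans_accuracy <- function(data) {
  fit <- kmeans(data$samples, centers = 2, nstart = 10)
  agree <- mean(fit$cluster == data$labels)
  max(agree, 1 - agree)
}

# Smaller gaps mean more overlap between the two clusters.
gaps <- c(6, 3, 1)
accuracy <- sapply(gaps, function(g) kmeans_accuracy(simulate_pair(g)))
print(data.frame(gap = gaps, accuracy = accuracy))
```

The same pattern (fit, then compare against the known labels) should carry over to the S-sets, which are meant to provide progressively more overlap for exactly this kind of comparison.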
The package contains three sets of high-dimensional data. The visualizations below were made using my largeVis package to reduce each dataset to two dimensions, and the colors are the result of applying the hdbscan function from that package.
The Python sklearn.datasets module includes functions for creating toy datasets. I’ve ported a few of them.
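library(clusteringdatasets)
# Generate blobs with the default settings and color each point by its label
blobs <- make_blobs()
plot(blobs$samples, col=rainbow(3)[blobs$labels])
# Generate the two-moons dataset with a small amount of noise
moons <- make_moons(noise=0.04)
plot(moons$samples, col=rainbow(2)[moons$labels])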
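For readers curious about what a generator like make_moons produces, here is a minimal base-R sketch of the two-moons construction as sklearn documents it: two half circles, one shifted and flipped, with Gaussian noise added. It is not this package's implementation. The function name `sketch_moons`, its arguments, and the 1-based labels are assumptions made for illustration.

```r
# Hypothetical base-R sketch of a two-moons generator; not the package's code.
sketch_moons <- function(n = 100, noise = 0.04) {
  n_upper <- n %/% 2
  n_lower <- n - n_upper
  theta_upper <- seq(0, pi, length.out = n_upper)
  theta_lower <- seq(0, pi, length.out = n_lower)
  samples <- rbind(
    cbind(cos(theta_upper), sin(theta_upper)),           # upper half circle
    cbind(1 - cos(theta_lower), 0.5 - sin(theta_lower))  # lower half circle, shifted
  )
  # Perturb every coordinate with Gaussian noise of standard deviation `noise`.
  samples <- samples + matrix(rnorm(2 * n, sd = noise), ncol = 2)
  # Labels are 1-based here (an assumption) so they can index a color vector.
  list(samples = samples, labels = rep(1:2, c(n_upper, n_lower)))
}

set.seed(1)
demo <- sketch_moons(n = 200, noise = 0.04)
plot(demo$samples, col = rainbow(2)[demo$labels], asp = 1)
```

Raising `noise` blurs the gap between the two moons, which makes the dataset harder for methods that assume compact, convex clusters.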