C++ implementation of the mean shift clustering algorithm with OpenMP.
In this README you will find instructions about how to compile the program and
how to execute it. In the directory Paper
there is a technical paper about
the algorithm and the comparison of the sequential version with the parallel
version.
If you are used to CLion, you could open the directory MeanShift
with the
IDE. You will find all the necessary build configurations.
Otherwise, go to the MeanShift
directory, create a build folder and change
directory to that folder:
cd Meanshift
mkdir build
cd build
Then you can compile the program in two versions:
- A sequential version:
cmake -DCMAKE_BUILD_TYPE=Release ../
make meanshift
- A parallel version, with OpenMP:
cmake -DCMAKE_BUILD_TYPE=Release DCMAKE_CXX_FLAGS=-fopenmp ../
make meanshift
The parallel version will be much faster, because it uses the OpenMP directives to perform multithreading operations.
The result of the compilation is an executable called meanshift
. It takes
as input a CSV file containing the points to cluster and the bandwidth
parameter of the mean shift algorithm. The points in the CSV file must be all
of the same dimension (each row of the CSV file must contains the same
number of columns) and there must be only numerical values. To get the
clustering result, you must append to the command the string --write-output <filename>
: the program will generate a file <filename>.csv
containing the
points where the last number of each row is the cluster to which the point
belongs.
In the directory ClusteringTests/datasets
there are some datasets to
perform some tests.
I strongly reccomend you to use the python script plot_clusters.py
located in the directory ClusteringTests
to execute the program and plot
the results. It requires Python >= 3.5, numpy and matplotlib.
plot_clusters.py
take as input the CSV file and the bandwidth parameter.
Pay attention where the meanshift
executable is located, you could set the
right path in the script plot_clusters.py
.
The dataset data8.csv
contains 10000 points distributed in three gaussian
blobs with standard deviation 1.
If we execute the mean shift algorithm with bandwidth 1 we get exactly the three clusters:
$ python plot_clusters.py datasets/data8.csv 1
File: datasets/data8.csv
Number of points: 10000
Number of dimensions: 3
Number of clusters: 3
Elapsed time: 6.66942 s
If we increase the bandwidth value to 10, then all the points will go in the same cluster:
$ python plot_clusters.py datasets/data8.csv 10
File: datasets/data8.csv
Number of points: 10000
Number of dimensions: 3
Number of clusters: 1
Elapsed time: 22.963 s
The file speeduptest
in the directoryClusteringTests
is a BASH script
that executes the sequential version and then executes different parallel
versions with a different number of threads to estimate the speedup obtained
by parallelizing the algorithm.
The results and the analysis of the speedup is described in the technical
paper in the directory Paper
.