- Our parallel and sequential code beats scikit-learn and pyclustering on reasonably large collections.
- Our parallel code using shared memory gives a 10x-15x speedup.
For 100,000 points we got some interesting results: our sequential code is considerably faster than the third-party implementations. Below are the times taken by the different packages and settings on the 100,000-point dataset:
Sequential (s) | CUDA T4 (s) | Scikit-learn (s) | CUDA GTX 1650 (s) | CUDA shared memory (s) | CUDA co-operative groups (s) |
---|---|---|---|---|---|
154 | 879.9 | ~ | 896 | 15.1 | 0.25 |
Note: Datasets with more than 50,000 points did not load on our laptops because the scikit-learn implementation requires ~18 GB of RAM. As a third-party alternative for comparison, we chose pyclustering, which uses minimal RAM.
Using shared memory in `_dissimilarities` and `kMedoidsClusterAssignmentKernel`, we could significantly reduce the runtime with minimal distortion of the clusters (a sketch of the shared-memory pattern follows the table below).
Number of points | K | No shared memory (s) | Shared memory (s) |
---|---|---|---|
100,000 | 10 | 879 | 15.1 |
50,000 | 10 | 223.3 | 2.26 |
75,000 | 10 | 445 | 5.7 |
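As a rough illustration of the shared-memory idea (a sketch only, not the repo's actual kernel), the code below stages the `K` medoid coordinates in shared memory once per block, so every thread's nearest-medoid loop reads on-chip memory instead of global memory. The names, the 2-D `x`/`y` data layout, and the one-thread-per-point mapping are assumptions.

```cuda
#include <cfloat>

// Hypothetical shared-memory cluster-assignment kernel (sketch only).
__global__ void assignClustersShared(const float *xs, const float *ys,
                                     const float *medX, const float *medY,
                                     int *assign, int n, int k) {
    extern __shared__ float smem[];   // 2*k floats, sized at launch time
    float *sMedX = smem;
    float *sMedY = smem + k;

    // Each block stages the K medoid coordinates into shared memory once.
    for (int j = threadIdx.x; j < k; j += blockDim.x) {
        sMedX[j] = medX[j];
        sMedY[j] = medY[j];
    }
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Nearest-medoid search now touches only on-chip memory inside the loop.
    float best = FLT_MAX;
    int bestK = 0;
    for (int j = 0; j < k; ++j) {
        float dx = xs[i] - sMedX[j];
        float dy = ys[i] - sMedY[j];
        float d = dx * dx + dy * dy;
        if (d < best) { best = d; bestK = j; }
    }
    assign[i] = bestK;
}

// Possible launch: assignClustersShared<<<(n + 255) / 256, 256, 2 * k * sizeof(float)>>>(...);
```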
We also experimented with co-operative groups and found that, for datasets with fewer points than the total number of threads on the GPU device, K-Medoids runs in effectively O(1) time with respect to the number of points (this hypothesis is supported experimentally: the average time for datasets ranging from 1,000 to 100,000 points was 0.267 seconds).
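The sketch below shows the co-operative-groups pattern behind this observation: when every point has its own resident thread, the whole iterative loop can live inside one persistent kernel and `grid.sync()` replaces repeated kernel launches. The kernel body and names are illustrative assumptions rather than the repo's code, and such a kernel must be launched with `cudaLaunchCooperativeKernel` for `grid.sync()` to be valid.

```cuda
#include <cfloat>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical persistent K-Medoids kernel using co-operative groups (sketch only).
// Assumes n is no larger than the number of resident threads on the device.
__global__ void kMedoidsPersistent(const float *xs, const float *ys,
                                   int *medoidIdx, int *assign,
                                   int n, int k, int iters) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int it = 0; it < iters; ++it) {
        if (i < n) {
            // Assignment phase: point i picks its nearest current medoid.
            float best = FLT_MAX;
            int bestK = 0;
            for (int j = 0; j < k; ++j) {
                float dx = xs[i] - xs[medoidIdx[j]];
                float dy = ys[i] - ys[medoidIdx[j]];
                float d = dx * dx + dy * dy;
                if (d < best) { best = d; bestK = j; }
            }
            assign[i] = bestK;
        }
        // A grid-wide barrier replaces a fresh kernel launch; the medoid-update
        // phase (omitted here) would run after this sync and update medoidIdx.
        grid.sync();
    }
}
```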
Number of points | K | CUDA shared memory (s) |
---|---|---|
10000 | 5 | 0.42 |
10000 | 10 | 0.43 |
10000 | 15 | 0.41 |
We do not see much difference in time across values of `K` because the operations that depend on `K` are fully parallelized, so the runtime is effectively constrained only by `N` (see the sketch below).
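One way to remove the loop over `K` entirely, sketched below, is to give every (point, medoid) pair its own thread and fill an `n x k` dissimilarity matrix; a separate per-point argmin pass then yields the assignments. This mirrors the idea behind a dissimilarity kernel such as `_dissimilarities`, but it is only an illustrative sketch, not the repo's actual code.

```cuda
// Hypothetical sketch: one thread per (point, medoid) pair, so no thread loops over K.
__global__ void pairwiseDissimilarities(const float *xs, const float *ys,
                                        const float *medX, const float *medY,
                                        float *dist, int n, int k) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // flat (i, j) index
    if (idx >= n * k) return;
    int i = idx / k;   // data point index
    int j = idx % k;   // medoid index
    float dx = xs[i] - medX[j];
    float dy = ys[i] - medY[j];
    dist[i * k + j] = dx * dx + dy * dy;               // squared Euclidean distance
}
// A follow-up per-point argmin kernel (not shown) would turn dist into cluster assignments.
```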
- Generating synthetic data
  - We provide a Python script in `points_generator.py` to generate synthetic data. It writes a CSV file with the generated points and can be modified to generate data of different dimensions and sizes as required.
  - Run `python points_generator.py --file_name "4_clus_30_points.txt" --points 30 --k 4` to generate a file with 30 points and 4 clusters. The file will be saved as `4_clus_30_points.txt`.
- K-Medoids using sequential C++ code
  - Run `g++ -o compiled_seq_kmedoids kmedoids_sequential.cpp` to compile the code.
  - Run `./compiled_seq_kmedoids <input_file> <OUTPUT_DIR>` to run the code. `<input_file>` is the text file generated by the `points_generator.py` script. `<OUTPUT_DIR>` is the directory where the output files will be saved; these include the final cluster assignments and the medoids in `clusters.csv`, `centroids.csv` and `datapoints.csv`.
  - To access the results in Python, run the following code.

    ```python
    from utility import get_our_clustering

    df1, centroids1 = get_our_clustering("15_points")  # Here "15_points" is the dir name
    ```

    `df1` contains the points and their corresponding cluster ids, so its columns are `x`, `y` and `c`.
- K-Medoids using our CUDA implementation
  - Run `nvcc -o compiled_cuda_kmedoids kmedoids_cuda.cu` to compile the code.
  - Run `./compiled_cuda_kmedoids` to run the code. Currently the variables have to be hardcoded in the source itself and the code recompiled, for example as in the snippet below.
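  For reference, the values that need to be edited before recompiling typically look like the snippet below; the identifiers here are illustrative assumptions, not the actual names used in `kmedoids_cuda.cu`.

  ```cuda
  // Illustrative only: the real variable names and values in kmedoids_cuda.cu may differ.
  #define NUM_POINTS   30                 // number of data points in the input file
  #define NUM_CLUSTERS 4                  // K, the number of medoids
  #define MAX_ITERS    100                // cap on the main assign/update loop
  static const char *INPUT_FILE = "4_clus_30_points.txt";  // hypothetical input path
  ```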
- K-Medoids using scikit-learn-extra
  - Use `get_sklearn_clustering` from `utility.py` to get the cluster assignments and medoids using scikit-learn-extra.
  - Run the code below.

    ```python
    from utility import get_sklearn_clustering

    df2, centroids2 = get_sklearn_clustering("15_points.txt", 4)
    ```
- Visualizing the results
  - We provide a function `compare_cluster_plots` in `utility.py` to compare two clustering results. It plots the clusters of the two results side by side.
  - Run the code below to visualize the results.

    ```python
    from utility import compare_cluster_plots

    compare_cluster_plots(df1, df2, ["Scikit-learn", "Our implementation"])
    ```