data
+-<dataset>
  +-devices
  +-central
    +-devicesCells
    +-consolidate
    +-results
    +-DBSCAN
  +-config
script
sumData
getCluster
- `data`: this directory holds the data to be processed. For organization, each dataset has its own directory, named `<dataset>`.
- `devices`: holds the datasets collected by the devices (CSV files).
- `central`: used to store all data in the central node. If it doesn't exist, it is created at run time.
- `devicesCells`: receives the cells sent by each device. For testing purposes it may also contain raw data.
- `consolidate`: the script gathers all the cell CSV files received in the `devicesCells` directory and creates a single file with all cells. It may also contain raw data, if the devices sent raw data for testing purposes.
- `results`: stores the results of the clustering algorithms:
  - CSV files
  - an SVG file, if the clustered dataset has 2 dimensions
- `DBSCAN`: files created by the DBSCAN algorithm.
- `config`: stores the CSV configuration files.
All the scripts used to run the experiment are stored in the `script` directory.
The main script is `complete.py`, which runs all phases of the process.
The help screen is shown below:
Option | Description
---|---
-h | Show this help
-d <dir> | The data files gathered by the devices will be found in the <dir>/devices directory
-e <epsilon> | Value of Epsilon (default: 10)
-m <cells> | Minimum Cells (default: 3)
-f <force> | Minimum Force (default: 150)
-r | Don't draw rectangles
-g | Don't draw edges
-p | Draw points
-b | Draw numbers
-x | Don't use prefix on the output files
First of all, you need to create a new directory to store the dataset in CSV format.
The CSV files must be stored in the `devices` directory below the `<dataset>` directory you've just created.
The script runs `sumData` for each CSV file, simulating the process that runs on each device.
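Preparing the expected layout only takes a few lines of Python. The helper below is hypothetical (the scripts themselves don't create the `devices` directory for you):

```python
from pathlib import Path

def prepare_dataset(root: str, dataset: str) -> Path:
    """Create the <dataset>/devices directory where the per-device
    CSV files must be placed (hypothetical helper, not part of the scripts)."""
    devices = Path(root) / dataset / "devices"
    devices.mkdir(parents=True, exist_ok=True)
    return devices

# Example: data/mydataset/devices is created if missing.
print(prepare_dataset("data", "mydataset"))
```

After this, drop one CSV file per device into the printed directory.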
You'll need a configuration file, which by default is read from the `<dataset>/config` directory under the name `config-<dataset>.csv`.
This CSV file must contain exactly 4 lines:

- Header: the names of the variables.
  - Example: `X,Y,Id,Classification`
- Variable identification:
  - (C)lustered: variables to be clustered
  - (N)ot clustered
  - C(L)assification: the ground-truth label (if it exists), used to test the clustering algorithm.
  - Example: `C,C,N,L`
- Max values: values used in the linear normalization. This line must contain the maximum value of each column.
  - Example: `30,30,0,0`
- Min values: values used in the linear normalization. This line must contain the minimum value of each column.
  - Example: `2.9,3.7,0,0`

Note: all lines must have the same number of columns.
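A config file with this shape can be read and sanity-checked like this (a hypothetical loader written for illustration, not part of the distributed scripts):

```python
import csv

def load_config(path):
    """Read the 4-line config-<dataset>.csv described above and apply the
    basic sanity checks (hypothetical helper; the real scripts may differ)."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if len(rows) != 4:
        raise ValueError("config file must have exactly 4 lines")
    header, ident, max_vals, min_vals = rows
    if not all(len(r) == len(header) for r in rows):
        raise ValueError("all lines must have the same number of columns")
    if not all(c in ("C", "N", "L") for c in ident):
        raise ValueError("identification line must contain only C, N or L")
    return {
        "names": header,
        "ident": ident,
        "max": [float(v) for v in max_vals],
        "min": [float(v) for v in min_vals],
    }

# Build a sample config matching the examples above and load it.
with open("config-demo.csv", "w") as f:
    f.write("X,Y,Id,Classification\nC,C,N,L\n30,30,0,0\n2.9,3.7,0,0\n")

cfg = load_config("config-demo.csv")
print(cfg["ident"])  # ['C', 'C', 'N', 'L']
```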
The program `sumData` is stored in the `sumData/bin` directory. It summarizes the data stored in `<dataset>/devices/*.csv` and creates a new CSV file with the cells produced by the summarization.

- Input
  - `<dataset>/devices/*.csv`
- Outputs (via the `complete.py` script)
  - `<dataset>/central/devicesCells/cell-<dataset>-<seq>.csv`, where `<seq>` is a sequential number.
  - `<dataset>/central/devicesCells/point-<dataset>-<seq>.csv`. Note: this file is generated only if the `-p` option is used.
- Parameter
  - `-e`: Epsilon parameter. The default value for Epsilon is 10.
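The exact cell construction lives inside the compiled `sumData` program; the sketch below only illustrates the general idea of the summarization step. It assumes (this is an assumption, not documented behavior) that each clustered dimension is linearly normalized with the config min/max values and the unit range is split into `epsilon` intervals, keeping per-cell counts and centers of mass:

```python
from collections import defaultdict

def summarize(points, mins, maxs, epsilon=10):
    """Illustrative sketch of cell summarization: normalize each dimension
    to [0, 1] with the config min/max and bucket points into an
    epsilon x epsilon grid. NOT the actual sumData implementation."""
    cells = defaultdict(lambda: [0, None])  # grid index -> [count, coord sums]
    for p in points:
        norm = [(v - lo) / (hi - lo) for v, lo, hi in zip(p, mins, maxs)]
        idx = tuple(min(int(v * epsilon), epsilon - 1) for v in norm)
        count, sums = cells[idx]
        if sums is None:
            sums = [0.0] * len(norm)
        cells[idx] = [count + 1, [s + v for s, v in zip(sums, norm)]]
    # Center of mass = mean of the normalized points in the cell.
    return {idx: (n, [s / n for s in sums]) for idx, (n, sums) in cells.items()}

cells = summarize([(3, 4), (3.2, 4.1), (25, 28)], mins=(2.9, 3.7), maxs=(30, 30))
print(len(cells))  # 2: two nearby points share one cell, the far point gets its own
```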
The script gathers all the output files created by `sumData` and creates one single file.

- Inputs
  - `<dataset>/central/devicesCells/cell-<dataset>-<seq>.csv`, where `<seq>` is a sequential number.
  - `<dataset>/central/devicesCells/point-<dataset>-<seq>.csv`, if the `-p` option is used.
- Outputs
  - `<dataset>/central/consolidate/cells-<dataset>.csv`
  - `<dataset>/central/consolidate/points-<dataset>.csv`, if the `-p` option is used.
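The gathering step is essentially a header-aware concatenation. A sketch (hypothetical helper; the real consolidation is done by the project's scripts):

```python
import csv
import glob

def consolidate(pattern, out_path):
    """Merge every per-device cell file matched by pattern into one CSV,
    keeping a single header row (a sketch of the gathering step)."""
    rows, header = [], None
    for name in sorted(glob.glob(pattern)):
        with open(name, newline="") as f:
            r = list(csv.reader(f))
        if header is None:
            header = r[0]
        rows.extend(r[1:])  # skip each file's header row
    with open(out_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(header)
        w.writerows(rows)
    return len(rows)

# Demo: two tiny per-device files merged into one consolidated file.
for i, body in enumerate(["1,2\n", "3,4\n"]):
    with open(f"cell-demo-{i}.csv", "w") as f:
        f.write("CM-0,CM-1\n" + body)

print(consolidate("cell-demo-*.csv", "cells-demo.csv"))  # 2 data rows merged
```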
With the gathered cells consolidated into one single file, the clustering algorithm takes place.
It's important to notice that `complete.py` creates different file names for different parameters; this is useful for remembering which parameters were used when comparing results.
The script creates a prefix for the output files that identifies the Epsilon and the Force used to run the process.
The prefix has the format `ennnfd.dddd`, where `nnn` is the value of Epsilon and `d.dddd` is the value of Force. Example: if Epsilon = 35 and Force = 0.0777, the prefix will be `e035f0.0777`.
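The prefix construction can be expressed in one line (a sketch matching the format described above):

```python
def output_prefix(epsilon: int, force: float) -> str:
    """Build the ennnfd.dddd prefix used to tag the output files:
    Epsilon zero-padded to 3 digits, Force with 4 decimal places."""
    return f"e{epsilon:03d}f{force:.4f}"

print(output_prefix(35, 0.0777))  # e035f0.0777
```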
- Inputs
  - `<dataset>/central/consolidate/cells-<dataset>.csv`
  - `<dataset>/central/consolidate/points-<dataset>.csv`, if the `-p` option is used.
- Outputs (via the `complete.py` script)
  - `<dataset>/central/results/<prefix>-cells-<dataset>.csv`
  - `<dataset>/central/results/<prefix>-points-<dataset>.csv`, if the `-p` option is used.
  - `<dataset>/central/results/<prefix>-points-<dataset>.svg`. If the dimension is 2, it's possible to generate a plot file (SVG), which can be opened in any browser.
- Clustering parameters
  - `-e`: Epsilon parameter. The default value for Epsilon is 10.
  - `-f`: Force parameter. If the force between two cells is greater than or equal to this parameter, the cells are joined.
  - `-m`: Minimum Cells. If a cluster is formed by fewer cells than this parameter, the cluster is discarded.
- Plotting parameters, used to configure the SVG output:
  - `-r`: Don't draw the cells (rectangles).
  - `-g`: Don't draw the edges that link the cells forming the clusters.
  - `-p`: Draw the points. Used to compare the raw data with the clustering results.
  - `-b`: Draw numbers: the labels of the clusters inside the cells, and the ground-truth labels (the classification column of the raw data) inside the points.
- Prefix parameter
  - `-x`: This option can be used when testing the Epsilon and Force parameters, to avoid generating a bunch of files for each group of parameters tested.
The output files are in CSV format.

Cells file:

Field | Description
---|---
cell-id | Sequential number
number-points | Same as cell-id
CM-0 … CM-n | Coordinates of the center of mass
qty-cells-cluster | Quantity of cells in the cluster
gGluster-label | Label of the cluster defined by gCluster
ground-truth-cell-label | Label of the ground-truth cluster (class column). The ground truth of a cell is determined by the class of the point closest to its center of mass.
Points file:

Field | Description
---|---
Coord-0 … Coord-n | Coordinates of the point
gGluster-label | Label of the cluster defined by gCluster
ground-truth-label | Label of the ground truth (class column)
The script `DBSCAN.py` runs the DBSCAN clustering algorithm to compare results.
It's possible to run DBSCAN over the points (raw data) or over the centers of mass of the cells generated by the `sumData` program.
The goal is to compare the gCluster algorithm with DBSCAN in two situations:

- Run DBSCAN over all the data, to compare the centralized and distributed approaches.
- Run DBSCAN over the cells' centers of mass, to verify whether a conventional, mature algorithm performs well over summarized data.
Option | Description
---|---
-h | Show this help
-d <dir> | Directory of files
-pr <pre> | Prefix of files (e<epsilon>f<force (with 4 decimals)> - Ex. e014f0.1500)
-t <opt> | <opt> = c or p (for cells or points respectively)
-e <value> | Epsilon value
-m <value> | Min points value
-l | Print legend
-x | Don't create files with prefix
- Options
  - `-d`: the `<dataset>` directory.
  - `-t`: type of data: c for cells (centers of mass of the cells) or p for points (raw data).
  - `-pr`: prefix of the files. This is the best way to ensure you are comparing correctly; the script uses this prefix to find the input file.
- DBSCAN parameters
  - `-e`: Epsilon value. It's important to notice that, due to the normalization, the distance between the minimum and maximum values of every dimension is one; this is a good reference for choosing a good value of Epsilon.
- Inputs
  - `<dataset>/central/results/<prefix>-cells-<dataset>.csv` for cells (option `-t c`)
  - `<dataset>/central/results/<prefix>-points-<dataset>.csv` for points (option `-t p`)
- Outputs
  - `<dataset>/central/DBSCAN/<prefix>-cells-DBSCAN-<dataset>.csv` for cells (option `-t c`)
  - `<dataset>/central/DBSCAN/<prefix>-points-DBSCAN-<dataset>.csv` for points (option `-t p`)
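For reference, the DBSCAN procedure itself can be sketched in a few lines. This is a minimal textbook version written for illustration, not the actual `DBSCAN.py` code:

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Minimal textbook DBSCAN. Returns one cluster label per point;
    -1 means noise. Not the project's DBSCAN.py implementation."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise (may later be claimed as a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:  # core point: expand the cluster
                queue.extend(more)
    return labels

# Three close points form one cluster; the far point is noise.
print(dbscan([(0, 0), (0.1, 0), (0.2, 0.1), (5, 5)], eps=0.3, min_pts=2))
# [0, 0, 0, -1]
```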
Field | Description
---|---
CM-0 … CM-n | Coordinates of the point or cell
gGluster-label | Label of the cluster defined by DBSCAN
ground-truth-label | Label of the ground-truth cluster (class column)
The script `validation.py` computes the Fowlkes–Mallows index to compare the result of an algorithm with its ground truth.
Each point has a ground-truth label and the label found by the algorithm under evaluation.
The script builds the set of all 2-combinations of the n points and classifies each pair as:

- ss (same/same): the two points belong to the same cluster in both the gCluster and ground-truth partitions
- sd (same/different): the two points belong to the same cluster in gCluster and to different clusters in the ground truth
- ds (different/same): the two points belong to different clusters in gCluster and to the same cluster in the ground truth
- dd (different/different): the points belong to different clusters in both partitions
The Fowlkes–Mallows index is then calculated as:

m1 = ss + sd (number of ss pairs plus number of sd pairs)
m2 = ss + ds (number of ss pairs plus number of ds pairs)
FM = ss / sqrt(m1 * m2)
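The pair counting and the index can be sketched directly from the definitions above, using the standard Fowlkes–Mallows denominators m1 = ss + sd and m2 = ss + ds:

```python
from itertools import combinations
from math import sqrt

def fowlkes_mallows(labels_a, labels_b):
    """Count the pair categories (ss, sd, ds, dd) over all 2-combinations
    of points and return (ss, sd, ds, dd, FM)."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            ss += 1
        elif same_a:
            sd += 1
        elif same_b:
            ds += 1
        else:
            dd += 1
    fm = ss / sqrt((ss + sd) * (ss + ds)) if ss else 0.0
    return ss, sd, ds, dd, fm

# Perfect agreement between the two labelings gives FM = 1.
print(fowlkes_mallows([1, 1, 2, 2], [5, 5, 6, 6]))  # (2, 0, 0, 4, 1.0)
```

Note that the index only depends on which points share a cluster, not on the label values themselves.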
Option | Description
---|---
-h | Show this help
-d <dir> | Directory of files
-m <file> | File with map of indexes
-t <opt> | <opt> = c or p (for cells or points respectively)
-pr <pre> | Prefix of files: for gCluster, e<epsilon (3 digits)>f<force (with decimals)> - Ex. e014f0.1500; for DBSCAN, e<epsilon (4 decimals)>m<minPts (with 3 digits)> - Ex. e0.1100m003
-b | Use this if you'll validate DBSCAN
- Options
  - `-d`: the `<dataset>` directory.
  - `-m <file>`: map file (see below).
  - `-t <opt>`: type of file, (c)ell or (p)oint.
  - `-pr <pre>`: prefix of the files; the same prefix used in the previous phases. The format differs between gCluster and DBSCAN; see the help screen above.
  - `-b`: indicates DBSCAN.
- Inputs
  - If `-t c` (type = cells, algo = gCluster): `<dataset>/central/results/<prefix>-cells-result-<dataset>.csv`
  - If `-t p` (type = points, algo = gCluster): `<dataset>/central/results/<prefix>-points-result-<dataset>.csv`
  - If `-t c -b` (type = cells, algo = DBSCAN): `<dataset>/central/results/<prefix>-cells-DBSCAN-<dataset>.csv`
  - If `-t p -b` (type = points, algo = DBSCAN): `<dataset>/central/results/<prefix>-points-DBSCAN-<dataset>.csv`
- Outputs
  - Shown on screen:
    - the values of ss, sd, ds, and dd
    - the value of the FM index
In this case the script loads the cells output by the gCluster algorithm (file `<dataset>/central/results/<prefix>-cells-result-<dataset>.csv`; see the file format above).
It compares the cluster labels generated by gCluster with the labels from the ground truth.
To find the ground-truth label of a cell, for each cell we choose the point closest to its center of mass.
In Figure 1, the ground-truth label of the cell is 22.

Figure 1: Cell label = 248, Ground Truth label = 22

In Figure 2, all the points from the raw data have the same label, but gCluster didn't join the two graphs (green and blue), probably due to the Force parameter chosen.
In the validation, however, the cells from clusters 20 and 13 will have the same ground truth.

Figure 2: Two different clusters found by gCluster with the same ground truth
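The closest-point rule can be sketched as follows (hypothetical helper names, written for illustration):

```python
from math import dist

def cell_ground_truth(center_of_mass, points, labels):
    """Ground-truth label of a cell: the class of the raw point closest
    to the cell's center of mass (the rule described above)."""
    i = min(range(len(points)), key=lambda k: dist(points[k], center_of_mass))
    return labels[i]

# The point at (0.4, 0.6) is nearer to the center of mass, so its label wins.
print(cell_ground_truth((0.5, 0.5), [(0.4, 0.6), (2, 2)], [22, 7]))  # 22
```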
In this case the script loads the points generated by the gCluster algorithm (file `<dataset>/central/results/<prefix>-points-result-<dataset>.csv`; see the file format above).
When type = points (`-t p`) is used, the script simulates having all the points in the central node, in order to compare the same clustering algorithm running over the summarized data and over the raw data.
The idea is subtly different from type = cells: to determine the cluster the algorithm found for a point, it simply takes the cluster label of the cell the point belongs to.
Examples:

- Figure 1: the points with ground-truth label 22 will be labeled 43 if they fall inside cells with label 43, and -1 if they fall in cells that don't belong to any cluster.
- Figure 2: using the same idea, points with ground-truth label 2 will receive gCluster labels 20, 13, or -1, depending on the cell they fall into.
In this case the script loads the cells output by the DBSCAN algorithm (file `<dataset>/central/results/<prefix>-cells-DBSCAN-<dataset>.csv`; see the file format above).
It compares the cluster labels generated by DBSCAN with the labels from the ground truth.
To find the ground-truth label of a cell, for each center of mass the algorithm looks for the closest point at a distance less than or equal to minPts. If there is no point close enough, the center of mass receives the label -1.
In this case the script loads the points generated by the DBSCAN algorithm (file `<dataset>/central/results/<prefix>-points-DBSCAN-<dataset>.csv`; see the file format above).
When type = points is used, the script simulates having all the points in the central node, in order to compare the same clustering algorithm running over the summarized data and over the raw data.
As you can see in the figures above, the labels are assigned automatically by the algorithms. For the validation to work, the labels must match, because the script uses them to determine whether two points belong to the same cluster.
So it's necessary to create a map file that relates the labels created by the algorithms to the labels provided by the ground-truth file.
As the labels can change depending on the algorithms' parameters, a map file is needed for each set of parameters; the file name therefore uses the prefix to identify the parameters used by the algorithms.
The map file is in CSV format, and `validation.py` expects the following names:

- `<dataset>/<prefix>-map-<dataset>.csv` for gCluster tests
- `<dataset>/<prefix>-DBSCAN-<dataset>.csv` for DBSCAN tests
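The exact column layout of the map file is not described here, so the sketch below assumes one `<algorithm-label>,<ground-truth-label>` pair per line; the real format expected by `validation.py` may differ:

```python
import csv

# Hypothetical map file contents, reusing the labels from Figures 1 and 2:
# gCluster labels 43, 20, and 13 map to ground-truth labels 22 and 2.
with open("e010f0.1500-map-demo.csv", "w") as f:
    f.write("43,22\n20,2\n13,2\n")

# Load the mapping as a dict: algorithm label -> ground-truth label.
with open("e010f0.1500-map-demo.csv", newline="") as f:
    label_map = {algo: truth for algo, truth in csv.reader(f)}

print(label_map["20"])  # 2
```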