## Implementation

* The data is stored in the .CSV text format
* The algorithm is implemented in Java 8
* A description of the most important methods is available in the JavaDoc appended to the report and in the code comments
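
The column layout of the us-cities .csv file is not described in this section, so the following is only a minimal sketch of how such a file might be loaded into points, assuming hypothetical columns `city,state,latitude,longitude`; the class and field names are illustrative and not taken from the actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical representation of one data point; the real project may use a different class.
class CityPoint {
    final String city;
    final String state;
    final double latitude;
    final double longitude;

    CityPoint(String city, String state, double latitude, double longitude) {
        this.city = city;
        this.state = state;
        this.latitude = latitude;
        this.longitude = longitude;
    }
}

class CsvLoader {
    // Reads a .csv file with the assumed layout: city,state,latitude,longitude (header in the first line).
    static List<CityPoint> load(String path) throws IOException {
        return Files.lines(Paths.get(path))
                .skip(1)                        // skip the header line
                .map(line -> line.split(","))
                .map(cols -> new CityPoint(cols[0], cols[1],
                        Double.parseDouble(cols[2]), Double.parseDouble(cols[3])))
                .collect(Collectors.toList());
    }
}
```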
## Input data

* Data file
* k - number of neighbors for the kNN algorithm - step 1 of the algorithm (desired number of groups)
* m - expected number of groups
* n - number of subgroups - step 2 of the algorithm
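
How these parameters are actually passed to the program is not specified here, so the snippet below is only a hypothetical sketch of an entry point that reads them from the command line in the order listed above (data file, k, m, n); the class name and argument order are assumptions.

```java
public class ClusteringRunner {
    // Hypothetical entry point: the arguments are assumed to be passed in the
    // order listed above - data file path, k, m, n.
    public static void main(String[] args) {
        if (args.length != 4) {
            System.err.println("Usage: java ClusteringRunner <dataFile> <k> <m> <n>");
            System.exit(1);
        }
        String dataFile = args[0];
        int k = Integer.parseInt(args[1]);  // neighbors for the kNN step (step 1)
        int m = Integer.parseInt(args[2]);  // expected number of groups
        int n = Integer.parseInt(args[3]);  // number of subgroups (step 2)

        System.out.printf("Running with file=%s, k=%d, m=%d, n=%d%n", dataFile, k, m, n);
        // ... load the data and run the clustering here ...
    }
}
```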
## Testing the algorithm

Testing the algorithm consists of calculating accuracy and purity both for a single group and for the whole set.

Accuracy is calculated as the ratio of the number of correctly grouped points (according to their original membership in the states) to the total number of points.

The purity of a group is calculated as the number of different states present in that group.

A state is counted as present in a group when at least three points in the group belong to that state.
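
As an illustration of the evaluation described above, the sketch below computes the accuracy over all points and the purity of a single group with the three-point threshold; it assumes that each resulting group has already been mapped to a state label, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Evaluation {
    // Accuracy: points whose assigned state matches their original state,
    // divided by the total number of points.
    static double accuracy(List<String> originalStates, List<String> assignedStates) {
        int correct = 0;
        for (int i = 0; i < originalStates.size(); i++) {
            if (originalStates.get(i).equals(assignedStates.get(i))) {
                correct++;
            }
        }
        return (double) correct / originalStates.size();
    }

    // Purity of one group: the number of different states that are "present",
    // i.e. represented by at least three points in the group.
    static long purity(List<String> statesOfPointsInGroup) {
        Map<String, Integer> counts = new HashMap<>();
        for (String state : statesOfPointsInGroup) {
            counts.merge(state, 1, Integer::sum);
        }
        return counts.values().stream().filter(c -> c >= 3).count();
    }
}
```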

The influence of the number of nearest neighbors (the first step of the algorithm) and of the number of subgroups (the second step of the algorithm) on the accuracy of the grouping and on the purity of the groups will be investigated.
## Conclusions

According to the conclusions presented in the article, the algorithm copes well with grouping clusters that lie far apart and carries out the classification very accurately. It also performs very well on sets of points that lie close to each other.

The biggest difficulty is selecting the grouping parameters for the particular data set on which the algorithm is run.

For small sets it must be taken into account that setting the k parameter, which controls the first phase of the algorithm, too low will create many small, disorganized clusters, while setting it too high will produce clusters with too many points.

The tests show that the best accuracy was obtained when the k parameter was equal to the originally predicted number of groups. Choosing k too small has a much greater impact on the later propagation of errors than choosing it too large.

The initial number of groups, i.e. the parameter responsible for the second phase of the algorithm, significantly affects the quality of the RI and RC metrics calculated for each cluster. The smaller the value of this parameter, the more points each group contains; the larger it is, the smaller the groups that enter the final merge phase of the algorithm. Clusters with too few or too many points may lead to incorrectly calculated metrics, and thus to clusters being merged into the wrong groups.
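
For reference, the RI and RC metrics mentioned above are usually defined as follows in the Chameleon clustering literature; whether this implementation uses exactly these formulas is an assumption, since they are not reproduced in this README.

```latex
% Standard Chameleon-style definitions (assumed, not quoted from this repository).
% EC_{C_i,C_j}: sum of weights of the edges connecting clusters C_i and C_j
% EC_{C_i}:     weighted edge cut that bisects cluster C_i
% \bar{S}_{EC}: average weight of the corresponding edges
RI(C_i, C_j) = \frac{\left|EC_{C_i,C_j}\right|}{\tfrac{1}{2}\left(\left|EC_{C_i}\right| + \left|EC_{C_j}\right|\right)}

RC(C_i, C_j) = \frac{\bar{S}_{EC_{C_i,C_j}}}
                    {\frac{|C_i|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_j}}}
```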

The number of points on which the algorithm operates also affects its behavior: to calculate the RC, RI and EC metrics correctly, the clusters must be large enough to determine the relationships between them, yet small enough that a group does not contain too many points that originally belonged to other clusters.

Taking the above into account, the algorithm works very well. The influence of the input parameters is consistent with the predictions and with the article describing how the algorithm operates.