CDSC_AL: A Clustering-based Data Stream Classification framework using Active Learning
The "Supplemental Result.pdf" file reports the comparison with semi-supervised methods using 5%, 15%, and 20% labeled data. It also reports the comparison between supervised methods and the CDSC-AL method with 5%, 15%, and 20% labeled data, respectively.
There are two Python scripts with different settings for the benchmark data streams:
- The main_final_draft.py file arranges the data streams to have abrupt drifts. Run this code on Synthetic-1, Synthetic-2, Sea, and Shuttle.
- The main_final_draft4.py file simulates data streams with gradual concept drift. Run this code on KDD cup 99, Forest covtype, Gas Sensor Drift, MNIST, and CIFAR-10.
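The difference between the two drift settings can be illustrated with a short sketch. This is not the code in main_final_draft.py or main_final_draft4.py; the function name `drift_schedule` and its parameters are hypothetical, and the linear transition window is only one common way to model gradual drift.

```python
import numpy as np


def drift_schedule(n_samples, drift_start, width, seed=0):
    """Return a 0/1 array marking which concept each stream position uses.

    0 = old concept, 1 = new concept (hypothetical helper for illustration).
    width == 0 -> abrupt drift: the switch happens at `drift_start`.
    width > 0  -> gradual drift: the probability of drawing from the new
                  concept rises linearly over `width` positions.
    """
    rng = np.random.default_rng(seed)
    positions = np.arange(n_samples)
    if width == 0:
        prob_new = (positions >= drift_start).astype(float)
    else:
        prob_new = np.clip((positions - drift_start) / width, 0.0, 1.0)
    # A position uses the new concept when its random draw falls below prob_new.
    return (rng.random(n_samples) < prob_new).astype(int)


abrupt = drift_schedule(n_samples=10, drift_start=5, width=0)
gradual = drift_schedule(n_samples=100, drift_start=40, width=20)
```

With `width=0` the stream switches concepts in a single step, while a positive `width` mixes samples from both concepts inside the transition window.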
The two synthetic datasets (Synthetic-1 and Synthetic-2) were generated by the authors and are therefore included here. The remaining seven datasets can be found at the following links:
- http://users.rowan.edu/~polikar/nse.html
To run main_final_draft.py or main_final_draft4.py with a different dataset, change the dataset name on line 17.
On line 11, the global variable label_ratio sets the proportion of labeled data in each incoming data chunk.
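As a rough illustration of what a label ratio means for one chunk, the sketch below picks which samples of a chunk receive labels. The helper `select_labeled_indices` and the value 0.1 are hypothetical, and uniform random selection is used only as a stand-in: CDSC-AL chooses samples through active learning, which this sketch does not reproduce.

```python
import numpy as np

label_ratio = 0.1  # example value only; the scripts define their own setting


def select_labeled_indices(chunk_size, ratio, seed=0):
    """Pick the positions in a chunk whose labels are revealed.

    Hypothetical helper: selects round(ratio * chunk_size) distinct
    positions uniformly at random (at least one), sorted ascending.
    """
    rng = np.random.default_rng(seed)
    n_labeled = max(1, int(round(ratio * chunk_size)))
    return np.sort(rng.choice(chunk_size, size=n_labeled, replace=False))


labeled_idx = select_labeled_indices(chunk_size=200, ratio=label_ratio)
```

For a chunk of 200 samples and a ratio of 0.1, 20 samples are labeled and the remaining 180 stay unlabeled.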
Two different evaluation metrics are used:
- BAcc1Hist: a vector of balanced classification accuracy values over the entire data stream
- F1Hist: a vector of macro-averaged F1-score values over the entire data stream
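The two metrics can be computed with scikit-learn as sketched below. This is an illustration, not the evaluation code from the scripts: the function `evaluate_stream`, the toy labels, and the assumption that one value is recorded per data chunk are all hypothetical.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score


def evaluate_stream(chunks):
    """Compute per-chunk balanced accuracy and macro-averaged F1.

    `chunks` is an iterable of (y_true, y_pred) pairs, one pair per chunk
    (an assumption made for this sketch).
    """
    bacc_hist, f1_hist = [], []
    for y_true, y_pred in chunks:
        bacc_hist.append(balanced_accuracy_score(y_true, y_pred))
        f1_hist.append(f1_score(y_true, y_pred, average="macro"))
    return np.array(bacc_hist), np.array(f1_hist)


# Toy example: two chunks of four samples each.
toy_chunks = [
    ([0, 0, 1, 1], [0, 0, 1, 1]),  # every prediction is correct
    ([0, 0, 1, 1], [0, 1, 1, 1]),  # one class-0 sample is misclassified
]
bacc_hist, f1_hist = evaluate_stream(toy_chunks)
```

Balanced accuracy averages the recall of each class, and the macro-averaged F1-score averages the per-class F1 values, so both give minority classes the same weight as majority classes.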
The code requires the following Python packages:
- NumPy
- pandas
- scikit-learn
- SciPy
For any use of this project, please refer to the following article:
- Yan, Xuyang and Homaifar, Abdollah and Sarkar, Mrinmoy and Girma, Abenezer and Tunstel, Edward. "A Clustering-based framework for Classifying Data Streams." In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI2021).