Lisa Hoek - Radboud University, Nijmegen
This is the GitHub repository for my Computing Science Bachelor's thesis.
It contains the following notebooks and Python programs:
- ObtainDMOZdata.ipynb: Google Colab notebook that reads the DMOZ content.rdf.u8 file and stores it in a dataframe
- Lisa_Thesis0_join.json: Zeppelin notebook that retrieves the DMOZ and CC data and joins them
- Lisa_Thesis1_getWebContent.json: Zeppelin notebook that retrieves the web page content for each URL
- Lisa_Thesis2_classifier(I,II,III).json: Zeppelin notebooks, each covering the selection of the data and its labels, and the training, testing and evaluation of the classifier
- plotMatrix-I,II,III: Python programs that visualise the confusion matrices
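
For readers unfamiliar with confusion matrices, the following is a minimal sketch of how one can be built from true and predicted labels before it is plotted. It is not the code from the plotMatrix programs; the function names, class names and labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Count (true, predicted) pairs; rows are true classes, columns are predictions."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def row_normalise(matrix):
    """Divide each row by its total, so each cell is a fraction of that true class."""
    normalised = []
    for row in matrix:
        total = sum(row)
        normalised.append([cell / total if total else 0.0 for cell in row])
    return normalised


# Hypothetical labels, not taken from the thesis data.
classes = ["Arts", "Science", "Sports"]
y_true = ["Arts", "Arts", "Science", "Sports", "Sports", "Science"]
y_pred = ["Arts", "Science", "Science", "Sports", "Arts", "Science"]

matrix = confusion_matrix(y_true, y_pred, classes)
for label, row in zip(classes, row_normalise(matrix)):
    print(label, row)
```

Row-normalising makes classes of different sizes comparable, since each row then shows how the examples of one true class were distributed over the predicted classes.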