Towards Semi-Automatic Embedded Data Type Inferencing is a project that expands on the third phase of the existing ML Data Prep Zoo project. Our work uses machine learning models to classify contextual data that requires further extraction. We created our own dictionary of labels for the data, as follows:
- Numbers
- List
- Datetime
- Sentence
- URL
- Custom Object
The current dataset contains ~541 datapoints, with our best accuracy (~86%) coming from Random Forest. All labelling helper scripts are written in Python, and all model testing is done in Jupyter Notebooks. Our models are built with the Python library scikit-learn and are validated using k-fold cross validation.
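The sketch below shows the general shape of this setup: a scikit-learn Random Forest evaluated with k-fold cross validation. The feature and label column names are placeholders, not the repository's actual schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("labelled_data.csv")      # output of the labelling CLI (see below)
X = df.drop(columns=["label"])             # assumed feature columns
y = df["label"]                            # assumed label column; one of the six labels above

clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross validation
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```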
NOTE: All scripts have been moved into the Preprocessing_Scripts folder; however, the paths to data are hardcoded. To execute the scripts without error, move the files outside of the folder. The following section gives a brief summary of each script and its usage. The order listed is the expected order of execution.
tobelabelled_list_creator.py
: Short script that creates the two CSVs needed for the project. It looks through our original data file (not included in this repository) for rows that have been labelled 'Usable with Extraction.' The two CSVs are:
- needs_extraction.csv : Contains the original base featurization data of each row.
- record_ids.csv : Contains the Record_id and Attribute_name of each row.
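A rough sketch of what this step does, assuming the original data file has a label column (the file name and column name here are guesses):

```python
import pandas as pd

df = pd.read_csv("original_data.csv")                    # not included in this repository
extract = df[df["label"] == "Usable with Extraction"]    # assumed label column name

extract.to_csv("needs_extraction.csv", index=False)      # base featurization data per row
extract[["Record_id", "Attribute_name"]].to_csv("record_ids.csv", index=False)
```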
labeller_cli_tool.py
: CLI script that assisted the manual labelling process. The script displays meaningful features of a row, asks the user for the appropriate label and the specific reason (also part of a list detailed in our technical report), and documents these results in labelled_data.csv.
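A simplified sketch of that labelling loop; the prompts, column names, and output schema here are illustrative, not the script's exact interface:

```python
import csv
import pandas as pd

LABELS = ["Numbers", "List", "Datetime", "Sentence", "URL", "Custom Object"]

rows = pd.read_csv("needs_extraction.csv")
with open("labelled_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Record_id", "label", "reason"])    # assumed output columns
    for _, row in rows.iterrows():
        print(row)                                       # display the row's features
        label = input(f"Label {LABELS}: ")
        reason = input("Reason: ")
        writer.writerow([row["Record_id"], label, reason])
```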
rulebased_auto_labeller.py
: The rule-based approach we created. A small portion depends on the pandas.Timestamp object for recognizing Datetime rows. Details about our rule-based approach can be found in our technical report. The output CSV for the rule-based approach is rulebook_labelled.csv.
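A minimal sketch of the pandas.Timestamp dependency mentioned above: a value is treated as Datetime if pandas can parse it into a Timestamp. This illustrates the idea only, not the full rulebook.

```python
import pandas as pd

def looks_like_datetime(value: str) -> bool:
    try:
        pd.Timestamp(value)   # raises ValueError/TypeError if unparseable
        return True
    except (ValueError, TypeError):
        return False

print(looks_like_datetime("2021-03-14 09:26:53"))  # True
print(looks_like_datetime("not a date"))           # False
```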
add_features.py
: This script was made after our initial featurization in order to test new features that may improve our models. The output CSV is labelled_added.csv. It adds features for stopword_total, whitespace_count, char_count, delim_count, has_url, has_date, and has_email.
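A hedged sketch of how features like these could be computed per value; the actual definitions in add_features.py (e.g. the stopword list, delimiter set, and regexes) may differ.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}   # assumed stopword list
DELIMS = ",;|"                                                  # assumed delimiter set

def featurize(value: str) -> dict:
    return {
        "stopword_total": sum(w.lower() in STOPWORDS for w in value.split()),
        "whitespace_count": sum(c.isspace() for c in value),
        "char_count": len(value),
        "delim_count": sum(value.count(d) for d in DELIMS),
        "has_url": int(bool(re.search(r"https?://\S+", value))),
        "has_date": int(bool(re.search(r"\d{1,4}[-/]\d{1,2}[-/]\d{1,4}", value))),
        "has_email": int(bool(re.search(r"\S+@\S+\.\S+", value))),
    }

print(featurize("Contact me at foo@bar.com or visit https://example.com"))
```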
NOTE: All notebooks have been moved into the Jupyter_Notebooks folder; however, the paths to data are hardcoded. To execute the notebooks without error, move the files outside of the folder.
Model Comparison.ipynb
: Primary notebook for viewing the best performing models in a single notebook.
Feature Testing.ipynb
: Extra notebook for a closer view of a model's misclassifications.
knn.ipynb, Logistic Regression.ipynb, RBF-SVM.ipynb, RandomForest.ipynb
: Notebooks for the individual models. Included are ablation results, k-fold cross validation results, and the best performing model with its predictions.
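A sketch of the side-by-side comparison these notebooks perform, using the four model families named above; the hyperparameters here are defaults, not the tuned values from the notebooks, and the label column name is assumed.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("labelled_added.csv")   # output of add_features.py
X = df.drop(columns=["label"])           # assumed label column name
y = df["label"]

models = {
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RBF-SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Random Forest": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # k-fold cross validation per model
    print(f"{name}: {scores.mean():.3f}")
```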