- cross_validation.py: Module containing the implementation of hyperparameter validation and of classifier training and scoring
- feature_extraction.py: Module containing implementations of the procedures required for extracting the different types of features from the processed tweet texts
- feature_selection.py: Module containing implementations of the feature normalization and selection procedures
- plotting.py: Script containing the code to generate the plots
- preprocessing.py: Module containing helper functions for performing preprocessing on the tweet texts
- resources.py: Module containing resources reused in multiple scripts, for efficient importing
- serialization.py: Module containing helper functions for (de)serializing and (de)compressing Python objects from/to local storage
- training.py: Training script
  - Performing preprocessing and feature extraction on the training set of tweets
  - Performing feature normalization and training a feature selection model using the training feature vectors
  - Performing cross-validation on the classifiers to fit the hyperparameters and find the best model
- run.py: Main script
  - Performing preprocessing and feature extraction on the testing set of tweets
  - Performing normalization and feature selection using the pre-fitted transformers
  - Generating the baseline submission using the trained best classifier's predictions
- files/: Folder with intermediate datasets, serialized transformers and models
- results/: Folder with final scores, plots and submission file
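The two-script flow described above (feature extraction, normalization and selection, then cross-validated classifier training and prediction) can be sketched with scikit-learn. The toy tweets, TF-IDF features and logistic regression grid below are illustrative stand-ins, not the project's actual configuration:

```python
# Illustrative sketch of the training.py -> run.py flow.
# Toy data and models; not the project's real features or classifiers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

train_tweets = ["good happy day", "fun happy times",
                "bad sad luck", "gloomy sad day"]
train_labels = [1, 1, 0, 0]

# training.py side: feature extraction, selection, classifier, chained
pipe = Pipeline([
    ("features", TfidfVectorizer()),      # feature extraction
    ("select", SelectKBest(chi2, k=3)),   # feature selection
    ("clf", LogisticRegression()),        # classifier
])

# Hyperparameter fitting via cross-validation (cv=2 only for the tiny toy set)
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0]}, cv=2)
search.fit(train_tweets, train_labels)

# run.py side: reuse the fitted transformers and best model on unseen tweets
test_tweets = ["happy fun day", "sad gloomy luck"]
predictions = search.predict(test_tweets)
print(predictions)
```

Bundling the transformers and classifier in one Pipeline means the same fitted objects are applied to the test set, mirroring how run.py reuses the pre-fitted transformers.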
- That Python 3.7 is installed
- That the data is downloaded and extracted as follows:

  ├── data
  │   ├── train_neg.txt
  │   ├── train_pos.txt
  │   ├── test_data.txt

- That at least these files are present in the files/ folder (as it is now):

  ├── files
  │   ├── best_model.gz
  │   ├── test_dataset_reduced.tsv
Notes on dependencies
- pandas: Data structures and analysis tools. Used for convenient representation of tabular data and results.
- seaborn: Python visualization library based on matplotlib. Used for plotting.
- nltk: NLP framework. Used for tweet preprocessing and feature extraction tasks.
- scikit-learn: Machine learning framework. Used for feature selection, cross-validation, training and evaluation of the standard machine learning classifiers.
- empath: Tool for analyzing text across lexical categories. Used for feature extraction.
- _pickle and compress_pickle: Used for Python object serialization and compression to local storage.
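The serialize-and-compress pattern that serialization.py provides helpers for can be sketched with the standard library's gzip and pickle modules (compress_pickle wraps the same idea); the object and path below are illustrative stand-ins:

```python
# Sketch of a serialize-and-compress round trip to local storage.
import gzip
import os
import pickle
import tempfile

obj = {"classifier": "logreg", "C": 1.0}  # stand-in for a fitted model

path = os.path.join(tempfile.gettempdir(), "demo_model.gz")
with gzip.open(path, "wb") as f:
    pickle.dump(obj, f)        # serialize, then compress to .gz
with gzip.open(path, "rb") as f:
    restored = pickle.load(f)  # decompress, then deserialize

print(restored == obj)
```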
From the root folder of the project:

In order to directly generate the classic ML submission again (it is saved at results/submission.csv):

    python3 run.py

In order to perform the feature extraction and selection, cross-validation and classifier training on the train set again:

    python3 training.py

In order to perform the feature extraction and selection on the test set again, delete the file test_dataset_reduced.tsv currently present in the files/ folder and run again:

    python3 run.py
- Louis Amaudruz
- Andrej Janchevski
- Timoté Vaucher