This repository contains the files for a research project in which I experimented with deep learning on the toxic comment classification data set.
The data set can be found at https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge.
The GloVe word embedding files used can be found at https://nlp.stanford.edu/projects/glove/. I specifically used http://nlp.stanford.edu/data/glove.840B.300d.zip.
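If you want to load the embeddings yourself, here is a minimal sketch of reading the vectors into a dictionary. It assumes the zip above has been extracted to glove.840B.300d.txt next to the script:

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors into a dict mapping word -> 300-d numpy array."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Each line is a token followed by 300 space-separated floats.
            # Some tokens in the 840B vocabulary contain spaces, so split
            # off the trailing 300 values instead of the leading token.
            parts = line.rstrip().rsplit(" ", 300)
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

# embeddings = load_glove("glove.840B.300d.txt")
```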
I tested the following machine learning and deep learning algorithms on the data:
- Logistic Regression (without embeddings)
- Long Short-Term Memory networks (LSTMs)
- Gated Recurrent Units (GRUs)
- Convolutional Neural Networks (CNNs)
- LSTM + CNNs (best performance; see the sketch after this list)
- GRU + CNNs
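As a rough illustration, the LSTM + CNN hybrid can be sketched in Keras as follows. The layer sizes and exact topology here are assumptions for illustration; the real architecture is defined in train.py.

```python
from keras.models import Model
from keras.layers import (Input, Embedding, LSTM, Conv1D,
                          GlobalMaxPooling1D, Dense)

max_features = 6000  # vocabulary size (the --max_features default)
max_length = 200     # padded comment length (the --max_length default)
embed_size = 300     # GloVe 840B vectors are 300-dimensional

inp = Input(shape=(max_length,))
# In practice the Embedding layer would be initialized with a GloVe
# weight matrix; random initialization is shown here for brevity.
x = Embedding(max_features, embed_size)(inp)
x = LSTM(64, return_sequences=True)(x)               # sequence encoder
x = Conv1D(64, kernel_size=3, activation="relu")(x)  # local n-gram features
x = GlobalMaxPooling1D()(x)                          # strongest feature per filter
out = Dense(6, activation="sigmoid")(x)              # 6 toxicity labels (multi-label)

model = Model(inputs=inp, outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```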
You should have Python 3 (preferably 3.6) installed. Install the dependencies with:

```
pip3 install numpy
pip3 install pandas
pip3 install tensorflow-gpu
pip3 install keras
```

Optionally, install Jupyter (https://jupyter.org) if you want to run the IPython notebooks:

```
pip3 install jupyter
```
After installing the dependencies, you can run the training script. If the data and GloVe files are in the same folder as train.py, simply use:

```
python3 train.py
```
The full usage is:

```
python3 train.py [-h] [--train-path TRAIN_PATH] [--test-path TEST_PATH]
                 [-m MODEL] [-g GLOVE_PATH] [--embed-size EMBED_SIZE]
                 [-b BATCH_SIZE] [--epochs EPOCHS] [--max_length MAX_LENGTH]
                 [--max_features MAX_FEATURES] [--save-weights SAVE_WEIGHTS]

  -h, --help            shows the help message
  --train-path TRAIN_PATH
                        Path to the training data
  --test-path TEST_PATH
                        Path to the testing data
  -m MODEL, --model MODEL
                        Specify the model to use
  -g GLOVE_PATH, --glove-path GLOVE_PATH
                        Path to the GloVe embeddings
  --embed-size EMBED_SIZE
                        Embedding size
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Specify the batch size (default 32)
  --epochs EPOCHS       Specify the number of epochs (default 1)
  --max_length MAX_LENGTH
                        Maximum length of the input text used in training (default 200)
  --max_features MAX_FEATURES
                        Maximum number of features to use in training (default 6000)
  --save-weights SAVE_WEIGHTS
                        Save the trained weights
```
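For example, a full invocation might look like this. The file names and the model identifier lstm_cnn below are illustrative; check train.py for the exact model names it accepts:

```
python3 train.py --train-path train.csv --test-path test.csv \
    -g glove.840B.300d.txt -m lstm_cnn -b 64 --epochs 2 \
    --save-weights weights.h5
```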