This repository contains the files for a research project in which I experimented with deep learning on the toxic comment classification data set.
The data set can be found at https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge.
The GloVe word embedding files used can be found at https://nlp.stanford.edu/projects/glove/. I specifically used http://nlp.stanford.edu/data/glove.840B.300d.zip.
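If you want to load the embeddings yourself, here is a minimal sketch of reading the vectors into a dictionary. It assumes the zip above has been extracted to glove.840B.300d.txt next to the script:

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors into a dict mapping word -> 300-d numpy array."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Each line is a token followed by 300 space-separated floats.
            # Some tokens in the 840B vocabulary contain spaces, so split
            # off the trailing 300 values instead of the leading token.
            parts = line.rstrip().rsplit(" ", 300)
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

# embeddings = load_glove("glove.840B.300d.txt")
```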
I tested the following machine learning and deep learning algorithms on the data:
- Logistic Regression (without embeddings)
- Long Short-Term Memory networks (LSTMs)
- Gated Recurrent Units (GRUs)
- Convolutional Neural Networks (CNNs)
- LSTM + CNNs (best performance; see the sketch after this list)
- GRU + CNNs
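As a rough illustration, the LSTM + CNN hybrid can be sketched in Keras as follows. The layer sizes and exact topology here are assumptions for illustration; the real architecture is defined in train.py.

```python
from keras.models import Model
from keras.layers import (Input, Embedding, LSTM, Conv1D,
                          GlobalMaxPooling1D, Dense)

max_features = 6000  # vocabulary size (the --max_features default)
max_length = 200     # padded comment length (the --max_length default)
embed_size = 300     # GloVe 840B vectors are 300-dimensional

inp = Input(shape=(max_length,))
# In practice the Embedding layer would be initialized with a GloVe
# weight matrix; random initialization is shown here for brevity.
x = Embedding(max_features, embed_size)(inp)
x = LSTM(64, return_sequences=True)(x)               # sequence encoder
x = Conv1D(64, kernel_size=3, activation="relu")(x)  # local n-gram features
x = GlobalMaxPooling1D()(x)                          # strongest feature per filter
out = Dense(6, activation="sigmoid")(x)              # 6 toxicity labels (multi-label)

model = Model(inputs=inp, outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```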
You should have Python 3 (preferably 3.6) installed. Install the dependencies with:

```
pip3 install numpy
pip3 install pandas
pip3 install tensorflow-gpu
pip3 install keras
```

Optionally, install Jupyter (https://jupyter.org) if you want to run the IPython notebooks:

```
pip3 install jupyter
```
After installing the dependencies, you can run the training script. If the data and GloVe files are in the same folder as train.py, simply use:

```
python3 train.py
```
The full usage is:

```
python3 train.py [-h] [--train-path TRAIN_PATH] [--test-path TEST_PATH]
                 [-m MODEL] [-g GLOVE_PATH] [--embed-size EMBED_SIZE]
                 [-b BATCH_SIZE] [--epochs EPOCHS] [--max_length MAX_LENGTH]
                 [--max_features MAX_FEATURES] [--save-weights SAVE_WEIGHTS]

  -h, --help            shows the help message
  --train-path TRAIN_PATH
                        Path to the training data
  --test-path TEST_PATH
                        Path to the testing data
  -m MODEL, --model MODEL
                        Specify the model to use
  -g GLOVE_PATH, --glove-path GLOVE_PATH
                        Path to the GloVe embeddings
  --embed-size EMBED_SIZE
                        Embedding size
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Specify the batch size (default 32)
  --epochs EPOCHS       Specify the number of epochs (default 1)
  --max_length MAX_LENGTH
                        Maximum length of the input text used in training (default 200)
  --max_features MAX_FEATURES
                        Maximum number of features to use in training (default 6000)
  --save-weights SAVE_WEIGHTS
                        Save the trained weights
```
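For example, a full invocation might look like this. The file names and the model identifier lstm_cnn below are illustrative; check train.py for the exact model names it accepts:

```
python3 train.py --train-path train.csv --test-path test.csv \
    -g glove.840B.300d.txt -m lstm_cnn -b 64 --epochs 2 \
    --save-weights weights.h5
```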