
Toxic-Comment-Classification

The challenge is to build a multi-headed model capable of detecting different types of toxicity, such as threats, obscenity, insults, and identity-based hate. The data comprises comments from Wikipedia's talk page edits.

Approaches

  • Two approaches were used:
    • Basic logistic regression on word- and character-level n-grams (a brief sketch follows this list).
    • A deep-learning model: a bidirectional LSTM with GloVe embeddings.
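For reference, a minimal sketch of the first approach: TF-IDF features over word and character n-grams feeding one LogisticRegression per label. The file paths, vectorizer settings, and regularization strength are assumptions, not taken from this repository; the column and label names are the standard ones from the Jigsaw dataset.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")  # placeholder paths
test = pd.read_csv("test.csv")
labels = ["toxic", "severe_toxic", "obscene", "threat",
          "insult", "identity_hate"]  # the six Jigsaw labels

train_text = train["comment_text"].fillna("_na_")
test_text = test["comment_text"].fillna("_na_")

# Word and character n-gram TF-IDF features (settings are assumptions)
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                           max_features=20000, sublinear_tf=True)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 6),
                           max_features=50000, sublinear_tf=True)
all_text = pd.concat([train_text, test_text])
word_vec.fit(all_text)
char_vec.fit(all_text)

X_train = hstack([word_vec.transform(train_text),
                  char_vec.transform(train_text)]).tocsr()

# "Multi-headed" here means one independent binary classifier per label
models = {}
for label in labels:
    clf = LogisticRegression(C=1.0, solver="liblinear")
    clf.fit(X_train, train[label])
    models[label] = clf
```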

I will explain the second approach below, as it gave me the best results and was something new to try.

Approach

  • Preprocess the data:
    • Replace null values in the text column, e.g. with fillna as done in the code.
  • Choose the number of features, the maximum sequence length, and the embedding:
    • max_features = 100000, maxlen = 150, embedding = GloVe.
  • Use a Tokenizer to tokenize the text, convert it to sequences, and pad each comment to maxlen = 150 tokens (these preprocessing steps are sketched after this list).
  • Use the GloVe embeddings to build an embedding matrix for the in-vocabulary words.
  • Create the model (sketched after this list):
    • Take an input of shape = maxlen.
    • Add an Embedding layer of max_features words.
    • Apply SpatialDropout1D.
    • Add a Bidirectional LSTM with 128 units.
    • Add a Conv1D layer, reducing the feature maps to 64.
    • Apply GlobalAveragePooling1D and GlobalMaxPooling1D and concatenate the results.
    • Apply Dropout.
    • Add a Dense output layer with sigmoid activation.
  • Compile the model with binary_crossentropy loss.
  • Fit the model for 5 epochs (it converges after that).
  • Train with k-fold cross-validation, k = 4 (see the training sketch after this list).
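A minimal sketch of the preprocessing and embedding steps above, assuming the Keras 2.x Tokenizer/pad_sequences API; the file paths and the 300-d GloVe vector size are assumptions.

```python
import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_features = 100000  # vocabulary size
maxlen = 150           # pad/truncate every comment to 150 tokens
embed_size = 300       # GloVe vector size (assumed)

train = pd.read_csv("train.csv")  # placeholder path
train["comment_text"] = train["comment_text"].fillna("_na_")  # remove nulls

tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(train["comment_text"])
X = pad_sequences(tokenizer.texts_to_sequences(train["comment_text"]),
                  maxlen=maxlen)

# Load GloVe vectors; slicing from the right tolerates the occasional
# multi-token key found in some GloVe files.
embeddings_index = {}
with open("glove.840B.300d.txt", encoding="utf8") as f:  # placeholder file
    for line in f:
        values = line.rstrip().split(" ")
        word = " ".join(values[:-embed_size])
        embeddings_index[word] = np.asarray(values[-embed_size:],
                                            dtype="float32")

# Embedding matrix: row i holds the GloVe vector of word index i.
embedding_matrix = np.zeros((max_features, embed_size))
for word, i in tokenizer.word_index.items():
    if i < max_features and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]
```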
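A sketch of the architecture listed above, using the Keras functional API. The layer order follows the README; the kernel size, dropout rates, and optimizer are assumptions, since they are not stated.

```python
from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional,
                          LSTM, Conv1D, GlobalAveragePooling1D,
                          GlobalMaxPooling1D, concatenate, Dropout, Dense)
from keras.models import Model

def build_model():
    inp = Input(shape=(maxlen,))              # input of shape = maxlen
    x = Embedding(max_features, embed_size,
                  weights=[embedding_matrix], trainable=False)(inp)
    x = SpatialDropout1D(0.2)(x)              # assumed rate
    x = Bidirectional(LSTM(128, return_sequences=True))(x)  # 128 units
    x = Conv1D(64, kernel_size=3, activation="relu")(x)     # down to 64 filters
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    x = concatenate([avg_pool, max_pool])
    x = Dropout(0.1)(x)                       # assumed rate
    out = Dense(6, activation="sigmoid")(x)   # one sigmoid per label
    model = Model(inputs=inp, outputs=out)
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model
```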
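Finally, a sketch of the 4-fold training loop, reusing build_model from the sketch above; the batch size, shuffling, and random seed are assumptions.

```python
from sklearn.model_selection import KFold

labels = ["toxic", "severe_toxic", "obscene", "threat",
          "insult", "identity_hate"]
y = train[labels].values

kf = KFold(n_splits=4, shuffle=True, random_state=42)
for fold, (tr_idx, val_idx) in enumerate(kf.split(X)):
    model = build_model()                 # fresh model per fold
    model.fit(X[tr_idx], y[tr_idx],
              validation_data=(X[val_idx], y[val_idx]),
              batch_size=256, epochs=5)   # converges after ~5 epochs
```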

To run the code and for more details on the challenge, see the Kaggle competition page: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

This model scored 0.9858 (mean column-wise ROC AUC) on the public leaderboard, and an ensemble scored 0.9863 on the private leaderboard, ranking in the top 11%.
