This is code I wrote during my internship at Spoken Language Systems, Saarland University. I worked on hate speech detection, using the OffensEval hate speech corpus released for SemEval 2019. I took some references and code from these repos/websites:
If you wish to reproduce this work, the steps are roughly as follows:
- Place the dataset (in csv/tsv format) in the `/Data` folder.
- Run `GenerateBERT.py` to generate BERT embeddings and save them in the `/pickles` folder.
- Run any one of the three files in `/Entity Extraction` to generate entities/noun phrases from tweets. (The code in `Stanford.py` is incomplete and needs further additions.)
- Run `TrainDoc2Vec.py` to train two Doc2Vec models on the Wiki corpus for Wikipedia embeddings. (Contact me for pretrained model files.)
- Run `EntityEmbeddings.py` to generate embeddings from the extracted entities using the trained Doc2Vec models.
- Finally, run either `TrainSVM.py` or `TrainRNN.py` to train different models and see the outcome.
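
The steps above could be chained in a small driver script. This is only a sketch, not part of the original code: it assumes every script can be run from the repository root without arguments, and the helper names `build_commands` and `run_pipeline`, as well as the `entity_script` path you pass in, are hypothetical.

```python
import subprocess
import sys

# The two trainer scripts named in the steps above.
TRAINERS = ("TrainSVM.py", "TrainRNN.py")


def build_commands(entity_script, trainer="TrainSVM.py", train_doc2vec=True):
    """Return the pipeline as a list of commands, in the order of the steps.

    entity_script: path to one of the files in /Entity Extraction (assumed
                   to be runnable as a plain script).
    train_doc2vec: set to False if you already have pretrained Doc2Vec models.
    """
    if trainer not in TRAINERS:
        raise ValueError(f"trainer must be one of {TRAINERS}, got {trainer!r}")

    scripts = ["GenerateBERT.py", entity_script]
    if train_doc2vec:
        scripts.append("TrainDoc2Vec.py")
    scripts += ["EntityEmbeddings.py", trainer]

    # Use the current interpreter so the scripts see the same environment.
    return [[sys.executable, script] for script in scripts]


def run_pipeline(entity_script, trainer="TrainSVM.py", train_doc2vec=True):
    """Run each step in order and stop at the first one that fails."""
    for command in build_commands(entity_script, trainer, train_doc2vec):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_pipeline(...)` with the path of the entity extraction script you chose, and optionally `trainer="TrainRNN.py"`, would then run the whole sequence. Because `check=True` raises on a non-zero exit code, a failing step stops the run before later steps read missing files.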
Some tips:
- Adjust the paths used for saving and loading files throughout the code.
- I have made significant edits to the code I took from the links above. Feel free to remove them or add to them to investigate the system further.
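
One way to make the path tip easier to follow is to define the folders in a single place and build every file path from them. The sketch below is an assumption about how this could be organised, not how the scripts currently work: the helpers `project_paths`, `save_pickle` and `load_pickle` are hypothetical, and only the `Data` and `pickles` folder names come from the steps above.

```python
import pickle
from pathlib import Path


def project_paths(root="."):
    """Return the folders used by the pipeline, relative to a chosen root."""
    root = Path(root)
    return {
        "data": root / "Data",        # input csv/tsv files
        "pickles": root / "pickles",  # saved embeddings
    }


def save_pickle(obj, name, root="."):
    """Pickle obj into the pickles folder and return the file path."""
    target = project_paths(root)["pickles"] / name
    target.parent.mkdir(parents=True, exist_ok=True)  # create folder if missing
    with target.open("wb") as handle:
        pickle.dump(obj, handle)
    return target


def load_pickle(name, root="."):
    """Load a previously saved object from the pickles folder."""
    with (project_paths(root)["pickles"] / name).open("rb") as handle:
        return pickle.load(handle)
```

With something like this in place, moving the project to another machine would only require changing the `root` argument instead of editing each script.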