NOTE: We are currently updating the repo and scripts to resolve the known issues, along with some other changes mentioned in this section. The changes will be reflected soon. Any other suggestions are warmly welcome.
Leveraging Pre-trained Language Models for Multidimensional Hostile Post Detection in Hindi - 3rd runner-up at CONSTRAINT 2021 Shared Task 2 (Hostile Post Detection in Hindi), collocated with AAAI 2021.
This repo contains:
- Code for the models
- Trained models used in the final submission.
- Setup instructions to reproduce results from the paper.
Some important links: arxiv, poster.
In order to cite, use the following BibTeX code:
@misc{kamal2021hostility,
title={Hostility Detection in Hindi leveraging Pre-Trained Language Models},
author={Ojasv Kamal and Adarsh Kumar and Tejas Vaidhya},
year={2021},
eprint={2101.05494},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Authors: Ojasv Kamal, Adarsh Kumar and Tejas Vaidhya
Hostile content on social media platforms is ever increasing. This has led to the need for reliable detection of hostile posts so that appropriate action can be taken to tackle them. Though considerable work has been done recently in the English language to address hostile content online, similar work in Indian languages is quite hard to find. This paper presents a transfer-learning-based approach to classify social media posts (e.g., from Twitter, Facebook) in the Hindi Devanagari script as Hostile or Non-Hostile. Hostile posts are further analyzed to determine whether they are Hateful, Fake, Defamation, or Offensive. The paper harnesses attention-based pre-trained models fine-tuned on Hindi data, with the hostile vs. non-hostile task as an auxiliary task whose features are fused into the classifiers for the further sub-tasks. Through this approach, we establish a robust and consistent model without any ensembling or complex pre-processing. We present the results of our approach in the CONSTRAINT-2021 Shared Task on hostile post detection, where our model finished 3rd runner-up in terms of Weighted Fine-Grained F1 Score.
- We fine-tune transformer-based pre-trained Hindi language models for domain-specific contextual embeddings, which are further used in the classification tasks.
- We incorporate the fine-tuned hostile vs. non-hostile detection model as an auxiliary model and fuse its features with those of the specific sub-category models (pre-trained models) for each hostility category, with further fine-tuning (a minimal sketch of this fusion follows below).
Refer to our paper for complete details.
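To make the fusion concrete, here is a minimal PyTorch sketch of the idea. The checkpoint name `ai4bharat/indic-bert`, the hidden size of 768, fusion by concatenating the [CLS] features, and freezing the auxiliary encoder are illustrative assumptions, not the repo's exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class AuxiliaryFusionClassifier(nn.Module):
    """Sketch of a sub-task head (e.g. Fake vs. Non-Fake) that fuses features
    from a hostile/non-hostile auxiliary encoder with a sub-task encoder."""

    def __init__(self, model_name="ai4bharat/indic-bert", hidden_size=768):
        super().__init__()
        # Sub-task encoder, fine-tuned further on the sub-category labels.
        self.encoder = AutoModel.from_pretrained(model_name)
        # Auxiliary encoder; in practice this would be loaded from the
        # fine-tuned hostile/non-hostile checkpoint. Frozen here (an assumption).
        self.aux_encoder = AutoModel.from_pretrained(model_name)
        for p in self.aux_encoder.parameters():
            p.requires_grad = False
        # Concatenated [CLS] features from both encoders -> binary decision.
        self.classifier = nn.Linear(2 * hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        task_feat = self.encoder(input_ids, attention_mask=attention_mask)[0][:, 0]
        aux_feat = self.aux_encoder(input_ids, attention_mask=attention_mask)[0][:, 0]
        return self.classifier(torch.cat([task_feat, aux_feat], dim=-1))
```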
Dependency | Version | Installation Command |
---|---|---|
Python | 3.8 | `conda create --name covid_entities python=3.8` and `conda activate covid_entities` |
PyTorch, cudatoolkit | >=1.5.0, 10.1 | `conda install pytorch==1.5.0 cudatoolkit=10.1 -c pytorch` |
Transformers (Huggingface) | 3.5.1 | `pip install transformers==3.5.1` |
Scikit-learn | >=0.23.1 | `pip install scikit-learn==0.23.1` |
Pandas | 0.24.2 | `pip install pandas==0.24.2` |
Numpy | 1.18.5 | `pip install numpy==1.18.5` |
Emoji | 0.6.0 | `pip install emoji==0.6.0` |
Tqdm | 4.48.2 | `pip install tqdm==4.48.2` |
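After installing the dependencies, the following convenience snippet (not part of the repo) can verify that the installed versions match the table above:

```python
# Verify that installed package versions match the table above (Python 3.8+).
from importlib.metadata import version

for pkg in ("torch", "transformers", "scikit-learn", "pandas", "numpy", "emoji", "tqdm"):
    print(f"{pkg}: {version(pkg)}")

import torch
print("CUDA available:", torch.cuda.is_available())
```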
Will be included soon.
The model weights used in our final submission have now been released.
Performance of our best model, i.e., the IndicBERT auxiliary model, on the test dataset (all values are F1 scores):
Approach | Hostile | Defamation | Fake | Hate | Offensive | Weighted |
---|---|---|---|---|---|---|
Baseline Model | 0.8422 | 0.3992 | 0.6869 | 0.4926 | 0.4198 | 0.5420 |
IndicBERT Auxiliary Model | 0.9583 | 0.4200 | 0.7741 | 0.5725 | 0.6120 | 0.6250 |
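For context, the Weighted column is the shared task's weighted fine-grained F1, which, to our understanding, is the support-weighted average of the four sub-category F1 scores. Below is a small illustrative snippet computing such a score with scikit-learn; the multi-hot arrays are toy data, not the task dataset:

```python
# Illustrative weighted fine-grained F1 over the four hostile sub-labels.
# y_true / y_pred are multi-hot indicator arrays; the values are toy data.
import numpy as np
from sklearn.metrics import f1_score

labels = ["defamation", "fake", "hate", "offensive"]
y_true = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 0, 0, 1]])
y_pred = np.array([[0, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 1, 1]])

per_class = f1_score(y_true, y_pred, average=None)       # one F1 per sub-label
weighted = f1_score(y_true, y_pred, average="weighted")  # support-weighted mean
for name, score in zip(labels, per_class):
    print(f"{name}: {score:.4f}")
print(f"weighted fine-grained F1: {weighted:.4f}")
```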
Please note that due to the skewed nature of the dataset (in terms of class imbalance and bias), and because the model is not perfectly reproducible, slight variations in the final state of the model may cause the results to vary by 0.02-0.03 F1 score (as we have observed across different runs of the same model). Refer to this for more details.
- Resolve issues with `main_multitask_learning.py`
- Make some minor changes to the code and functions
- Add setup instructions
- Add corrected code for CSV file generation
- Add a Colab notebook on usage
- In case of any issues or queries, please contact Ojasv Kamal