Dataset Two

Seid Muhie Yimam edited this page Sep 22, 2023 · 1 revision

The 5Js in Ethiopia: Amharic Hate Speech Data Annotation Using Toloka Crowdsourcing Platform

ICT4D-2022 Dataset: This is a crowdsourced dataset of 5.3k tweets collected for our paper entitled The 5Js in Ethiopia: Amharic Hate Speech Data Annotation Using Toloka Crowdsourcing Platform, presented at the 4th ICT4DA Conference 2022, Bahir Dar, Ethiopia. The second paper was presented at WiNLP2022, co-located with EMNLP2022, Abu Dhabi, United Arab Emirates. Each tweet in the dataset is annotated by three annotators on the Yandex Toloka crowdsourcing platform.

Abstract

This paper presents Amharic hate speech annotation using the Yandex Toloka crowdsourcing platform. The dataset for this paper is collected from 5 consecutive years in Ethiopia (5Js), namely from 2018-2022, following the ‘June’ events in Ethiopia. Every June for the last five years, some events have plunged the country into violence. Accordingly, we annotate 5,267 tweets, nearly 1k tweets every year. We explore the main challenges of crowdsourcing annotation for Amharic hate speech data collection using Toloka. We attain a Fleiss' kappa score of 0.34 using three independent annotators who annotate the tweets, and the gold label is determined using majority voting. Using the dataset, we build different classification models using classical machine learning and deep learning approaches. The classical machine learning algorithms, LR and SVM, achieve an F1 score of 0.49, while NB achieves an F1 score of 0.46. The deep learning algorithms (LSTM, BiLSTM, and CNN) achieve a similar F1 score of 0.44, which is the lowest performance of all models. The contextual embedding models, Am-FLAIR and Am-RoBERTa, achieve F1 scores of 0.48 and 0.50, respectively. We publicly release the dataset, source code, and models with a permissive license.
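
The abstract mentions two aggregation steps: a gold label chosen by majority voting over three annotators, and agreement measured with Fleiss' kappa. The following is a minimal sketch of both computations, not the authors' released code. The function names, the label strings (`hate`, `offensive`, `normal`), and the choice to return `None` on a three-way tie are assumptions made for illustration only.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label, or None when the top two labels tie.

    Assumption: ties are left unresolved (None); the source does not say
    how such cases were handled.
    """
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def fleiss_kappa(annotations, categories):
    """Compute Fleiss' kappa for items that each have the same number of raters.

    annotations: list of per-item label lists, e.g. [["hate", "hate", "normal"], ...]
    categories:  list of all possible labels
    """
    n_items = len(annotations)
    n_raters = len(annotations[0])
    totals = {c: 0 for c in categories}
    agreement_sum = 0.0

    for labels in annotations:
        counts = Counter(labels)
        for c in categories:
            totals[c] += counts.get(c, 0)
        # Per-item agreement: share of rater pairs that chose the same label.
        agreement_sum += (
            sum(v * v for v in counts.values()) - n_raters
        ) / (n_raters * (n_raters - 1))

    p_bar = agreement_sum / n_items  # mean observed agreement
    # Expected agreement by chance, from the overall label proportions.
    p_e = sum((totals[c] / (n_items * n_raters)) ** 2 for c in categories)
    # Note: undefined (division by zero) if every rating uses a single label.
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical toy annotations, three annotators per tweet.
    toy = [
        ["hate", "hate", "normal"],
        ["normal", "normal", "normal"],
        ["offensive", "hate", "offensive"],
    ]
    print([majority_label(item) for item in toy])
    print(fleiss_kappa(toy, ["hate", "offensive", "normal"]))
```

With three annotators and more than two labels, a three-way disagreement has no majority, so any real pipeline needs an explicit rule for those tweets; the sketch only flags them.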