DeTox-Dataset

This repository contains the DeTox-Dataset, a German Offensive Language and Conversation Analysis dataset. It arose from the research project DeTox (https://fz.h-da.de/detox/) in 2022, which targeted at detection of toxicity and aggression in comments in the internet.

Content

public dataset without comment texts (you can request the full version) as sqlite database
dataset description/documentation (html)
Jupyter Notebook to get started with the database in Python
annotation guidelines (original German version and English translation)

About the Dataset

This is a German offensive language and conversation analysis dataset ("DeTox-Dataset") containing 10,278 annotated Twitter comments. The data was collected in the first half of 2021. The comments were annotated by three annotators each with 12 different labels. The dataset contains all single annotations including meda data (e.g. annotation duration) as well as a proposed gold standard. For all details please refer to our paper.

Get Started

You have two options how to access the dataset:

get the full data by accessing the SQLite-database
get a light version from the .tsv-file

Accessing the SQLite-database

The full dataset comes in an SQLite database (DeTox-Dataset_public.zip). The database tables and attributes are described in the documentation file (DeTox-Dataset_doc.html). An example how to access the database using python is given below and in the Jupyter-Notebook.

import pandas as pd
from pathlib import Path
import sqlite3 as sqlite

database_path = Path("DeTox-Dataset_public.sqlite3")  # path to your database file
dbconnect = None
cursor = None

if not database_path.is_file():
    print(f"Database {database_path} does not exist. Creating a new database now ...")
try:
    # open database connection
    dbconnect = sqlite.connect(database_path, detect_types=sqlite.PARSE_DECLTYPES | sqlite.PARSE_COLNAMES)
    cursor = dbconnect.cursor()
    # Check Foreign-Key Constraints can be switched off if needed with the following line:
    # cursor.execute("PRAGMA foreign_keys = OFF;")
except sqlite.Error as e:
    # if errors occur
    print("Error %s:" % e.args[0])

# Database request
annotations = pd.read_sql_query("SELECT * from Goldstandard;", con=dbconnect)
annotations.head()

# close connection
dbconnect.close()

Accessing the .tsv-file

To get a first insight in our dataset you may look at the .tsv-file version (DeTox-Dataset_public.tsv). It contains the table Goldstandard of the database which contains only over the annotators averaged annotations for each comment. More details to each single annotation and annotation metadata is available in the database file.

The .tsv-file can be read in Python with the following code:

import pandas as pd

dataset = pd.read_csv("DeTox-Dataset_public.tsv", sep="\t")

Description of the columns

Column Name	Description
c_id	Twitter-IDthun of the comment.
c_text	Empty, the complete text is only available on request.
nb_annotators	Number of annotations of the comment.
dataset_id	For annotation the dataset was split in multiple smaller batches. This is the number of the batch were the comment was annotated.
duration	Average duration needed to annotate the comment.
incomp sentiment hatespeech criminal_rel threat extrem	Label for the categories incomprehensible, sentiment, hatespeech, criminal relevance, threat, and extremism averaged over all annotations for a comment. It can be understood as the percentage of annotators who labeled a comment belonging to the respective category. The value is a float in the range of 0 to 1. 0 means the category does not apply to the comment, 1 means it does.
p_86 ... p_241	States, if a comment is labeled criminal relevant under the given paragraph number in the StGB ("German Criminal Code"). The number is again averaged over all annotations for a comment.
expression_explicit expression_implicit	Count of the number of annotators who labeled the comment as explicit or implicit respectively.
toxi	Toxicity of the comment averaged over all annotations of the comment.
target_person target_group target_public	Count of the number of annotators who labeled the comments target as person, group or public.
discrim_job ... discrim_Ethnicity	Percentage of annotators who labeled the comment to be discriminating in the respective topic.

Request the complete dataset

Unfortunately, we can't publish the complete dataset including the commtents text here (this version here is missing the text, it only contains the Twitter-IDs of the comments). But we are happy to provide the complete data to you, if you send us an email to melanie.siegel@h-da.de describing in short for what you need the dataset.

Citation

If you use the dataset, please cite our respective paper "DeTox: A Comprehensive Dataset for German Offensive Language and Conversation Analysis", which was presented on the 6th Workshop on Online Abuse and Harms on 14th July 2022 as part of the NAACL conference.

ACL-Style:
Christoph Demus, Jonas Pitz, Mina Schütz, Nadine Probol, Melanie Siegel, and Dirk Labudde.

DeTox: A Comprehensive Dataset for German Offensive Language and Conversation Analysis. In Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH), pages 143–153, Seattle, Washington ( Hybrid). Association for Computational Linguistics.

BibTeX:

@inproceedings{demus-etal-2022-comprehensive,
    title = "DeTox: A Comprehensive Dataset for {G}erman Offensive Language and Conversation Analysis",
    author = {Demus, Christoph  and
      Pitz, Jonas  and
      Sch{\"u}tz, Mina  and
      Probol, Nadine  and
      Siegel, Melanie  and
      Labudde, Dirk},
    booktitle = "Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)",
    month = jul,
    year = "2022",
    address = "Seattle, Washington (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.woah-1.14",
    doi = "10.18653/v1/2022.woah-1.14",
    pages = "143--153",
    abstract = "In this work, we present a new publicly available offensive language dataset of 10.278 German social media comments collected in the first half of 2021 that were annotated by in total six annotators. With twelve different annotation categories, it is far more comprehensive than other datasets, and goes beyond just hate speech detection. The labels aim in particular also at toxicity, criminal relevance and discrimination types of comments.Furthermore, about half of the comments are from coherent parts of conversations, which opens the possibility to consider the comments{'} contexts and do conversation analyses in order to research the contagion of offensive language in conversations.",
}

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
Annotation-Guidelines_de.pdf		Annotation-Guidelines_de.pdf
Annotation-Guidelines_engl.pdf		Annotation-Guidelines_engl.pdf
DeTox-Dataset_doc.html		DeTox-Dataset_doc.html
DeTox-Dataset_public.tsv		DeTox-Dataset_public.tsv
DeTox-Dataset_public.zip		DeTox-Dataset_public.zip
LICENSE		LICENSE
README.md		README.md
get-started.ipynb		get-started.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DeTox-Dataset

Content

About the Dataset

Get Started

Accessing the SQLite-database

Accessing the .tsv-file

Request the complete dataset

Citation

About

Releases

Packages

Contributors 3

Languages

License

hdaSprachtechnologie/detox

Folders and files

Latest commit

History

Repository files navigation

DeTox-Dataset

Content

About the Dataset

Get Started

Accessing the SQLite-database

Accessing the .tsv-file

Request the complete dataset

Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages