japanese-toxic-dataset

"Proposal and Evaluation of Japanese Toxicity Schema" provides a schema and dataset for toxicity in the Japanese language.

The repository is structured as follows:

schema.md: Labeling schema.
data/subset.csv: A subset of the dataset used in "Proposal and Evaluation of Japanese Toxicity Schema". Annotation details are described in schema.md.

For Japanese Readers

A Japanese version of this README is available in README_ja.md.

data/subset.csv is composed of the following columns:

Column Name	Description
id	Sentence ID
text	Sentence
Not Toxic	Toxicity Level: Not Toxic
Hard to Say	Toxicity Level: Hard to Say
Toxic	Toxicity Level: Toxic
Very Toxic	Toxicity Level: Very Toxic
category_卑語	Toxicity Category: Dignity
category_差別	Toxicity Category: Discrimination
category_迷惑行為	Toxicity Category: Harassment
category_猥褻	Toxicity Category: Obscenity
category_出会い・プライバシー侵害	Toxicity Category: Privacy
category_違法行為	Toxicity Category: Illegal
category_偏向表現	Toxicity Category: Bias
annotation_num	Number of Annotators
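
The column layout above can be read with Python's standard `csv` module. The following is a minimal sketch, not the authors' tooling. It assumes that each toxicity-level and category column holds a number (for example a vote count or a share of annotators); the README does not state this, so check schema.md before relying on it. The helper names `read_rows`, `top_level`, and `flagged_categories` are hypothetical, and the sample rows are placeholders, not records from data/subset.csv.

```python
import csv
import io

# Column names as listed in the table above.
LEVEL_COLUMNS = ["Not Toxic", "Hard to Say", "Toxic", "Very Toxic"]
CATEGORY_PREFIX = "category_"


def read_rows(fileobj):
    """Read CSV rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(fileobj))


def top_level(row):
    """Return the toxicity-level column with the largest value.

    Assumption: each level column holds a number (a count or a share).
    Ties resolve to the first column in LEVEL_COLUMNS order.
    """
    return max(LEVEL_COLUMNS, key=lambda name: float(row[name] or 0))


def flagged_categories(row):
    """Return the category columns whose value is numerically above zero."""
    return [
        name
        for name, value in row.items()
        if name.startswith(CATEGORY_PREFIX) and float(value or 0) > 0
    ]


# Placeholder rows for illustration only; NOT taken from data/subset.csv.
# Only one category column is included to keep the sample short.
SAMPLE = io.StringIO(
    "id,text,Not Toxic,Hard to Say,Toxic,Very Toxic,category_差別,annotation_num\n"
    "1,placeholder A,3,1,0,0,0,4\n"
    "2,placeholder B,0,1,2,1,2,4\n"
)

rows = read_rows(SAMPLE)
for row in rows:
    print(row["id"], top_level(row), flagged_categories(row))
```

To run the same helpers on the real file, replace `SAMPLE` with `open("data/subset.csv", encoding="utf-8", newline="")`; the UTF-8 encoding is an assumption as well.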
