To enable more research on multilingual Relation Extraction, we generate translations of the TAC relation extraction dataset using DeepL and Google Translate.
The dataset was created by members of the DFKI SLT team: Leonhard Hennig, Philippe Thomas, Sebastian Möller, and Gabriel Kressin.
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
The instances of this dataset are sentences from the original TACRED dataset, which in turn are sampled from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges.
In total, there are 1,422,914 instances in the dataset, on average 118,576 per language, including backtranslations of the test split.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
Not applicable.
: the instance id of this sentence, a string
: the list of tokens of this sentence, a list
: the relation label of this instance, a string classification label.
subj_start: the 0-based index of the start token of the relation subject mention, an int
: the 0-based index of the end token of the relation subject mention, exclusive, an int
: the NER type of the subject mention, among the types used in the Stanford NER system, a string
: the 0-based index of the start token of the relation object mention, an int
: the 0-based index of the end token of the relation object mention, exclusive, an int
: the NER type of the object mention, among 23 fine-grained types used in the Stanford NER system, a string
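The index convention described above (0-based start, exclusive end) maps directly onto Python list slicing. The following is a minimal sketch, not the dataset's official loader: only `subj_start` is named in the text above, so the other keys (`token`, `subj_end`, `obj_start`, `obj_end`) are assumed by analogy, and the example instance is invented for illustration.

```python
from typing import Any, Dict, List, Tuple


def mention_text(tokens: List[str], start: int, end: int) -> str:
    """Return the mention covered by tokens[start:end] (end is exclusive)."""
    if not 0 <= start < end <= len(tokens):
        raise ValueError(f"invalid span [{start}, {end}) for {len(tokens)} tokens")
    return " ".join(tokens[start:end])


def extract_mentions(instance: Dict[str, Any]) -> Tuple[str, str]:
    """Return (subject mention, object mention) for one instance.

    Key names other than `subj_start` are assumptions, not confirmed
    by the field list above.
    """
    tokens = instance["token"]
    subj = mention_text(tokens, instance["subj_start"], instance["subj_end"])
    obj = mention_text(tokens, instance["obj_start"], instance["obj_end"])
    return subj, obj


# Invented example instance; it is not taken from the dataset.
example = {
    "token": ["Alice", "Smith", "works", "for", "Acme", "Corp", "."],
    "subj_start": 0,
    "subj_end": 2,
    "obj_start": 4,
    "obj_end": 6,
}

subject, obj = extract_mentions(example)
print(subject, "|", obj)
```

Because the end index is exclusive, the length of a mention in tokens is simply `end - start`, and no `+ 1` adjustment is needed when slicing.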
Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)?
Not applicable.
The target is the relation label.
Yes, the train/dev/test splits of the translated versions correspond to the original TACRED data splits.
Instances are drawn from a potentially noisy web / newswire corpus. Relation labels were assigned using crowd workers, and have been shown to be partially erroneous, see e.g. Alt et al., 2020 and Stoica et al., 2021.
Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?
The dataset is self-contained.
Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)?
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
The authors of the original TACRED dataset have not stated measures that prevent collecting sensitive or offensive text. Therefore, we do not rule out the possible risk of sensitive/offensive content in the translated data.
The GitHub repository contains the code to generate the dataset.
The dataset is used to train Relation Extraction models for evaluation of language-specific performance.
Please see
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
This dataset is generated from a dataset that is partially based on newswire data, licensed and distributed by the Linguistic Data Consortium. Therefore, this dataset will also be distributed under an LDC license, and can only be used according to that license.
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
The dataset will be distributed via the Linguistic Data Consortium at this URL
The dataset can be downloaded from this URL as a set of JSON files.
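As a rough illustration of working with the downloaded files, the sketch below reads one JSON file and counts relation labels. It assumes each file holds a top-level JSON list of instance objects and that the label is stored under a key named `relation`; neither the file layout nor the key name is confirmed here, and any file path passed in is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, List


def load_instances(path: Path) -> List[Dict[str, Any]]:
    """Load one JSON file, assumed to contain a list of instance objects."""
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON list in {path}, got {type(data).__name__}")
    return data


def relation_counts(instances: List[Dict[str, Any]]) -> Counter:
    """Count instances per relation label (key name `relation` is an assumption)."""
    return Counter(instance["relation"] for instance in instances)
```

A call such as `relation_counts(load_instances(Path("some_split.json")))` (with a hypothetical file name) would give a quick view of the label distribution in one split.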
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?
This dataset is distributed under an LDC license.
Have any third parties imposed IP-based or other restrictions on the data associated with the instances?
See the LDC license terms.
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?
The dataset is hosted at LDC.
Please open an issue in the GitHub repository.
Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?
We do not plan to update the dataset. Labeling errors can be corrected by applying the patches made available by TACRED Revisited and/or Re-TACRED.