datasets-knowledge-embedding

📝 A collection of common datasets used in knowledge embedding

Synopsis

This project collects different datasets used in various knowledge embedding related papers. It also standardizes the format of these datasets, making it easier to use them in the evaluation of new works.

The datasets can be downloaded from the release page.
For licensing information, please refer to the original dataset license file.

If you use this collection of datasets, please consider starring ⭐️ the project to support it.

Datasets format

Every subfolder in this repo is a single dataset.
Every folder contains the following 18 files.

File name	Description
`edges_as_text_{train,valid,test}.tsv`	These three files contain the three splits of the dataset, with entities and relations in textual form (e.g. `italy locatedin europe`).
`edges_as_text_all.tsv`	The concatenation of `edges_as_text_train.tsv`, `edges_as_text_valid.tsv`, and `edges_as_text_test.tsv`.
`edges_as_id_{train,valid,test}.tsv`	These three files contain the three splits of the dataset, with entities and relations mapped to numerical IDs (e.g. `38 1 2`). More frequent entities and relations are mapped to lower integers (e.g. ID `0` is assigned to the most frequent entity/relation in the dataset).
`edges_as_id_all.tsv`	The concatenation of `edges_as_id_train.tsv`, `edges_as_id_valid.tsv`, and `edges_as_id_test.tsv`.
`map_entity_id_to_text.tsv`	This file maps the numerical IDs used for entities in `edges_as_id_*.tsv` to the textual representation used in `edges_as_text_*.tsv` (e.g. `38 italy`, `2 europe`).
`map_relation_id_to_text.tsv`	This file maps the numerical IDs used for relations in `edges_as_id_*.tsv` to the textual representation used in `edges_as_text_*.tsv` (e.g. `1 locatedin`).
`frequency_entities_{all,train,valid,test}.tsv`	These files contain the frequency of each entity in the various splits of the dataset.
`frequency_relations_{all,train,valid,test}.tsv`	These files contain the frequency of each relation in the various splits of the dataset.
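
As an illustration of how the id-based files and the two mapping files fit together, here is a minimal Python sketch that decodes id-based edges back to text. It assumes that all files are tab-separated, that edge files have three columns in (head, relation, tail) order, and that mapping files have two columns (ID, text), as the examples in the table suggest. The helper names `read_tsv`, `load_id_map`, and `decode_edges` are hypothetical and not part of this project.

```python
import io


def read_tsv(lines):
    """Split each non-empty line of a tab-separated source into fields."""
    return [line.rstrip("\n").split("\t") for line in lines if line.strip()]


def load_id_map(lines):
    """Build {numeric id: text} from a two-column mapping file (assumed layout)."""
    return {int(idx): text for idx, text in read_tsv(lines)}


def decode_edges(id_edges, entity_map, relation_map):
    """Translate (head id, relation id, tail id) rows back to text triples."""
    return [
        (entity_map[int(h)], relation_map[int(r)], entity_map[int(t)])
        for h, r, t in id_edges
    ]


# In-memory stand-ins for the real files, reusing the examples from the table.
entity_file = io.StringIO("38\titaly\n2\teurope\n")
relation_file = io.StringIO("1\tlocatedin\n")
edges_file = io.StringIO("38\t1\t2\n")

entities = load_id_map(entity_file)
relations = load_id_map(relation_file)
decoded = decode_edges(read_tsv(edges_file), entities, relations)
print(decoded)  # [('italy', 'locatedin', 'europe')]
```

With real data, the `io.StringIO` objects would be replaced by files opened with `open(path, encoding="utf-8")`.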

Add a new dataset

To add a new dataset to this collection, first create three files called `train.tsv`, `valid.tsv`, and `test.tsv` containing the edges of the train, validation, and test splits respectively.
The files must contain tab-separated triples of the form (head entity, relation, tail entity).
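
Before processing the files, it can help to check that every line really is a tab-separated triple. The following Python sketch is one possible check, not part of this project; the function name `find_malformed_lines` is hypothetical. It flags any line that does not have exactly three non-empty tab-separated fields, including blank lines.

```python
def find_malformed_lines(lines):
    """Return (line number, raw line) pairs that are not 3-field triples."""
    problems = []
    for number, line in enumerate(lines, start=1):
        raw = line.rstrip("\n")
        fields = raw.split("\t")
        # A valid triple has exactly three fields and none of them is empty.
        if len(fields) != 3 or any(not field.strip() for field in fields):
            problems.append((number, raw))
    return problems


sample = [
    "italy\tlocatedin\teurope\n",
    "italy locatedin europe\n",  # spaces instead of tabs
    "france\tlocatedin\n",       # missing tail entity
]
malformed = find_malformed_lines(sample)
print(malformed)
```

To check a real split, pass an open file instead of the list, for example `find_malformed_lines(open("train.tsv", encoding="utf-8"))`.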

Once you have done this, you can process the three files with the following bash script.

bash build.sh train.tsv valid.tsv test.tsv .

The script uses the edgelist-mapper tool under the hood.
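
To give an idea of the frequency-ordered ID scheme described in the Datasets format section, here is a minimal Python sketch. It is not the edgelist-mapper implementation. It assumes that an entity is counted once per appearance as head or tail, and it breaks ties alphabetically, a choice made here only for determinism. The function name `assign_ids_by_frequency` is hypothetical.

```python
from collections import Counter


def assign_ids_by_frequency(items):
    """Map each distinct item to an integer, most frequent item first.

    Ties are broken alphabetically; this tie-break is an assumption made
    for determinism, not documented behaviour of the actual tool.
    """
    counts = Counter(items)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: idx for idx, item in enumerate(ranked)}


triples = [
    ("italy", "locatedin", "europe"),
    ("france", "locatedin", "europe"),
    ("germany", "locatedin", "europe"),
    ("italy", "neighborof", "france"),
]

# Each entity is counted once per appearance as head or tail (assumption).
entity_ids = assign_ids_by_frequency(
    [entity for head, _, tail in triples for entity in (head, tail)]
)
relation_ids = assign_ids_by_frequency([rel for _, rel, _ in triples])

print(entity_ids)    # {'europe': 0, 'france': 1, 'italy': 2, 'germany': 3}
print(relation_ids)  # {'locatedin': 0, 'neighborof': 1}
```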

Datasets

The datasets are distributed in two formats, text-based and id-based (see the Datasets format section for the difference).
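
Each dataset below comes with a table of statistics. As a rough sketch of how such numbers can be recomputed from the three splits, the Python snippet below counts distinct entities and relation types over all splits and sums the split sizes. It assumes this is how the tables were produced; the edge totals in the tables do match the sum of the three splits. The function name `dataset_statistics` is hypothetical.

```python
def dataset_statistics(train, valid, test):
    """Summarise three lists of (head, relation, tail) triples."""
    all_edges = train + valid + test
    entities = {entity for head, _, tail in all_edges for entity in (head, tail)}
    relations = {relation for _, relation, _ in all_edges}
    return {
        "entities": len(entities),
        "relation_types": len(relations),
        "edges": len(all_edges),
        "train_edges": len(train),
        "validation_edges": len(valid),
        "test_edges": len(test),
    }


# Made-up toy splits, only to show the shape of the result.
train = [("italy", "locatedin", "europe"), ("italy", "neighborof", "france")]
valid = [("france", "locatedin", "europe")]
test = [("spain", "locatedin", "europe")]

stats = dataset_statistics(train, valid, test)
print(stats)
```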

COUNTRIES-S1

This dataset was introduced in On Approximate Reasoning Capabilities of Low-Rank Vector Spaces.
The link to the original dataset as released by the authors is unknown but a copy has been taken from here.

Entities	Relation Types	Edges	Train Edges	Validation Edges	Test Edges
271	2	1159	1111	24	24

COUNTRIES-S2

This dataset was introduced in On Approximate Reasoning Capabilities of Low-Rank Vector Spaces.
The link to the original dataset as released by the authors is unknown but a copy has been taken from here.

Entities	Relation Types	Edges	Train Edges	Validation Edges	Test Edges
271	2	1111	1063	24	24

COUNTRIES-S3

This dataset was introduced in On Approximate Reasoning Capabilities of Low-Rank Vector Spaces.
The link to the original dataset as released by the authors is unknown but a copy has been taken from here.

Entities	Relation Types	Edges	Train Edges	Validation Edges	Test Edges
271	2	1033	985	24	24

FB15K

This dataset was introduced in Translating Embeddings for Modeling Multi-relational Data.
The original dataset as released by the authors is available here.

Entities in this dataset are represented through Freebase IDs (e.g. /m/07l450, /film/film/genre, /m/082gq). Since they are hard to read, we are considering mapping them to Wikipedia pages (e.g. The_Last_King_of_scotland_(film), /film/film/genre, War_film).

Entities	Relation Types	Edges	Train Edges	Validation Edges	Test Edges
14951	1345	592213	483142	50000	59071

FB15K-237

This dataset was introduced in Observed versus latent features for knowledge base and text inference.
The original dataset as released by the authors is available here.

Entities in this dataset are represented through Freebase IDs (e.g. /m/07l450, /film/film/genre, /m/082gq). Since they are hard to read, we are considering mapping them to Wikipedia pages (e.g. The_Last_King_of_scotland_(film), /film/film/genre, War_film).

Entities	Relation Types	Edges	Train Edges	Validation Edges	Test Edges
14541	237	310116	272115	17535	20466

KINSHIP

This dataset was introduced in Learning systems of concepts with an infinite relational model.
The original dataset as released by the authors is available here.

Entities	Relation Types	Edges	Train Edges	Validation Edges	Test Edges
104	25	10686	8544	1068	1074

NATIONS

This dataset was introduced in Learning systems of concepts with an infinite relational model.
The original dataset as released by the authors is available here.

Entities	Relation Types	Edges	Train Edges	Validation Edges	Test Edges
14	55	1992	1592	199	201

UMLS

This dataset was introduced in Learning systems of concepts with an infinite relational model.
The original dataset as released by the authors is available here.

Entities	Relation Types	Edges	Train Edges	Validation Edges	Test Edges
135	46	6529	5216	652	661

WN18

This dataset was introduced in Translating Embeddings for Modeling Multi-relational Data.
The original dataset as released by the authors is available here.

In the original dataset, the entities are represented through WordNet offset IDs (e.g. 01257145 derivationally_related_form 07488875), but the version distributed here has the offsets mapped to WordNet synset names that can be read by the nltk library (e.g. sensual.s.02 derivationally_related_form sensuality.n.01).
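
The synset names follow the `lemma.pos.number` pattern. With nltk and its WordNet corpus installed, a name can be looked up with `nltk.corpus.wordnet.synset("sensual.s.02")`. The dependency-free Python sketch below only splits a name into its parts; the helper `split_synset_name` is hypothetical and not part of this project or of nltk.

```python
def split_synset_name(name):
    """Split 'lemma.pos.nn' into (lemma, part of speech, sense number).

    Splitting from the right keeps lemmas that themselves contain dots intact.
    """
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, pos, int(sense)


print(split_synset_name("sensual.s.02"))     # ('sensual', 's', 2)
print(split_synset_name("sensuality.n.01"))  # ('sensuality', 'n', 1)
```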

Entities	Relation Types	Edges	Train Edges	Validation Edges	Test Edges
41105	18	151442	141442	5000	5000

WN18RR

This dataset was introduced in Convolutional 2D Knowledge Graph Embeddings.
The original dataset as released by the authors is available here.

In the original dataset, the entities are represented through WordNet offset IDs (e.g. 01257145 derivationally_related_form 07488875), but the version distributed here has the offsets mapped to WordNet synset names that can be read by the nltk library (e.g. sensual.s.02 derivationally_related_form sensuality.n.01).

Entities	Relation Types	Edges	Train Edges	Validation Edges	Test Edges
41105	11	93003	86835	3034	3134

YAGO3-10

This dataset was introduced in Convolutional 2D Knowledge Graph Embeddings.
The original dataset as released by the authors is available here.

Entities	Relation Types	Edges	Train Edges	Validation Edges	Test Edges
123182	37	1089040	1079040	5000	5000

Authors

Simone Primarosa - simonepri

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the license file for details.
