Comprehensive Analysis of Freebase and Dataset Creation for Robust Evaluation of Knowledge Graph Link Prediction Models (ISWC 2023)
If you use our datasets in a scientific publication, we kindly request that you cite the following paper:
@inproceedings{shirvani2023comprehensive,
title={Comprehensive analysis of freebase and dataset creation for robust evaluation of knowledge graph link prediction models},
author={Shirvani-Mahdavi, Nasim and Akrami, Farahnaz and Saeef, Mohammed Samiul and Shi, Xiao and Li, Chengkai},
booktitle={International Semantic Web Conference},
pages={113--133},
year={2023},
organization={Springer}
}
This repository contains the datasets, preprocessing scripts, and experiment results of the paper Comprehensive Analysis of Freebase and Dataset Creation for Robust Evaluation of Knowledge Graph Link Prediction Models. In the paper, we analyze the challenges and impacts of three data modeling idiosyncrasies of Freebase (reverse triples, mediator nodes, and the type system) on knowledge graph completion tasks such as link prediction. Freebase is a large-scale, open-domain knowledge graph.
Freebase is amongst the largest public cross-domain KGs that store common facts. It possesses several data modeling idiosyncrasies rarely found in comparable datasets such as Wikidata and YAGO. Though closed in 2015, Freebase still serves as an important knowledge graph in intelligent tasks. We checked all full-length papers that were published in the 2022 editions of 12 top conferences and that used datasets commonly used for link prediction. The 12 conferences are AAAI, IJCAI, WWW, KDD, ICML, ACL, EMNLP, NAACL, SIGIR, NeurIPS, SIGMOD, and VLDB. That amounts to 53 papers. 48 of the 53 papers used datasets produced from Freebase, while only 8 used datasets from Wikidata. The papers and the datasets used in them are listed in the file papers.xlsx.
Reverse Triples
When a new fact was included in Freebase, it would be added as a pair of reverse triples. For instance, (A Room With A View, /film/film/directed_by, James Ivory) and (James Ivory, /film/director/film, A Room With A View) form a pair of reverse triples. They have the same semantic meaning.
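A minimal sketch of how such pairs could be spotted in a list of triples. The overlap heuristic, the function name `find_reverse_relation_pairs`, and the `threshold` parameter are assumptions made for illustration, not necessarily how the authors identified reverse triples.

```python
from collections import defaultdict

def find_reverse_relation_pairs(triples, threshold=0.8):
    """Return relation pairs (r1, r2) whose (head, tail) sets mirror each other.

    Heuristic for illustration only: r1 and r2 are reported when the share of
    mirrored pairs, relative to the smaller relation, reaches `threshold`.
    """
    pairs_by_rel = defaultdict(set)
    for h, r, t in triples:
        pairs_by_rel[r].add((h, t))

    found = []
    rels = sorted(pairs_by_rel)
    for i, r1 in enumerate(rels):
        for r2 in rels[i + 1:]:
            flipped = {(t, h) for h, t in pairs_by_rel[r2]}
            overlap = len(pairs_by_rel[r1] & flipped)
            smaller = min(len(pairs_by_rel[r1]), len(pairs_by_rel[r2]))
            if overlap / smaller >= threshold:
                found.append((r1, r2))
    return found

triples = [
    ("A Room With A View", "/film/film/directed_by", "James Ivory"),
    ("James Ivory", "/film/director/film", "A Room With A View"),
]
reverse_pairs = find_reverse_relation_pairs(triples)
print(reverse_pairs)
```

If both triples of such a pair are split across the training and test sets, a model can answer the test triple by simply flipping the training triple, which is why the variants below offer the option of removing them.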
Mediator Nodes
Mediator nodes, also called CVT nodes, are used in Freebase to represent n-ary relationships. The figure below shows a CVT node connected to an award, a nominee, and a work. This or a similar approach is necessary for accurate modeling of the real world.
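To make the idea concrete, the sketch below shows one possible way to convert an n-ary relationship into binary ones by replacing each two-hop path through a CVT node with a single concatenated edge. This is an assumption-laden illustration: the function `collapse_cvt`, the node names, and the "." concatenation convention are chosen for the example and may differ from what the preprocessing scripts in this repository do.

```python
def collapse_cvt(triples, cvt_nodes):
    """Replace paths (a, r1, cvt), (cvt, r2, b) with (a, r1.r2, b).

    Illustrative only: triples that do not touch a CVT node are kept as-is.
    """
    incoming, outgoing, kept = {}, {}, []
    for h, r, t in triples:
        if t in cvt_nodes:
            incoming.setdefault(t, []).append((h, r))
        elif h in cvt_nodes:
            outgoing.setdefault(h, []).append((r, t))
        else:
            kept.append((h, r, t))

    for cvt, in_edges in incoming.items():
        for h, r1 in in_edges:
            for r2, t in outgoing.get(cvt, []):
                if h != t:  # skip edges that would loop back to the same node
                    kept.append((h, r1 + "." + r2, t))
    return kept

# Hypothetical node names; "cvt_1" plays the role of the mediator node.
triples = [
    ("nominee_1", "/award/award_nominee/award_nominations", "cvt_1"),
    ("cvt_1", "/award/award_nomination/award", "award_1"),
    ("cvt_1", "/award/award_nomination/nominated_for", "work_1"),
]
collapsed = collapse_cvt(triples, cvt_nodes={"cvt_1"})
for triple in collapsed:
    print(triple)
```

Note that the collapsed graph no longer records that the award and the work belong to the same nomination, which is the kind of information loss that retaining CVT nodes avoids.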
Type System
Freebase categorizes each topic into one or more types and each type into one domain. Furthermore, the triple instances satisfy pseudo constraints as if they are governed by a rigorous type system. Specifically, 1) given a node, its types set up constraints on the labels of its properties; the type segment in the label of an edge (which is different from the edge type) in most cases belongs to one of the types of the subject node. 2) Given an edge type and its edge instances, there is almost a function that maps from the edge type to a type that all subjects in the edge instances belong to, and similarly almost such a function for objects.
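The first constraint can be illustrated with a small check. In this sketch the type segment is taken to be the edge label without its last path component; the helper names and the example type set are assumptions for illustration.

```python
def type_segment(edge_label):
    """Return the type segment of an edge label, e.g. '/film/film' for '/film/film/directed_by'."""
    return edge_label.rsplit("/", 1)[0]

def subject_type_consistent(subject_types, edge_label):
    """Check whether the type segment of the edge label is one of the subject's types."""
    return type_segment(edge_label) in subject_types

# Hypothetical set of types for a film entity.
film_types = {"/film/film", "/common/topic"}
print(subject_type_consistent(film_types, "/film/film/directed_by"))
print(subject_type_consistent(film_types, "/music/album/release_type"))
```

Since the constraint holds only "in most cases", a check like this is better suited to measuring how often it is satisfied than to rejecting triples outright.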
Four variants of the Freebase dataset are provided by the inclusion/exclusion of various data modeling idiosyncrasies, which enables researchers to leverage or avoid such features based on the nature of their tasks. The dataset can be downloaded from this link.
variant | CVT nodes | reverse triples | #entities | #properties | #triples |
---|---|---|---|---|---|
FB-CVT-REV | removed | removed | 46,069,321 | 3,055 | 125,124,274 |
FB-CVT+REV | removed | retained | 46,077,533 | 5,028 | 238,981,274 |
FB+CVT-REV | retained | removed | 59,894,890 | 2,641 | 134,213,735 |
FB+CVT+REV | retained | retained | 59,896,902 | 4,425 | 244,112,599 |
The dataset consists of the four variants of the Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:
- Subject matter triples file
  - fb+/-CVT+/-REV: One folder for each variant. In each folder, there are 5 files: train.txt, valid.txt, test.txt, entity2id.txt, relation2id.txt. Subject matter triples are the triples that belong to subject matter domains, that is, domains describing real-world facts.
    - Example of a row in train.txt, valid.txt, and test.txt:
      - 2, 192, 0
    - Example of a row in entity2id.txt:
      - /g/112yfy2xr, 2
    - Example of a row in relation2id.txt:
      - /music/album/release_type, 192
    - Explanation
      - "/g/112yfy2xr" and "/m/02lx2r" are the MIDs of the subject entity and object entity, respectively. "/music/album/release_type" is the relationship between the two entities. 2, 192, and 0 are the IDs assigned by the authors to the subject entity, the relationship, and the object entity, respectively.
- Type system file
  - freebase_endtypes: Each row maps an edge type to its required subject type and object type.
    - Example
      - 92, 47178872, 90
    - Explanation
      - The edge type with relationship id "47178872" requires a subject of type id "92" and an object of type id "90".
- Metadata files
  - object_types: Each row maps the MID of a Freebase object to a type it belongs to.
    - Example
      - /g/11b41c22g, /type/object/type, /people/person
    - Explanation
      - The entity with MID "/g/11b41c22g" has a type "/people/person".
  - object_names: Each row maps the MID of a Freebase object to its textual label.
    - Example
      - /g/11b78qtr5m, /type/object/name, "Viroliano Tries Jazz"@en
    - Explanation
      - The entity with MID "/g/11b78qtr5m" has the name "Viroliano Tries Jazz" in English.
  - object_ids: Each row maps the MID of a Freebase object to its user-friendly identifier.
    - Example
      - /m/05v3y9r, /type/object/id, "/music/live_album/concert"
    - Explanation
      - The entity with MID "/m/05v3y9r" can be interpreted by humans as a music concert live album.
  - domains_id_label: Each row maps the MID of a Freebase domain to its label.
    - Example
      - /m/05v4pmy, geology, 77
    - Explanation
      - The object with MID "/m/05v4pmy" in Freebase is the domain "geology", and has id "77" in our dataset.
  - types_id_label: Each row maps the MID of a Freebase type to its label.
    - Example
      - /m/01xljxh, /government/political_party, 147
    - Explanation
      - The object with MID "/m/01xljxh" in Freebase is the type "/government/political_party", and has id "147" in our dataset.
  - entities_id_label: Each row maps the MID of a Freebase entity to its label.
    - Example
      - /g/11b78qtr5m, Viroliano Tries Jazz, 2234
    - Explanation
      - The entity with MID "/g/11b78qtr5m" in Freebase is "Viroliano Tries Jazz", and has id "2234" in our dataset.
  - properties_id_label: Each row maps the MID of a Freebase property to its label.
    - Example
      - /m/010h8tp2, /comedy/comedy_group/members, 47178867
    - Explanation
      - The object with MID "/m/010h8tp2" in Freebase is a property (relation/edge). It has the label "/comedy/comedy_group/members" and id "47178867" in our dataset.
  - uri_original2simplified and uri_simplified2original: The mapping from the original URI to the simplified URI and the mapping from the simplified URI to the original URI, respectively.
    - Example
      - uri_original2simplified
        - "http://rdf.freebase.com/ns/type.property.unique": "/type/property/unique" (the URI directs to nothing because Freebase has been closed)
      - uri_simplified2original
        - "/type/property/unique": "http://rdf.freebase.com/ns/type.property.unique" (the URI directs to nothing because Freebase has been closed)
    - Explanation
      - The URI "http://rdf.freebase.com/ns/type.property.unique" in the original Freebase RDF dataset is simplified into "/type/property/unique" in our dataset.
      - The identifier "/type/property/unique" in our dataset has the URI "http://rdf.freebase.com/ns/type.property.unique" in the original Freebase RDF dataset.
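The files described above could be read along the following lines. This is a sketch under stated assumptions: fields are taken to be comma-separated as in the example rows, the helper names `parse_id_map` and `parse_id_triples` are hypothetical, the row "/m/02lx2r, 0" is inferred from the explanation rather than shown in the examples, and in-memory strings stand in for the real files.

```python
import io

def parse_id_map(lines):
    """Parse rows such as '/g/112yfy2xr, 2' into a dict {id: label}."""
    id_to_label = {}
    for line in lines:
        if line.strip():
            label, idx = line.rsplit(",", 1)  # the integer id is the last field
            id_to_label[int(idx)] = label.strip()
    return id_to_label

def parse_id_triples(lines):
    """Parse rows of three comma-separated integers into tuples."""
    rows = []
    for line in lines:
        if line.strip():
            a, b, c = (int(x) for x in line.split(","))
            rows.append((a, b, c))
    return rows

# In-memory stand-ins for entity2id.txt, relation2id.txt, train.txt,
# and freebase_endtypes, using the example rows from this README.
entities = parse_id_map(io.StringIO("/g/112yfy2xr, 2\n/m/02lx2r, 0\n"))
relations = parse_id_map(io.StringIO("/music/album/release_type, 192\n"))
train = parse_id_triples(io.StringIO("2, 192, 0\n"))

# freebase_endtypes rows are (subject type id, relationship id, object type id).
endtypes = {r: (s, o) for s, r, o in parse_id_triples(io.StringIO("92, 47178872, 90\n"))}

h, r, t = train[0]
decoded = (entities[h], relations[r], entities[t])
print(decoded)
print(endtypes[47178872])
```

With the real dataset, the `io.StringIO(...)` arguments would be replaced by open file handles, for example `open("train.txt")`.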
We conducted all the link prediction experiments on the four datasets using the DGL-KE framework (Zheng et al., 2020).
The hyperparameters used for each experiment, its training/test time, and more details can be found in the script provided for each dataset.
The results of these experiments on our datasets are shown in the table below.
| | FB-CVT-REV | | | | FB-CVT+REV | | | | FB+CVT-REV | | | | FB+CVT+REV | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Model | MRR | MR | H1 | H10 | MRR | MR | H1 | H10 | MRR | MR | H1 | H10 | MRR | MR | H1 | H10 |
TransE | 0.806 | 5.869 | 0.757 | 0.884 | 0.976 | 1.529 | 0.968 | 0.988 | 0.781 | 4.850 | 0.708 | 0.902 | 0.970 | 1.464 | 0.957 | 0.989 |
DistMult | 0.703 | 70.498 | 0.664 | 0.775 | 0.952 | 9.239 | 0.941 | 0.970 | 0.612 | 81.841 | 0.562 | 0.704 | 0.927 | 12.924 | 0.913 | 0.951 |
ComplEx | 0.719 | 67.740 | 0.684 | 0.783 | 0.958 | 8.437 | 0.950 | 0.972 | 0.624 | 83.205 | 0.577 | 0.708 | 0.928 | 13.278 | 0.915 | 0.951 |
TransR | 0.663 | 58.553 | 0.620 | 0.743 | 0.944 | 5.982 | 0.931 | 0.967 | 0.640 | 47.524 | 0.580 | 0.754 | 0.935 | 6.071 | 0.916 | 0.969 |
RotatE | 0.804 | 75.721 | 0.780 | 0.845 | 0.962 | 10.431 | 0.956 | 0.974 | 0.736 | 68.436 | 0.699 | 0.807 | 0.948 | 10.263 | 0.938 | 0.969 |
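For reference, the four columns reported per dataset are standard rank-based metrics. The sketch below computes them from a list of ranks of the correct entity; the function name `ranking_metrics` and the example ranks are made up for illustration and are unrelated to the numbers in the table.

```python
def ranking_metrics(ranks):
    """Compute MRR, MR, Hits@1, and Hits@10 from 1-based ranks of the correct answers."""
    n = len(ranks)
    return {
        "MRR": sum(1.0 / r for r in ranks) / n,   # mean reciprocal rank (higher is better)
        "MR": sum(ranks) / n,                     # mean rank (lower is better)
        "H1": sum(r <= 1 for r in ranks) / n,     # share of answers ranked first
        "H10": sum(r <= 10 for r in ranks) / n,   # share of answers ranked in the top 10
    }

# Hypothetical ranks for four test triples.
metrics = ranking_metrics([1, 2, 10, 50])
print(metrics)
```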
Another way of evaluating embedding models is to find their performance on triple classification. This task is the binary classification of triples regarding whether they are true or false facts. The results of our triple classification task are shown in the tables below.
| | consistent h | | | | inconsistent h | | | |
---|---|---|---|---|---|---|---|---|
Model | Precision | Recall | Acc | F1 | Precision | Recall | Acc | F1 |
RESCAL | 0.59 | 0.37 | 0.55 | 0.45 | 0.95 | 0.83 | 0.89 | 0.89 |
TransE | 0.52 | 0.59 | 0.52 | 0.55 | 0.81 | 0.69 | 0.76 | 0.74 |
DistMult | 0.53 | 0.51 | 0.53 | 0.52 | 0.94 | 0.87 | 0.91 | 0.90 |
ComplEx | 0.54 | 0.48 | 0.53 | 0.51 | 0.94 | 0.88 | 0.91 | 0.91 |
ConvE | 0.54 | 0.53 | 0.54 | 0.53 | 0.57 | 0.72 | 0.59 | 0.64 |
RotatE | 0.52 | 0.53 | 0.52 | 0.52 | 0.89 | 0.83 | 0.87 | 0.86 |

| | consistent t | | | | inconsistent t | | | |
---|---|---|---|---|---|---|---|---|
Model | Precision | Recall | Acc | F1 | Precision | Recall | Acc | F1 |
RESCAL | 0.64 | 0.45 | 0.60 | 0.53 | 0.95 | 0.86 | 0.91 | 0.90 |
TransE | 0.58 | 0.54 | 0.57 | 0.56 | 0.90 | 0.82 | 0.86 | 0.86 |
DistMult | 0.59 | 0.55 | 0.58 | 0.57 | 0.95 | 0.89 | 0.92 | 0.92 |
ComplEx | 0.60 | 0.56 | 0.59 | 0.58 | 0.95 | 0.90 | 0.93 | 0.92 |
ConvE | 0.62 | 0.41 | 0.58 | 0.49 | 0.95 | 0.83 | 0.89 | 0.88 |
RotatE | 0.60 | 0.47 | 0.58 | 0.53 | 0.87 | 0.78 | 0.83 | 0.82 |
The experiments on triple classification were done using the LibKGE framework (Broscheit et al., 2020).
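The reported columns follow the usual definitions for binary classification. The sketch below derives them from confusion-matrix counts; the function names and the counts are hypothetical. The last line checks that the F1 formula reproduces the F1 value listed for RESCAL under "consistent h" from its precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"Precision": precision, "Recall": recall, "Acc": accuracy, "F1": f1}

def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives, 6 true negatives.
scores = classification_metrics(tp=8, fp=2, fn=4, tn=6)
print(scores)

# Precision 0.59 and recall 0.37 are the RESCAL "consistent h" values from the table.
print(round(f1_from_precision_recall(0.59, 0.37), 2))
```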
- parse_triples.sh script is used for URI simplification.
- FBDataDump.sh is a script that runs parse_triples.sh and creates different MySQL tables from the Freebase data dump, for example, tables for domains, types, properties, and entities. Command to run FBDataDump.sh:
./FBDataDump.sh mysql_username mysql_password
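Judging from the uri_original2simplified example earlier in this README, URI simplification amounts to dropping the Freebase namespace prefix and turning dots into slashes. The Python sketch below mirrors that mapping for illustration only; it is not the logic of parse_triples.sh itself, the function names are hypothetical, and it assumes that identifiers contain no literal dots or slashes of their own.

```python
FREEBASE_NS = "http://rdf.freebase.com/ns/"

def simplify_uri(uri):
    """Turn 'http://rdf.freebase.com/ns/type.property.unique' into '/type/property/unique'."""
    if not uri.startswith(FREEBASE_NS):
        raise ValueError("not a Freebase namespace URI: " + uri)
    return "/" + uri[len(FREEBASE_NS):].replace(".", "/")

def restore_uri(simplified):
    """Inverse of simplify_uri under the same assumptions."""
    return FREEBASE_NS + simplified.lstrip("/").replace("/", ".")

original = "http://rdf.freebase.com/ns/type.property.unique"
simplified = simplify_uri(original)
print(simplified)
print(restore_uri(simplified) == original)
```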
After running FBDataDump.sh, you may want to run one of the four scripts provided for each variant. All these four scripts detach the subject matter triples from the metadata and administrative triples. In addition, all these scripts create a type system for the final dataset. Command to run FBx.sh, where x ∈ {1,2,3,4}:
./FBx.sh mysql_username mysql_password
- If you need to remove all the reverse triples as well as all the CVT nodes, you can run FB1.sh.
- To keep the reverse triples but remove the CVT nodes, you can run FB2.sh.
- To keep the CVT nodes but remove the reverse triples, you can run FB3.sh.
- To keep both the CVT nodes and the reverse triples, you can run FB4.sh.
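The choice among the four scripts can be summarized as a small lookup. The helper below is hypothetical and only restates the list above; pairing each script with a variant name from the table (for example, FB1.sh with FB-CVT-REV) is inferred from the descriptions rather than stated explicitly.

```python
def variant_and_script(keep_cvt, keep_reverse):
    """Return the (variant name, script name) matching the two choices."""
    variant = "FB" + ("+" if keep_cvt else "-") + "CVT" + ("+" if keep_reverse else "-") + "REV"
    script = {
        (False, False): "FB1.sh",  # remove CVT nodes and reverse triples
        (False, True): "FB2.sh",   # remove CVT nodes, keep reverse triples
        (True, False): "FB3.sh",   # keep CVT nodes, remove reverse triples
        (True, True): "FB4.sh",    # keep both
    }[(keep_cvt, keep_reverse)]
    return variant, script

print(variant_and_script(keep_cvt=False, keep_reverse=False))
print(variant_and_script(keep_cvt=True, keep_reverse=True))
```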
We ran experiments on the four variants of Freebase as well as FB15k and FB15k-237 using link prediction models such as TransE, DistMult, ComplEx, and RotatE. The scripts to run the experiments are the .sh files in ExperimentsScripts/. An example of running the DistMult model on FB1 is shown below.
dglke_train --model_name DistMult --dataset Freebase --data_path ./data --format udd_hrt \
--data_files entity2id.txt relation2id.txt train.txt valid.txt test.txt --batch_size 1024 --neg_sample_size 256 --hidden_dim 400 --gamma 143.0 \
--lr 0.08 --batch_size_eval 1000 --test -adv --mix_cpu_gpu --num_proc 8 --gpu 0 1 --max_step 300000 --neg_sample_size_eval 1000 \
--eval_interval 100000 --log_interval 1000 --async_update --rel_part --force_sync_interval 10000 --num_thread 4 --no_save_emb --delimiter ,
Please feel free to check out another paper of ours related to this topic: Realistic re-evaluation of knowledge graph completion methods: An experimental study
The dataset and code are made available under the CC0 1.0 Universal.
Note: Freebase Data Dumps is provided free of charge for any purpose. It is distributed under the Creative Commons Attribution (aka CC-BY) and the usage is subject to the Terms of Service.