idirlab/freebases

Comprehensive Analysis of Freebase and Dataset Creation for Robust Evaluation of Knowledge Graph Link Prediction Models (ISWC 2023)

Cite

If you use our datasets in a scientific publication, we kindly request that you cite the following paper:

@inproceedings{shirvani2023comprehensive,
  title={Comprehensive analysis of freebase and dataset creation for robust evaluation of knowledge graph link prediction models},
  author={Shirvani-Mahdavi, Nasim and Akrami, Farahnaz and Saeef, Mohammed Samiul and Shi, Xiao and Li, Chengkai},
  booktitle={International Semantic Web Conference},
  pages={113--133},
  year={2023},
  organization={Springer}
}

Overview

This repository contains the dataset, preprocessing scripts, and experiment results of the paper Comprehensive Analysis of Freebase and Dataset Creation for Robust Evaluation of Knowledge Graph Link Prediction Models. In the paper, we present a comprehensive analysis of the challenges and impacts that three idiosyncrasies of Freebase (reverse triples, mediator nodes, and the type system) pose for knowledge graph completion tasks such as link prediction. Freebase is a large-scale, open-domain knowledge graph.

Freebase is among the largest public cross-domain KGs that store common facts. It possesses several data modeling idiosyncrasies rarely found in comparable datasets such as Wikidata and YAGO. Although it was closed in 2015, Freebase still serves as an important knowledge graph in intelligent tasks. We checked all full-length papers that used datasets commonly used for link prediction and were published in the 2022 editions of 12 top conferences: AAAI, IJCAI, WWW, KDD, ICML, ACL, EMNLP, NAACL, SIGIR, NeurIPS, SIGMOD, and VLDB. That amounts to 53 papers. Of these, 48 used datasets produced from Freebase, while only 8 used datasets from Wikidata. The papers and the datasets used in them are listed in the file papers.xlsx.

Reverse Triples

When a new fact was included in Freebase, it would be added as a pair of reverse triples. For instance, (A Room With A View, /film/film/directed_by, James Ivory) and (James Ivory, /film/director/film, A Room With A View) form a pair of reverse triples. They have the same semantic meaning.
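The idea can be illustrated with a minimal Python sketch that collapses such a pair into a single triple. The `REVERSE_OF` mapping and the function name are hypothetical and introduced only for illustration; they are not taken from the repository's scripts, which may identify reverse properties differently.

```python
# Hypothetical mapping from a property to its reverse property.
# Only the pair named in the text above is listed here.
REVERSE_OF = {
    "/film/director/film": "/film/film/directed_by",
}

def drop_reverse_triples(triples):
    """Keep one triple per reverse pair by rewriting reverse edges
    into their canonical direction and removing duplicates."""
    kept = set()
    for subject, prop, obj in triples:
        if prop in REVERSE_OF:
            # Flip the edge so both members of a pair become identical.
            subject, prop, obj = obj, REVERSE_OF[prop], subject
        kept.add((subject, prop, obj))
    return sorted(kept)

triples = [
    ("A Room With A View", "/film/film/directed_by", "James Ivory"),
    ("James Ivory", "/film/director/film", "A Room With A View"),
]
deduplicated = drop_reverse_triples(triples)
print(deduplicated)
```

Which direction is treated as canonical is an arbitrary choice in this sketch.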

Mediator Nodes

Mediator nodes, also called CVT nodes, are used in Freebase to represent n-ary relationships. The figure below shows a CVT node connected to an award, a nominee, and a work. This or a similar approach is necessary for accurate modeling of the real world.

[Figure: a CVT node connected to an award, a nominee, and a work]
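As a rough illustration, the Python sketch below models one CVT node as a hub with three labeled edges and shows one possible way to replace it with binary edges between its neighbors. All identifiers (`cvt_1`, `award_X`, `nominee_Y`, `work_Z`, and the edge labels) are placeholders, and the pairwise collapsing strategy is an assumption made for illustration; the repository's scripts may remove CVT nodes differently.

```python
from itertools import combinations

# Placeholder identifiers, not actual Freebase MIDs or property labels.
cvt_edges = [
    ("cvt_1", "award", "award_X"),
    ("cvt_1", "nominee", "nominee_Y"),
    ("cvt_1", "work", "work_Z"),
]

def collapse_cvt(edges):
    """Replace each CVT node with direct edges between every pair of
    its neighbors, labeling each new edge with the two original labels."""
    neighbors = {}
    for cvt, label, entity in edges:
        neighbors.setdefault(cvt, []).append((label, entity))
    binary = []
    for members in neighbors.values():
        for (label_a, ent_a), (label_b, ent_b) in combinations(members, 2):
            binary.append((ent_a, f"{label_a}.{label_b}", ent_b))
    return binary

binary_edges = collapse_cvt(cvt_edges)
for edge in binary_edges:
    print(edge)
```

The binary edges no longer record that the award, the nominee, and the work belong to the same nomination, which is the information a CVT node preserves.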

Type System

Freebase categorizes each topic into one or more types and each type into one domain. Furthermore, the triple instances satisfy pseudo constraints as if they were governed by a rigorous type system. Specifically, 1) given a node, its types constrain the labels of its properties: the type segment in the label of an edge (which is distinct from the edge type itself) is, in most cases, one of the types of the subject node. 2) Given an edge type and its edge instances, there is almost a function that maps the edge type to a type that all subjects of those edge instances belong to, and similarly almost such a function for objects.
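The first constraint can be sketched in Python as follows. The sketch assumes that a property label has the three-segment form /domain/type/property, as in /film/film/directed_by; the function names are hypothetical and are not part of the repository.

```python
def split_property_label(label):
    """Split a label of the assumed form '/domain/type/property'
    into its domain, type segment, and property name."""
    parts = label.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"unexpected property label: {label!r}")
    domain, type_name, prop = parts
    return {
        "domain": f"/{domain}",
        "type": f"/{domain}/{type_name}",
        "property": prop,
    }

def subject_matches_type_segment(subject_types, label):
    """Check whether the type segment of the edge label is among
    the types of the subject node."""
    return split_property_label(label)["type"] in subject_types

segments = split_property_label("/film/film/directed_by")
print(segments)
print(subject_matches_type_segment({"/film/film"}, "/film/film/directed_by"))
```

Because the constraint holds only in most cases, a check like this is better suited to measuring how often it holds than to rejecting triples outright.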

Dataset

Four variants of the Freebase dataset are provided, which differ in whether they include or exclude particular data modeling idiosyncrasies. This enables researchers to leverage or avoid such features based on the nature of their tasks. The dataset can be downloaded from this link.

Dataset Statistics

| Variant | CVT nodes | Reverse triples | #Entities | #Properties | #Triples |
|---|---|---|---|---|---|
| FB-CVT-REV | removed | removed | 46,069,321 | 3,055 | 125,124,274 |
| FB-CVT+REV | removed | retained | 46,077,533 | 5,028 | 238,981,274 |
| FB+CVT-REV | retained | removed | 59,894,890 | 2,641 | 134,213,735 |
| FB+CVT+REV | retained | retained | 59,896,902 | 4,425 | 244,112,599 |

Dataset Details

The dataset consists of the four variants of the Freebase dataset as well as related mapping/support files. For each variant, we made three kinds of files available:

  • Subject matter triples file
    • fb+/-CVT+/-REV: one folder for each variant. Each folder contains 5 files: train.txt, valid.txt, test.txt, entity2id.txt, and relation2id.txt. Subject matter triples are the triples that belong to subject matter domains, i.e., domains describing real-world facts.
      • Example of a row in train.txt, valid.txt, and test.txt
        • 2, 192, 0

      • Example of a row in entity2id.txt:
        • /g/112yfy2xr, 2
      • Example of a row in relation2id.txt:
        • /music/album/release_type, 192
      • Explanation
        • "/g/112yfy2xr" and "/m/02lx2r" are the MIDs of the subject entity and object entity, respectively, and "/music/album/release_type" is the relationship between the two entities. 2, 192, and 0 are the IDs that the authors assigned to the subject entity, the relationship, and the object entity.
  • Type system file
    • freebase_endtypes: Each row maps an edge type to its required subject type and object type.
      • Example
        • 92, 47178872, 90

      • Explanation
        • "92" and "90" are the type IDs of the subject and object, respectively, of the relationship with ID "47178872".
  • Metadata files
    • object_types: Each row maps the MID of a Freebase object to a type it belongs to.
      • Example
        • /g/11b41c22g, /type/object/type, /people/person

      • Explanation
        • The entity with MID "/g/11b41c22g" has a type "/people/person"
    • object_names: Each row maps the MID of a Freebase object to its textual label.
      • Example
        • /g/11b78qtr5m, /type/object/name, "Viroliano Tries Jazz"@en

      • Explanation
        • The entity with MID "/g/11b78qtr5m" has the name "Viroliano Tries Jazz" in English.
    • object_ids: Each row maps the MID of a Freebase object to its user-friendly identifier.
      • Example
        • /m/05v3y9r, /type/object/id, "/music/live_album/concert"

      • Explanation
        • The entity with MID "/m/05v3y9r" can be interpreted by humans as a music concert live album.
    • domains_id_label: Each row maps the MID of a Freebase domain to its label.
      • Example
        • /m/05v4pmy, geology, 77

      • Explanation
        • The object with MID "/m/05v4pmy" in Freebase is the domain "geology", and has id "77" in our dataset.
    • types_id_label: Each row maps the MID of a Freebase type to its label.
      • Example
        • /m/01xljxh, /government/political_party, 147

      • Explanation
        • The object with MID "/m/01xljxh" in Freebase is the type "/government/political_party", and has id "147" in our dataset.
    • entities_id_label: Each row maps the MID of a Freebase entity to its label.
      • Example
        • /g/11b78qtr5m, Viroliano Tries Jazz, 2234

      • Explanation
        • The entity with MID "/g/11b78qtr5m" in Freebase is "Viroliano Tries Jazz", and has id "2234" in our dataset.
    • properties_id_label: Each row maps the MID of a Freebase property to its label.
      • Example
        • /m/010h8tp2, /comedy/comedy_group/members, 47178867

      • Explanation
        • The object with MID "/m/010h8tp2" in Freebase is a property (relation/edge); it has the label "/comedy/comedy_group/members" and the id "47178867" in our dataset.
    • uri_original2simplified and uri_simplified2original: The mappings from original URIs to simplified URIs and from simplified URIs back to original URIs, respectively.
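The subject matter triples files described above can be decoded back to MIDs and property labels with a short Python sketch such as the one below. It assumes comma-separated rows exactly as in the examples ("label, id" in entity2id.txt and relation2id.txt, and "head, relation, tail" in train.txt), and it uses in-memory strings in place of real files; the function names are hypothetical. The row for "/m/02lx2r" is inferred from the explanation above, which gives that entity the ID 0.

```python
import io

def read_id_map(handle):
    """Read 'label, id' rows into a dict that maps id -> label."""
    id_to_label = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so the id is always the final field.
        label, idx = line.rsplit(",", 1)
        id_to_label[int(idx)] = label.strip()
    return id_to_label

def read_triples(handle):
    """Yield (head_id, relation_id, tail_id) tuples from 'h, r, t' rows."""
    for line in handle:
        line = line.strip()
        if line:
            head, relation, tail = (int(field) for field in line.split(","))
            yield head, relation, tail

# In-memory stand-ins for the files, built from the example rows above.
entity_file = io.StringIO("/g/112yfy2xr, 2\n/m/02lx2r, 0\n")
relation_file = io.StringIO("/music/album/release_type, 192\n")
train_file = io.StringIO("2, 192, 0\n")

entities = read_id_map(entity_file)
relations = read_id_map(relation_file)
decoded = [
    (entities[h], relations[r], entities[t])
    for h, r, t in read_triples(train_file)
]
print(decoded)
```

To run this against the downloaded dataset, the `io.StringIO` objects would be replaced by file handles opened on the corresponding files of a variant folder. With tens of millions of entities per variant, loading a full ID map into a dict needs a correspondingly large amount of memory.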

Experiments & Results

We conducted all the link prediction experiments on the four datasets using the DGL-KE framework (Zheng et al., 2020).

The hyperparameters used for each experiment, its training/test time, and more details can be found in the script provided for each dataset.

The results of these experiments on our datasets are shown in the table below (MRR: mean reciprocal rank; MR: mean rank; H1 and H10: Hits@1 and Hits@10).

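| Model | FB-CVT-REV | | | | FB-CVT+REV | | | | FB+CVT-REV | | | | FB+CVT+REV | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | MRR | MR | H1 | H10 | MRR | MR | H1 | H10 | MRR | MR | H1 | H10 | MRR | MR | H1 | H10 |
| TransE | 0.806 | 5.869 | 0.757 | 0.884 | 0.976 | 1.529 | 0.968 | 0.988 | 0.781 | 4.850 | 0.708 | 0.902 | 0.970 | 1.464 | 0.957 | 0.989 |
| DistMult | 0.703 | 70.498 | 0.664 | 0.775 | 0.952 | 9.239 | 0.941 | 0.970 | 0.612 | 81.841 | 0.562 | 0.704 | 0.927 | 12.924 | 0.913 | 0.951 |
| ComplEx | 0.719 | 67.740 | 0.684 | 0.783 | 0.958 | 8.437 | 0.950 | 0.972 | 0.624 | 83.205 | 0.577 | 0.708 | 0.928 | 13.278 | 0.915 | 0.951 |
| TransR | 0.663 | 58.553 | 0.620 | 0.743 | 0.944 | 5.982 | 0.931 | 0.967 | 0.640 | 47.524 | 0.580 | 0.754 | 0.935 | 6.071 | 0.916 | 0.969 |
| RotatE | 0.804 | 75.721 | 0.780 | 0.845 | 0.962 | 10.431 | 0.956 | 0.974 | 0.736 | 68.436 | 0.699 | 0.807 | 0.948 | 10.263 | 0.938 | 0.969 |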
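For readers unfamiliar with the link prediction metrics, the Python sketch below shows how MRR, MR, H1, and H10 are computed from a list of ranks, where each rank is the position of the correct entity among the scored candidates for one test query. The ranks are made-up toy values, and the function name is hypothetical; how the ranks themselves are produced depends on the evaluation setup and is not covered here.

```python
def ranking_metrics(ranks, ks=(1, 10)):
    """Compute mean reciprocal rank (MRR), mean rank (MR), and Hits@k
    from 1-based ranks of the correct entities."""
    n = len(ranks)
    metrics = {
        "MRR": sum(1.0 / rank for rank in ranks) / n,
        "MR": sum(ranks) / n,
    }
    for k in ks:
        # Fraction of queries whose correct entity is ranked in the top k.
        metrics[f"H{k}"] = sum(rank <= k for rank in ranks) / n
    return metrics

toy_ranks = [1, 2, 4, 20]  # illustrative values only
metrics = ranking_metrics(toy_ranks)
print(metrics)
```

Higher is better for MRR and Hits@k, while lower is better for MR.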

Another way of evaluating embedding models is to find their performance on triple classification. This task is the binary classification of triples regarding whether they are true or false facts. The results of our triple classification task are shown in the tables below.

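| Model | consistent h | | | | inconsistent h | | | |
|---|---|---|---|---|---|---|---|---|
| | Precision | Recall | Acc | F1 | Precision | Recall | Acc | F1 |
| RESCAL | 0.59 | 0.37 | 0.55 | 0.45 | 0.95 | 0.83 | 0.89 | 0.89 |
| TransE | 0.52 | 0.59 | 0.52 | 0.55 | 0.81 | 0.69 | 0.76 | 0.74 |
| DistMult | 0.53 | 0.51 | 0.53 | 0.52 | 0.94 | 0.87 | 0.91 | 0.90 |
| ComplEx | 0.54 | 0.48 | 0.53 | 0.51 | 0.94 | 0.88 | 0.91 | 0.91 |
| ConvE | 0.54 | 0.53 | 0.54 | 0.53 | 0.57 | 0.72 | 0.59 | 0.64 |
| RotatE | 0.52 | 0.53 | 0.52 | 0.52 | 0.89 | 0.83 | 0.87 | 0.86 |

| Model | consistent t | | | | inconsistent t | | | |
|---|---|---|---|---|---|---|---|---|
| | Precision | Recall | Acc | F1 | Precision | Recall | Acc | F1 |
| RESCAL | 0.64 | 0.45 | 0.60 | 0.53 | 0.95 | 0.86 | 0.91 | 0.90 |
| TransE | 0.58 | 0.54 | 0.57 | 0.56 | 0.90 | 0.82 | 0.86 | 0.86 |
| DistMult | 0.59 | 0.55 | 0.58 | 0.57 | 0.95 | 0.89 | 0.92 | 0.92 |
| ComplEx | 0.60 | 0.56 | 0.59 | 0.58 | 0.95 | 0.90 | 0.93 | 0.92 |
| ConvE | 0.62 | 0.41 | 0.58 | 0.49 | 0.95 | 0.83 | 0.89 | 0.88 |
| RotatE | 0.60 | 0.47 | 0.58 | 0.53 | 0.87 | 0.78 | 0.83 | 0.82 |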

The experiments on triple classification were done using the LibKGE framework (Broscheit et al., 2020).
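The four triple classification metrics follow the usual binary classification definitions. The Python sketch below computes them from true and predicted labels, with 1 for a true fact and 0 for a false one. The labels are made-up toy values and the function name is hypothetical; how a model's scores are turned into predicted labels is not shown.

```python
def classification_metrics(y_true, y_pred):
    """Compute precision, recall, accuracy, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"Precision": precision, "Recall": recall, "Acc": accuracy, "F1": f1}

toy_true = [1, 1, 1, 0, 0, 0]  # illustrative values only
toy_pred = [1, 1, 0, 1, 0, 0]
scores = classification_metrics(toy_true, toy_pred)
print(scores)
```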

Scripts

Data Preparation Scripts

  • parse_triples.sh script is used for URI simplification.

  • FBDataDump.sh is a script that runs parse_triples.sh and creates several MySQL tables from the Freebase data dump, for example, tables for domains, types, properties, and entities. Command to run FBDataDump.sh:

    ./FBDataDump.sh mysql_username mysql_password

    After running FBDataDump.sh, you may want to run one of the four scripts provided, one for each variant. All four scripts separate the subject matter triples from the metadata and administrative triples. In addition, all of them create a type system for the final dataset. Command to run FBx.sh, where x ∈ {1,2,3,4}:

    ./FBx.sh mysql_username mysql_password

  • If you need to remove all the reverse triples as well as all the CVT nodes, you can run FB1.sh.

  • To keep the reverse triples but remove the CVT nodes, you can run FB2.sh.

  • To keep the CVT nodes but to remove the reverse triples, you can run FB3.sh.

  • To keep both CVT nodes and reverse triples, you can run script FB4.sh.
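The choice among the four scripts can be summarized in a small Python helper. It pairs each script with a variant name from the Dataset Statistics table according to which idiosyncrasies are kept; the function names are hypothetical, and the helper only builds the command as a list of arguments without running it.

```python
# Script for each (keep CVT nodes, keep reverse triples) combination,
# as described in the list above.
SCRIPTS = {
    (False, False): "FB1.sh",
    (False, True): "FB2.sh",
    (True, False): "FB3.sh",
    (True, True): "FB4.sh",
}

def variant_for(keep_cvt, keep_reverse):
    """Return the variant name and the script that produces it."""
    cvt_sign = "+" if keep_cvt else "-"
    rev_sign = "+" if keep_reverse else "-"
    name = f"FB{cvt_sign}CVT{rev_sign}REV"
    return name, SCRIPTS[(keep_cvt, keep_reverse)]

def build_command(keep_cvt, keep_reverse, mysql_username, mysql_password):
    """Build the argument list for the selected script (not executed here)."""
    _, script = variant_for(keep_cvt, keep_reverse)
    return [f"./{script}", mysql_username, mysql_password]

print(variant_for(keep_cvt=False, keep_reverse=True))
print(build_command(True, True, "mysql_username", "mysql_password"))
```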

Experiments Scripts

We ran experiments on the four variants of Freebase as well as FB15k and FB15k-237 using link prediction models such as TransE, DistMult, ComplEx, and RotatE. The scripts to run the experiments are the .sh files in ExperimentsScripts/. An example of running the DistMult model on FB1 is shown below.

dglke_train --model_name DistMult --dataset Freebase --data_path ./data --format udd_hrt \
    --data_files entity2id.txt relation2id.txt train.txt valid.txt test.txt --batch_size 1024 --neg_sample_size 256 --hidden_dim 400 --gamma 143.0 \
    --lr 0.08 --batch_size_eval 1000 --test -adv --mix_cpu_gpu --num_proc 8 --gpu 0 1 --max_step 300000 --neg_sample_size_eval 1000 \
    --eval_interval 100000 --log_interval 1000 --async_update --rel_part --force_sync_interval 10000 --num_thread 4 --no_save_emb --delimiter ,

Related Work

Please feel free to check out another paper of ours related to this topic: Realistic re-evaluation of knowledge graph completion methods: An experimental study

License

The dataset and code are made available under the CC0 1.0 Universal license.

Note: The Freebase Data Dumps are provided free of charge for any purpose. They are distributed under the Creative Commons Attribution (aka CC-BY) license, and their usage is subject to the Terms of Service.