
ChronoGrapher: Event-centric KG Construction via Informed Graph Traversal

This is the code for the paper submitted to the special issue of the Semantic Web Journal (SWJ) on Knowledge Graph Construction.

First, clone the repo:

git clone git@github.com:SonyCSLParis/graph_search_framework.git

1. Set Up Virtual Environment

We used Poetry and conda for virtual environment and dependency management.

The interface and traversal are implemented with Python 3.10.6.

First set up your virtual environment, then install Poetry to manage dependencies.

To install dependencies only

poetry install --no-root

Alternatively, you can use the full path to the poetry binary

  • ~/Library/"Application Support"/pypoetry/venv/bin/poetry on macOS.
  • ~/.local/share/pypoetry/venv/bin/poetry on Linux/Unix.
  • %APPDATA%\pypoetry\venv\Scripts\poetry on Windows.
  • $POETRY_HOME/venv/bin/poetry if $POETRY_HOME is set.

If you work on an Apple Silicon machine with conda, you might later be prompted to reinstall grpcio; you can do so using:

pip uninstall grpcio
conda install grpcio

Create a private.py file in the settings folder and add the following:

  • AGENT (user agent of your computer, for the SPARQL interface) [optional]
  • TOKEN (for Triply) [optional]
  • FOLDER_PATH (path to the git repository on your machine)
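
A minimal sketch of that file, with placeholder values to adapt to your own setup:

# settings/private.py -- minimal sketch, placeholder values to adapt
AGENT = "<your-user-agent>"        # optional, used for the SPARQL interface
TOKEN = "<your-triply-token>"      # optional, used for Triply DB
FOLDER_PATH = "/absolute/path/to/graph_search_framework"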

[For submission] We use an external package that is currently work in progress; for the purpose of this submission, we include it directly in this code. To install its dependencies, run:

cd kglab && python setup.py install

Then run the following to set up the package:

python setup.py install

2. Download data for the graph traversal

The main data format we used is the HDT compressed version of each dataset.

Some pointers for downloading the datasets:

  • Some datasets can be downloaded here.
  • DBpedia-2016-10 can also be downloaded here.

It's faster to directly download the .hdt file and the .hdt.index files.

In our experiments, we used the following:

  • Triply DB's HDT version of DBpedia (snapshot 2021-09)
  • Wikidata (2021-03-05)
  • YAGO4 downloaded from the website. We later used hdt-cpp to convert it to HDT format.

We put the datasets in the root directory of the repository, under the names dbpedia-snapshot-2021-09, wikidata-2021-03-05 and yago-2020-02-24 respectively. We query the HDT datasets using pyHDT.
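
As an illustration, a local HDT file can be queried with pyHDT roughly as in the sketch below; the file name inside the dataset folder is an assumption to adapt to your download.

# Sketch: querying a local HDT file with pyHDT.
# The file name inside the dataset folder is an assumption.
from hdt import HDTDocument

document = HDTDocument("dbpedia-snapshot-2021-09/dbpedia.hdt")

# All triples whose subject is the French Revolution resource ("" acts as a wildcard)
triples, cardinality = document.search_triples(
    "http://dbpedia.org/resource/French_Revolution", "", "")
print(f"{cardinality} matching triples")
for subject, predicate, obj in triples:
    print(subject, predicate, obj)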

We occasionally worked with Triply DB's data.


3. Run the search

We include some sample data in the sample-data folder.

Before running the search, you need to extract domain, range and superclass information from the dataset you downloaded. See src/extract_domain_range.py for further information and the command lines to run, depending on your dataset.

You can run a search using this sample data from the root directory by running:

python src/framework.py -j sample-data/French_Revolution_config.json

The results will be saved in the experiments folder in the root directory, in a folder whose name starts with <date>-<dataset_type>-<name_exp>.

You can change the content of this configuration file. Some changes can be applied immediately, while others require downloading additional data (cf. Section 4).

Click here to learn more about the config file

Parameters that don't require additional data to be downloaded:

  • rdf_type: the types of nodes you want to retrieve. Keys should be strings (type names), and values the string URIs of those node types. In our experiments, we are mainly interested in events.
  • predicate_filter: list of predicates that are not taken into account for the search
  • start: node to start the search from
  • start_date: starting date of that start node
  • end_date: ending date of that start node
  • iterations: number of iterations for the search. The higher the number, the longer it will take to run.
  • type_ranking: the type of ranking to use for paths.
  • type_interface: type of interface used, in practice hdt only.
  • type_metrics: the metrics that are computed, should be a sub-list of ["precision", "recall", "f1"]
  • ordering and domain_range: booleans, to activate or deactivate these options
  • filtering: same as above
  • name_exp: name of your experiment, for the saving folder
  • dataset_type: type of dataset, depending on the one you have
  • dataset_path: path to the dataset folder
  • nested_dataset: boolean, whether your dataset is nested (decomposed in smaller chunks) or not

Parameters that require additional data to be downloaded (cf. Section 4 for further details; a sketch of a full configuration file follows this list):

  • gold_standard: .csv path to the gold standard events
  • referents: .json path to the URI referents
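
To make the parameters above concrete, the snippet below builds a hypothetical configuration and writes it to JSON. Every value (URIs, dates, paths, ranking name, flag structure, file names) is an illustrative assumption; the shipped sample-data/French_Revolution_config.json is the authoritative reference.

# Sketch: build a configuration with the parameters described above and save it as JSON.
# All values below are illustrative assumptions, not the shipped sample config.
import json

config = {
    "rdf_type": {"event": "http://dbpedia.org/ontology/Event"},
    "predicate_filter": ["http://dbpedia.org/ontology/wikiPageWikiLink"],
    "start": "http://dbpedia.org/resource/French_Revolution",
    "start_date": "1789-05-05",
    "end_date": "1799-11-09",
    "iterations": 10,
    "type_ranking": "pred_freq",          # placeholder: use a ranking implemented in the framework
    "type_interface": "hdt",
    "type_metrics": ["precision", "recall", "f1"],
    "ordering": True,                     # flag structure is an assumption
    "domain_range": True,
    "filtering": True,
    "name_exp": "french_revolution",
    "dataset_type": "dbpedia",
    "dataset_path": "./dbpedia-snapshot-2021-09",
    "nested_dataset": True,
    "gold_standard": "sample-data/French_Revolution_gs_events.csv",   # hypothetical file name
    "referents": "sample-data/French_Revolution_referents.json",      # hypothetical file name
}

with open("my_config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)

The resulting my_config.json can then be passed to the framework with python src/framework.py -j my_config.json.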

4. Download data for ground truth comparison

If you have downloaded DBpedia, Wikidata or YAGO, it is possible to run the search with any event that is both in EventKG and in your dataset. We used EventKG 3.1 in our experiments.

We provide three notebooks in the notebooks folder to extract the additional data needed to run the search. You will also need to download GraphDB to set up a local SPARQL endpoint.

- Preprocessing EventKG and loading it into GraphDB

Corresponding notebook: eventkg-filtering.ipynb

  • Pre-requisites. Download EventKG & GraphDB (links in paragraphs above).
  • Main motivation. Problems when parsing EventKG data into GraphDB, and the need to work with a smaller subset of EventKG.
  • How. Using dask and pandas to read the data and select only the parts we were interested in for our queries, plus some preprocessing.
  • Main usage. Load the newly saved data into GraphDB to set up a local SPARQL endpoint.

Before running one of the two notebooks below, make sure that the data has been loaded into a GraphDB repository and that the endpoint is active.
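
A quick sanity check that the endpoint answers can look like the sketch below; the repository name ("eventkg") and GraphDB's default port 7200 are assumptions to adapt.

# Sketch: sanity-check the local GraphDB SPARQL endpoint.
# The repository name ("eventkg") and the default port 7200 are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://localhost:7200/repositories/eventkg")
endpoint.setQuery("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
print("Triples in repository:", results["results"]["bindings"][0]["n"]["value"])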

- Extracting info for one event in a dataset

Corresponding notebook: eventkg-info-one-event.ipynb

  • Pre-requisites. Data loaded into GraphDB and SPARQL endpoint active. If you additionally want to run the search at the end of the notebook, you also need to have downloaded the dataset for the search.
  • Main motivation. Running the search with more events than the one in sample-data.
  • How: SPARQL queries to extract ground truth events, referents, start and end dates for an event and to generate a config file to run a search.
  • Main usage. Config file for the graph search.

- Extracting info for all events in a dataset

Corresponding notebook: eventkg-retrieving-events.ipynb

  • Prerequisites. Data loaded into GraphDB and SPARQL endpoint active. If you additionally want to run the search at the end of the notebook, you also need to have downloaded the dataset for the search.
  • Main motivation. Running the search with all events in a dataset.
  • How. SPARQL queries.
  • Main purpose. Config files for the graph search.

5. Reproducibility

There are different scripts to run to reproduce the experiments described in the paper. First make sure that you have downloaded the data (cf. Sections above).

All the experiments are described in a separate README in the experiments_run folder; please refer to it for additional information.

6. Run the interface


We also implemented an interface to compare the impact of the filters and parameters on the search (ordering and filtering from the config description in Section 3 of the README). When comparing two sets of parameters, the search is also run in the backend.

To run a search, you might need to extract some additional information (cf. Section 4 of the README).

In the terminal, go to the app folder:

cd app

First open the variables.py file in that folder. You can add information on the dataset(s) you are using (VARIABLES_DATASET). As specified in that file, you need to enter details about dataset_path, data_files_path (the folder where the ground truth, referents and config files are stored), start_uri and nested_dataset. You can also change the default values that will be displayed (DEFAULT_VARIABLES).
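
A purely illustrative sketch of those variables; the exact structure expected by the app is documented in variables.py itself, and the layout below is an assumption.

# app/variables.py -- illustrative sketch; the exact structure expected by the app
# is documented in the file itself, the layout below is an assumption.
VARIABLES_DATASET = {
    "dbpedia": {
        "dataset_path": "../dbpedia-snapshot-2021-09",
        "data_files_path": "../sample-data",   # ground truth, referents and config files
        "start_uri": "http://dbpedia.org/resource/",
        "nested_dataset": True,
    },
}

DEFAULT_VARIABLES = {
    "dataset": "dbpedia",
    "start": "http://dbpedia.org/resource/French_Revolution",
}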

Then run the following to launch the interface:

streamlit run app.py

Depending on the parameter and event that you choose, running the search can be slow. Likewise, displaying the HTML graphs can be slow.


7. Other

If you want __pycache__ content or other generated files removed, you can run:

sh clean.sh

Python unit tests were created to test the different components of the graph search framework. To run them all, run the following in a terminal (from the root directory of the repository):

coverage run -m unittest discover -s src/tests/
coverage html
open -a Safari htmlcov/index.html