This is the code for the paper submitted to the special issue of the SWJ on Knowledge Graph Construction.
First, clone the repository:
git clone git@github.com:SonyCSLParis/graph_search_framework.git
We used Poetry and conda for virtual-environment and dependency management.
The interface and traversal are implemented with Python 3.10.6.
First set up your virtual environment, then download Poetry and install the dependencies:
poetry install --no-root
Alternatively, you can use the full path to the poetry binary:
- ~/Library/"Application Support"/pypoetry/venv/bin/poetry on macOS
- ~/.local/share/pypoetry/venv/bin/poetry on Linux/Unix
- %APPDATA%\pypoetry\venv\Scripts\poetry on Windows
- $POETRY_HOME/venv/bin/poetry if $POETRY_HOME is set
If you work on an Apple Silicon machine with conda, you might later be prompted to reinstall grpcio. You can do so with:
pip uninstall grpcio
conda install grpcio
Create a private.py file in the settings folder and add the following:
- AGENT (of your computer, for the SPARQL interface) [optional]
- TOKEN (for Triply) [optional]
- FOLDER_PATH (of the git repository on your machine)
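For illustration, a private.py could look like the sketch below; the values are placeholders to replace with your own.

# settings/private.py - placeholder values, adapt to your machine
AGENT = "user-agent string sent with SPARQL requests"  # optional
TOKEN = "your Triply API token"  # optional
FOLDER_PATH = "/absolute/path/to/graph_search_framework"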
[For submission] We use an external package that is currently WIP; for the purpose of this submission we include it directly in this code. To install its dependencies, run:
cd kglab && python setup.py install
Then run the following to set up the package:
python setup.py install
The main data format we used is the HDT compressed version of the datasets.
Some pointers for downloading the datasets: it is faster to directly download the .hdt file and the .hdt.index file.
Within our experiments, we used the following:
- Triply DB's HDT version of DBpedia (snapshot 2021-09)
- Wikidata (2021-03-05)
- YAGO4 downloaded from the website. We later used hdt-cpp to convert it to HDT format.
We put the datasets in the root directory of the repository, under the names dbpedia-snapshot-2021-09, wikidata-2021-03-05 and yago-2020-02-24, respectively. We query the HDT datasets using pyHDT.
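To check that a downloaded dataset is queryable, a minimal pyHDT sketch such as the one below can be used; the HDT file name and the example URI are placeholders to adapt.

from hdt import HDTDocument

# Placeholder path: point to the .hdt file you downloaded
document = HDTDocument("dbpedia-snapshot-2021-09/dbpedia.hdt")

# Retrieve all triples whose subject is the example entity
triples, cardinality = document.search_triples(
    "http://dbpedia.org/resource/French_Revolution", "", "")
print(cardinality)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)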
We occasionally worked with Triply DB's data:
- DBpedia 2021-09. We used API calls to https://api.triplydb.com/datasets/DBpedia-association/snapshot-2021-09/fragments/?limit=10000.
We include some sample data in the sample-data folder.
Before running the search, you need to extract domain, range and superclass information from the dataset you downloaded. See the file src/extract_domain_range.py for further information and the command lines to run it, depending on your dataset.
You can run one search using this sample data from the root directory, by running:
python src/framework.py -j sample-data/French_Revolution_config.json
The results will be saved in the experiments folder in the root directory, in a folder whose name starts with <date>-<dataset_type>-<name_exp>.
You can change the content of this configuration file. Some changes can be immediate, while others will require additional data to be downloaded (cf. Section 4 to add further data for the search).
More about the config file:
Parameters that don't require additional data to be downloaded:
- rdf_type: the type of nodes you want to retrieve. Keys should be a string, and values the string URI of that node type. In our experiments, we are mainly interested in events.
- predicate_filter: list of predicates that are not taken into account for the search
- start: node to start the search from
- start_date: starting date of that start node
- end_date: ending date of that start node
- iterations: number of iterations for the search. The higher the number, the longer it will take to run.
- type_ranking: the type of ranking to use for paths
- type_interface: type of interface used, in practice hdt only
- type_metrics: the metrics that are computed, should be a sub-list of ["precision", "recall", "f1"]
- ordering and domain_range: boolean, to activate or not this parameter
- filtering: same as above
- name_exp: name of your experiment, for the saving folder
- dataset_type: type of dataset, depending on the one you have
- dataset_path: path to the dataset folder
- nested_dataset: boolean, whether your dataset is nested (decomposed into smaller chunks) or not
Parameters that require additional data to be downloaded - cf. Section 4 for further details:
- gold_standard: .csv path to the gold standard events
- referents: .json path to the URI referents
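As an illustration of how these parameters fit together, the sketch below builds a hypothetical config file in Python; all values are placeholders, and sample-data/French_Revolution_config.json remains the authoritative example.

import json

# Hypothetical configuration: keys follow the parameter list above,
# values are placeholders to adapt to your own dataset and experiment.
config = {
    "rdf_type": {"event": "http://dbpedia.org/ontology/Event"},
    "predicate_filter": ["http://dbpedia.org/ontology/wikiPageWikiLink"],
    "start": "http://dbpedia.org/resource/French_Revolution",
    "start_date": "1789-05-05",
    "end_date": "1799-11-09",
    "iterations": 10,
    "type_ranking": "pred_freq",  # placeholder: see the sample config for valid rankings
    "type_interface": "hdt",
    "type_metrics": ["precision", "recall", "f1"],
    "ordering": True,
    "domain_range": True,
    "filtering": True,
    "name_exp": "french_revolution",
    "dataset_type": "dbpedia",
    "dataset_path": "./dbpedia-snapshot-2021-09",
    "nested_dataset": True,
    "gold_standard": "path/to/gold_standard_events.csv",  # placeholder path
    "referents": "path/to/referents.json",  # placeholder path
}

with open("my_config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=4)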
If you have downloaded DBpedia, Wikidata or YAGO, it is possible to run the search with any event that is both in EventKG and in your dataset. We used EventKG 3.1 in our experiments.
We propose 3 notebooks in the notebooks folder to extract additional data to run the search. You will also need to download GraphDB to set up a local SPARQL endpoint.
Corresponding notebook: eventkg-filtering.ipynb
- Pre-requisites. Download EventKG & GraphDB (links in paragraphs above).
- Main motivation. Problems when parsing EventKG data to GraphDB + working with a smaller subset of EventKG.
- How. Using dask and pandas to read the data and only select the parts we were interested in for our queries + some preprocessing.
- Main usage. Load the newly saved data into GraphDB to set up a local SPARQL endpoint.
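The notebook itself is the reference; as a rough illustration of this kind of filtering, a sketch with dask could look like the following (the file names and the filtering condition are assumptions):

import dask.bag as db

# Placeholder input: one of the EventKG .nq dumps you downloaded
lines = db.read_text("eventkg/events.nq", blocksize="64MB")

# Keep only the lines mentioning a predicate of interest (placeholder condition),
# then write the reduced files, ready to be loaded into GraphDB.
filtered = lines.filter(lambda line: "hasBeginTimeStamp" in line)
filtered.to_textfiles("eventkg-filtered/*.nq")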
Before running one of the two notebooks below, you need to make sure that the data was loaded into a GraphDB repository and that the endpoint is active.
Corresponding notebook: eventkg-info-one-event.ipynb
- Pre-requisites. Data loaded into GraphDB and SPARQL endpoint active. If you additionally want to run the search at the end of the notebook, you need to have the dataset for search downloaded as well.
- Main motivation. Running the search with more events than the ones in sample-data.
- How. SPARQL queries to extract ground truth events, referents, start and end dates for an event and to generate a config file to run a search.
- Main usage. Config file for the graph search.
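As a rough illustration of such a query, the sketch below asks a local GraphDB endpoint for the begin and end timestamps of one event; the endpoint URL, repository name and event URI are assumptions to adapt to your setup.

from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint: GraphDB exposes repositories under /repositories/<name>
sparql = SPARQLWrapper("http://localhost:7200/repositories/eventkg")
sparql.setReturnFormat(JSON)

# Placeholder event URI; EventKG uses the SEM vocabulary for temporal information
sparql.setQuery("""
PREFIX sem: <http://semanticweb.cs.vu.nl/2009/11/sem/>
SELECT ?begin ?end WHERE {
  <http://eventKG.l3s.uni-hannover.de/resource/event_1> sem:hasBeginTimeStamp ?begin ;
                                                        sem:hasEndTimeStamp ?end .
} LIMIT 1
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["begin"]["value"], binding["end"]["value"])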
Corresponding notebook: eventkg-retrieving-events.ipynb
- Pre-requisites. Data loaded into GraphDB and SPARQL endpoint active. If you additionally want to run the search at the end of the notebook, you need to have the dataset for search downloaded as well.
- Main motivation. Running the search with all events in a dataset.
- How. SPARQL queries.
- Main purpose. Config files for the graph search.
There are different scripts to run to reproduce the experiments described in the paper. First, make sure that you have downloaded the data (cf. the sections above).
All the experiments are described in a separate README in the experiments_run
folder, please refer to it for additional information.
We also implemented an interface to compare the impact of the filters and parameters on the search - ordering and filtering from the config description in Section 3 of the README. By comparing two sets of parameters, you will also run the search in the backend.
To run a search, you might need to extract some additional information (cf. Section 4 of the README).
In the terminal, go to the app folder:
cd app
First open the variables.py file in that folder. You can add information on the dataset(s) you are using (VARIABLES_DATASET). As specified in that file, you need to enter details about dataset_path, data_files_path (the folder where the ground truth, referents and config files are stored), start_uri and nested_dataset. You can also change the default values that will be displayed (DEFAULT_VARIABLES).
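For illustration only, such a configuration could look like the hypothetical sketch below; the exact structure expected by the app is documented in the variables.py file itself.

# app/variables.py - hypothetical sketch with placeholder values
VARIABLES_DATASET = {
    "dbpedia": {
        "dataset_path": "./dbpedia-snapshot-2021-09",
        "data_files_path": "./sample-data",
        "start_uri": "http://dbpedia.org/resource/",
        "nested_dataset": True,
    },
}

DEFAULT_VARIABLES = {
    "dataset": "dbpedia",
    "start": "http://dbpedia.org/resource/French_Revolution",
}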
Then run the following to launch the interface:
streamlit run app.py
Depending on the parameter and event that you choose, running the search can be slow. Likewise, displaying the HTML graphs can be slow.
If you want the pycache content or other generated files removed, you can run:
sh clean.sh
Python unit tests were created to test the different components of the graph search framework. To run them all, run the following in a terminal (from the root directory of the repository):
coverage run -m unittest discover -s src/tests/
coverage html
open -a Safari htmlcov/index.html