This repository contains a benchmark evaluation of knowledge editing (KE) using logical rules. Our methodology uses multi-hop questions, generated with logical rules, to evaluate the effectiveness of knowledge editing methods. We conducted experiments on the popular approaches ROME, FT, and KN; the results show a considerable performance gap of up to 24% between evaluations on directly edited knowledge and on entailed knowledge, particularly for ROME and FT.
To start, install the required packages:

```shell
cd evaluate_rules
pip install torch
pip install -r requirements.txt
```

Ensure that all dependencies are correctly installed.
To get triples from the KG, we extract the entities in MLaKE and MQuAKE and use them to query the DICE DBpedia endpoint and construct our sub-knowledge graph. We use the following SPARQL query:
```python
query = f"""SELECT ?s ?p ?o
WHERE {{
  {{
    SELECT ?s ?p ?o WHERE {{
      VALUES ?s {{ <{entity_uris}> }}
      ?s ?p ?o
      FILTER (
        ?p != <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> &&
        ?p != <http://www.w3.org/2000/01/rdf-schema#label> &&
        ?p != <http://www.w3.org/2002/07/owl#sameAs> &&
        ?p != <http://dbpedia.org/property/wikiPageUsesTemplate> &&
        ?p != <http://dbpedia.org/ontology/wikiPageRedirects> &&
        ?p != <http://dbpedia.org/ontology/almaMater> &&
        ?p != <http://dbpedia.org/ontology/wikiPageExternalLink> &&
        ?p != <http://dbpedia.org/ontology/wikiPageWikiLink> &&
        ?p != <http://www.w3.org/2000/01/rdf-schema#comment>
      )
    }}
    LIMIT 100
  }} UNION
  {{
    SELECT ?s ?p ?o WHERE {{
      VALUES ?o {{ <{entity_uris}> }}
      ?s ?p ?o
      FILTER (
        ?p != <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> &&
        ?p != <http://www.w3.org/2000/01/rdf-schema#label> &&
        ?p != <http://www.w3.org/2002/07/owl#sameAs> &&
        ?p != <http://dbpedia.org/property/wikiPageUsesTemplate> &&
        ?p != <http://dbpedia.org/ontology/wikiPageRedirects> &&
        ?p != <http://dbpedia.org/ontology/almaMater> &&
        ?p != <http://dbpedia.org/ontology/wikiPageExternalLink> &&
        ?p != <http://dbpedia.org/ontology/wikiPageWikiLink> &&
        ?p != <http://www.w3.org/2000/01/rdf-schema#comment>
      )
    }}
    LIMIT 100
  }}
}}"""
```
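The `{entity_uris}` placeholder is filled in per batch of entities. A minimal sketch of assembling such a query string for a list of entity URIs (the helper name, the abbreviated predicate list, and the single trailing `LIMIT` are illustrative simplifications of the query above):

```python
# Predicates excluded from the sub-KG; abbreviated here for brevity.
EXCLUDED_PREDICATES = [
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2002/07/owl#sameAs",
]

def build_query(entity_uris, limit=100):
    """Build a SPARQL query fetching triples where any given entity
    appears as subject or as object, minus the excluded predicates."""
    values = " ".join(f"<{u}>" for u in entity_uris)
    filters = " && ".join(f"?p != <{p}>" for p in EXCLUDED_PREDICATES)
    return (
        f"SELECT ?s ?p ?o WHERE {{ "
        f"{{ VALUES ?s {{ {values} }} ?s ?p ?o FILTER ({filters}) }} UNION "
        f"{{ VALUES ?o {{ {values} }} ?s ?p ?o FILTER ({filters}) }} "
        f"}} LIMIT {limit}"
    )

q = build_query(["http://dbpedia.org/resource/Paris"])
print(q)
```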
To get the triples files in the corrected form, run:

```shell
cd evaluate_rules/
python SparqlQuery.py
```
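The "corrected form" is a plain-text triples file that AMIE can consume. A hedged sketch of the conversion step, assuming standard SPARQL JSON result bindings and one tab-separated triple per line (the exact output format is defined in `SparqlQuery.py`):

```python
# Convert SPARQL JSON result bindings into one tab-separated triple
# per line, the plain-text format the rule miner reads.
def bindings_to_triples(bindings):
    lines = []
    for b in bindings:
        s = b["s"]["value"]
        p = b["p"]["value"]
        o = b["o"]["value"]
        lines.append(f"<{s}>\t<{p}>\t<{o}>")
    return "\n".join(lines)

sample = [{"s": {"value": "http://dbpedia.org/resource/Paris"},
           "p": {"value": "http://dbpedia.org/ontology/country"},
           "o": {"value": "http://dbpedia.org/resource/France"}}]
triples_text = bindings_to_triples(sample)
print(triples_text)
```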
To get the rules, run the following commands:

```shell
cd evaluate_rules/amie
java -jar amie-dev.jar -mins 1 ../all_triples/processed_triples3.txt > ../all_triples/output_file.txt
```
Make sure you have a recent version of [Java] installed to run AMIE, download an AMIE executable JAR file [AMIE-JAR], and run the commands above.
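AMIE writes mined rules as tab-separated lines whose first column is the rule itself (`body atoms => head atom`), followed by quality metrics such as head coverage and confidence. A small sketch of parsing one such line into structured atoms (the sample line and its metric columns are illustrative; check the header AMIE prints for the exact column order):

```python
# Parse one rule line from AMIE's tab-separated output into
# (body_atoms, head_atom), each atom a (subject, predicate, object) tuple.
def parse_rule(line):
    rule = line.split("\t")[0]  # first column holds the rule text
    body_str, head_str = (part.strip() for part in rule.split("=>"))
    def atoms(s):
        toks = s.split()
        # Atoms come in groups of three tokens: ?s <p> ?o
        return [tuple(toks[i:i + 3]) for i in range(0, len(toks), 3)]
    return atoms(body_str), atoms(head_str)[0]

sample_line = ("?a <birthPlace> ?b  ?b <country> ?c => ?a <nationality> ?c"
               "\t0.42\t0.35")
body, head = parse_rule(sample_line)
print(body, head)
```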
To generate the questions and answers, run:

```shell
python generateQA.py
```
The script prints the question and answer derived from the given rules and facts and saves the QA dictionary to a file.
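The core idea is to chain facts along a mined rule into one multi-hop question whose answer is the tail of the chain. A minimal sketch under that assumption (the entities, relations, and question phrasing here are illustrative; the real templates live in `generateQA.py`):

```python
# Compose a two-hop question from two chained facts:
# (subject, rel1, bridge) and (bridge, rel2, answer).
def two_hop_qa(subject, rel1, bridge, rel2, answer):
    """Return a multi-hop question about `subject` whose gold
    answer is `answer`, reached through the bridge entity."""
    question = f"What is the {rel2} of the {rel1} of {subject}?"
    return question, answer

question, answer = two_hop_qa("Eiffel Tower", "country", "France",
                              "capital", "Paris")
print(question)  # What is the capital of the country of Eiffel Tower?
```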
The datasets used in the experiments (all triples, rules, and multi-hop QA pairs for each dataset) are found in /evaluate_rules/all_triples.
We first run KE methods over the selected datasets (MLaKE and MQuAKE) and save the model weights. To do so, clone the ROME repository into your local folder and run the following commands:
```shell
git clone https://github.com/kmeng01/rome.git
cd rome/rome  # or cd rome/baselines for the other KE methods
python rome_main.py --model_name openai-community/gpt2-large --dataset_path ../evaluate_rules/all_triples/MLaKE/new_en-qa.json --config ../hparams/ROME/gpt2-large.json --save_dir edited_models  # feel free to change the model and dataset paths
```
Config files for each KE method can be found in /hparams, and the other KE methods are placed in /baselines. Examples of Python code used to run each KE method are found in the /examples folder.
To evaluate existing KE techniques on directly edited or correlated knowledge after saving the model weights, run the following commands:
```shell
python evaluate_rules/rome_eval.py         # for correlated knowledge
python evaluate_rules/rome_eval_direct.py  # for directly edited knowledge
```
This will save the evaluation results in /results.
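The tables below report F1 scores. A common way to compute token-overlap F1 for QA answers, shown here as a hedged sketch (the repository's exact metric implementation lives in the evaluation scripts):

```python
from collections import Counter

# Token-overlap F1 between a predicted answer and the gold answer,
# as commonly used in QA evaluation.
def f1_score(prediction, gold):
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the city of Paris", "Paris"))  # 0.4
```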
| Models | MLaKE (F1) | MQuAKE (F1) |
|---|---|---|
| gpt2-medium | 16.36 | 4.21 |
| gpt2-large | 10.23 | 2.13 |
| gpt2-xl | 12.90 | 1.51 |
| gpt-j | 8.56 | - |
| **Correlated knowledge** | | |
| gpt2-medium | 8.67 | 1.91 |
| gpt2-large | 6.08 | 2.42 |
| gpt2-xl | 7.17 | 3.84 |
| gpt-j | 15.91 | - |
| Models | MLaKE (F1) | MQuAKE (F1) |
|---|---|---|
| gpt2-medium_constr | 15.18 | 4.97 |
| gpt2-large_constr | 24.58 | 9.10 |
| gpt2-xl_constr | 17.15 | 4.25 |
| **Correlated knowledge** | | |
| gpt2-medium_constr | 0.90 | 0.008 |
| gpt2-large_constr | 0.45 | 0.28 |
| gpt2-xl_constr | 0.63 | 0.0 |
| Models | MLaKE (F1) | MQuAKE (F1) |
|---|---|---|
| gpt2-xl | 1.34 | 4.67 |
| **Correlated knowledge** | | |
| gpt2-xl | 14.26 | 18.53 |
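The up-to-24% gap quoted in the introduction can be read off these tables. For example, on MLaKE the `_constr` runs drop from 24.58 F1 on directly edited knowledge to 0.45 F1 on correlated knowledge for gpt2-large:

```python
# Largest direct-vs-correlated F1 gap across the reported MLaKE _constr runs.
direct = {"gpt2-medium_constr": 15.18, "gpt2-large_constr": 24.58,
          "gpt2-xl_constr": 17.15}
correlated = {"gpt2-medium_constr": 0.90, "gpt2-large_constr": 0.45,
              "gpt2-xl_constr": 0.63}
gaps = {m: direct[m] - correlated[m] for m in direct}
worst = max(gaps, key=gaps.get)
print(worst, round(gaps[worst], 2))  # gpt2-large_constr 24.13
```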
In the future, we will extend these experiments with other KE methods, such as:

- MEND
- KE
- MeLLo

and include the following knowledge graphs:

- Wikidata
- YAGO
| Model | Status |
|---|---|
| LLama2 | Upcoming |
| GPT-3-based architectures | In progress |
| Mistral | Upcoming |
Moteu Ngoli, T. (2025). Benchmarking Knowledge Editing using Logical Rules (1.0.0) [Data set]. The 24th International Semantic Web Conference (ISWC 2025), Nara, Japan. Zenodo. https://doi.org/10.5281/zenodo.15697400
Feel free to contact us at tatianam@mail.uni-paderborn.de if you have any questions.
