This is the repository for our paper MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions.
In this paper, we introduce a benchmark for knowledge editing, MQuAKE, which comprises multi-hop questions that assess whether edited models correctly answer questions where the answer should change as an entailed consequence of edited facts.
We also propose a simple memory-based approach, MeLLo, which can scale with LLMs (up to 175B) and outperforms previous model editors by a large margin.
Please see our paper for more details.
[2024/9 Update]
We have resolved a knowledge conflict issue in the original MQuAKE-CF-3k dataset. We updated this subset in `datasets/MQuAKE-CF-3k-v2.json` and updated the results in our paper. We recommend that future researchers follow this setting as well.
MQuAKE includes a dataset MQuAKE-CF based on counterfactual edits, and another dataset MQuAKE-T of temporal knowledge updates to evaluate model editors on real-world changes.
The datasets are included in `datasets/`. There are four files:

- `MQuAKE-CF-3k-v2.json`: a counterfactual dataset containing 3,000 instances. The results shown in the current version of our paper are based on this dataset (as mentioned in footnote 2 of the paper).
- `MQuAKE-CF.json`: the full counterfactual dataset containing 9,218 instances.
- `MQuAKE-T.json`: the temporal-based dataset containing 1,825 instances, designed to evaluate knowledge editing methods on real-world changes.
- `MQuAKE-CF-3k.json`: the first version of `MQuAKE-CF-3k`, where knowledge conflicts could arise when conducting multi-edit experiments.
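Assuming the repository root as the working directory, any of these subsets can be loaded with a few lines of Python. This is a minimal sketch; `load_mquake` is a hypothetical helper, not part of the repo:

```python
import json

def load_mquake(path):
    """Load an MQuAKE subset (a JSON list of instance dicts)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Example (path assumed relative to the repository root):
# dataset = load_mquake("datasets/MQuAKE-CF-3k-v2.json")
# print(len(dataset))  # expected: 3000
```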
The dataset is saved as a list of dicts, each of which represents a data instance. An example in `MQuAKE-CF` is shown below.
```json
{
  "case_id": 1561,
  "requested_rewrite": [
    {
      "prompt": "{} is associated with the sport of",
      "relation_id": "P641",
      "target_new": {"str": "cricket", "id": "Q5375"},
      "target_true": {"str": "association football", "id": "Q2736"},
      "subject": "Dudley Town F.C.",
      "question": "Which sport is Dudley Town F.C. associated with?"
    },
    ...
  ],
  "questions": [
    "What is the capital of the country where Dudley Town F.C.'s sport originated?",
    "Which city serves as the capital of the country where the sport played by Dudley Town F.C. originated?",
    "Which city is the capital of the country where the sport of Dudley Town F.C. was created?"
  ],
  "answer": "London",
  "answer_alias": ["London UK", ...],
  "new_answer": "Oderzo",
  "new_answer_alias": [],
  "single_hops": [
    {
      "question": "Which sport is Dudley Town F.C. associated with?",
      "cloze": "Dudley Town F.C. is associated with the sport of",
      "answer": "association football",
      "answer_alias": ["football", ...]
    },
    ...
  ],
  "new_single_hops": [...],
  "orig": {
    "triples": [
      ["Q5311995", "P641", "Q2736"],
      ["Q2736", "P495", "Q21"],
      ["Q21", "P36", "Q84"]
    ],
    "triples_labeled": [
      ["Dudley Town F.C.", "sport", "association football"],
      ...
    ],
    "new_triples": [...],
    "new_triples_labeled": [...],
    "edit_triples": [
      ["Q5311995", "P641", "Q5375"],
      ["Q5375", "P495", "Q408"],
      ...
    ]
  }
}
```
- `requested_rewrite`: a list of the edited facts that we want to inject into the language model. In general, we follow the format of the `Counterfact` dataset. We use a cloze-style statement for the edits and separately specify the subject tokens, which are used in some baselines (e.g., ROME, MEMIT).
- `questions`: three multi-hop questions generated by `gpt-3.5-turbo`. We evaluate the edited language model on all three questions and regard the edit as successful if the edited model can answer any of them.
- `answer` and `answer_alias`: the gold answer before injecting new facts into language models. `answer_alias` is a list of aliases of the answer extracted from Wikidata.
- `new_answer` and `new_answer_alias`: the gold answer after injecting new facts into language models.
- `single_hops`: the single-hop questions associated with the chain of facts before editing. These questions are used to test whether a language model has encoded all the single-hop facts needed to answer the multi-hop questions.
- `new_single_hops`: the single-hop questions associated with the chain of facts after editing.
- `orig`: the raw data from Wikidata.
  - `triples` and `new_triples`: the corresponding lists of `(s, r, o)` fact triples before and after editing.
  - `triples_labeled` and `new_triples_labeled`: the lists of labeled fact triples.
  - `edit_triples`: the list of edited facts `(s, r, o*)` that we want to inject into language models.
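To illustrate how the answer fields fit together, here is a sketch of alias-aware answer matching. `answer_matches` is a hypothetical helper, and the exact string normalization used in our evaluation may differ:

```python
def answer_matches(prediction, instance, edited=True):
    """Check a prediction against the gold answer and its Wikidata aliases,
    post-edit (new_answer) or pre-edit (answer). Minimal sketch: exact match
    after lowercasing and stripping whitespace."""
    if edited:
        golds = [instance["new_answer"], *instance.get("new_answer_alias", [])]
    else:
        golds = [instance["answer"], *instance.get("answer_alias", [])]
    pred = prediction.strip().lower()
    return any(pred == g.strip().lower() for g in golds)

instance = {"answer": "London", "answer_alias": ["London UK"],
            "new_answer": "Oderzo", "new_answer_alias": []}
answer_matches("london uk", instance, edited=False)  # True (matches an alias)
answer_matches("London", instance, edited=True)      # False (post-edit gold is "Oderzo")
```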
For MQuAKE-T only:

- `answer_extended`: the extended gold answers before injecting new facts into language models. We extend the pre-edit gold answer for MQuAKE-T to minimize the effects of a mismatch between the LM training corpus and our Wikidata dump. This includes other possible gold answers besides the one we extract from our Wikidata dump (see Appendix E of our paper).
There are many ways to check whether a fact is stored in a language model or not, e.g., cloze-style statement vs question, in-context-learning vs zero-shot prompting, CoT vs standard prompting.
Below, we describe the evaluation setups that we use in our paper.
We follow the setups in prior work. We directly query the (edited) language models with a cloze-style statement (the same statement we used to inject the fact) without in-context-learning examples. In this case, the model output format is correct even without ICL, because the models are updated with the same cloze-style format and the likelihood of the gold answers is optimized when performing the edits.
When querying language models with questions instead, we use in-context-learning examples to prompt the models and ensure the output format is desirable.
For each relation type, we write a prompt with 8 demonstrations. The prompts we used for each relation can be found in `prompts/rel-prompts.json`.
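To make the cloze format concrete, the statement used to state (or check) an edited fact can be assembled from a `requested_rewrite` entry roughly like this. A sketch only; `cloze_statement` is a hypothetical helper, not part of the repo:

```python
def cloze_statement(rewrite, use_new=True):
    """Fill the cloze template of a requested_rewrite entry with its subject
    and the new (or original) target string."""
    target = rewrite["target_new" if use_new else "target_true"]["str"]
    return rewrite["prompt"].format(rewrite["subject"]) + " " + target

rewrite = {
    "prompt": "{} is associated with the sport of",
    "subject": "Dudley Town F.C.",
    "target_new": {"str": "cricket"},
    "target_true": {"str": "association football"},
}
cloze_statement(rewrite)
# "Dudley Town F.C. is associated with the sport of cricket"
```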
We use either standard prompting or chain-of-thought (CoT) prompting to query the model with multi-hop questions. We use in-context-learning in both cases to ensure the output format is desirable. The prompts we used can be found in `prompts/multihop-prompts.txt` and `prompts/multihop-cot-prompts.txt`.
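The multi-hop success criterion (an edit counts as successful if any of the three question paraphrases is answered correctly) can be sketched as follows. `query_model` stands in for whatever prompting pipeline is used and is an assumed interface:

```python
def multihop_edit_success(instance, query_model):
    """Return True if the model answers ANY of the instance's three multi-hop
    questions with the post-edit gold answer or one of its aliases.
    query_model is an assumed callable: question -> answer string."""
    golds = {a.strip().lower()
             for a in [instance["new_answer"], *instance.get("new_answer_alias", [])]}
    return any(query_model(q).strip().lower() in golds
               for q in instance["questions"])

# Usage with a stub model that always answers "Oderzo":
instance = {"questions": ["q1", "q2", "q3"],
            "new_answer": "Oderzo", "new_answer_alias": []}
multihop_edit_success(instance, lambda q: "Oderzo")  # True
```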
We propose a simple but effective method, MeLLo, which (1) decomposes a multi-hop question into subquestions; (2) prompts the base language model to provide tentative answers to the subquestions; and (3) self-checks whether the tentative answers contradict any edited facts in the memory. See more details in our paper.
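The three steps above can be sketched as a simple control loop. This is a rough illustration only; all callables (`generate`, `retrieve`, `contradicts`) are assumed interfaces rather than the repo's actual API, and the real prompting details live in `run_mello.ipynb`:

```python
def mello(question, generate, retrieve, contradicts, max_hops=4):
    """Rough sketch of MeLLo's loop. Assumed interfaces:
    generate(context) -> (subquestion, tentative_answer, done);
    retrieve(subquestion) -> most relevant edited fact in memory, or None;
    contradicts(subquestion, tentative_answer, fact) -> bool, the LM
    self-check of whether the retrieved fact contradicts the answer."""
    context = question
    final = None
    for _ in range(max_hops):
        # (1) decompose the question, (2) tentatively answer the subquestion.
        subq, tentative, done = generate(context)
        # (3) retrieve a possibly relevant edited fact and self-check it;
        # if the fact contradicts the tentative answer, prefer the fact.
        fact = retrieve(subq)
        answer = fact["answer"] if fact and contradicts(subq, tentative, fact) else tentative
        context += f"\nSubquestion: {subq}\nAnswer: {answer}"
        final = answer
        if done:
            break
    return final
```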
The in-context-learning examples we used in MeLLo can be found in `prompts/MeLLo-prompts.txt`.
A Python notebook for running MeLLo on `text-davinci-003` is here: `run_mello.ipynb`.
If you have any questions related to the repo or the paper, or you encounter any problems when using the datasets/code, feel free to email Zexuan Zhong (zzhong@cs.princeton.edu)
or open an issue!
If you use our code in your research, please cite our work:
```bibtex
@article{zhong2023mquake,
  title={{MQuAKE}: Assessing Knowledge Editing in Language Models via Multi-Hop Questions},
  author={Zhong, Zexuan and Wu, Zhengxuan and Manning, Christopher D and Potts, Christopher and Chen, Danqi},
  journal={arXiv preprint arXiv:2305.14795},
  year={2023}
}
```