Code and data accompanying the paper "Q²: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering". Q² is a reference-free metric that aims to evaluate the factual consistency of knowledge-grounded dialogue systems. Our approach is based on automatic question generation and question answering.
The datasets are available in the third_party/data folder.
They are based on a subset of the Wizard of Wikipedia dataset.
Specifically, we ran inference using two dialogue systems and manually annotated each response for factual consistency.
The data is split across four files: consistent and inconsistent responses for each of the two systems (dodeca and memnet). Each file contains 150 responses.
In addition, we include the cross-annotation file used in our validation experiments.
The datasets are stored as pandas dataframes in csv files. Loading them should be as simple as:
In [1]: import pandas as pd
In [2]: data = pd.read_csv("third_party/data/dodeca_consistent.csv")
In [3]: data.head()
Out[3]:
episode_idx round ... response knowledge gold
0 0 0 ... i love gardening as well . it is considered to... Gardening is considered by many people to be a... I live on a farm, we garden all year long, it ...
1 1 2 ... it aired in the us , canada and latin america He was the creator and host of "The Joy of Pai... The show aired from 1983 to 1994 on PBS.
2 4 1 ... well , there are three categories of finance :... Finance can be broken into three sub-categorie... Great! There are three categories. The public,...
The key columns are:
- episode_idx: The index of the conversation in the WoW validation data.
- round: The turn in the conversation.
- response: The model's response for the given turn.
- knowledge: The knowledge given to the model for the given turn.
- gold: The gold-standard (human) response for the given turn.
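For example, a minimal sketch (not part of the repository) for iterating over the (knowledge, response) pairs that Q² scores, using the columns listed above:

import pandas as pd

# Load one of the annotated files (consistent responses of the dodeca system).
data = pd.read_csv("third_party/data/dodeca_consistent.csv")

# Each row pairs a model response with the knowledge it was grounded on.
for _, row in data.iterrows():
    knowledge, response = row["knowledge"], row["response"]
    print(response[:60], "||", knowledge[:60])  # placeholder: plug in your own consistency check here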
Running Q² requires:
- python 3.7
- numpy==1.19.2
- pandas==1.1.3
- bert-score==0.3.7
- spacy==2.3.2
- torch==1.6.0
- transformers==3.2.0
To compare against baselines and run validation experiments, you'll additionally need:
- scikit-learn==0.20.2
- scipy==1.1.0
- sacrebleu==1.4.14
For the NLI-based comparison:
- allennlp==1.0.0
- allennlp-models==1.0.0
To run Q², first run pipeline/run_pipeline.py and specify the parameters. Use the save_steps flag, which will later enable measuring answer similarity using an NLI system. For example:
python pipeline/run_pipeline.py \
--infile third_party/data/dodeca_inconsistent.csv \
--gen_method beam \
--q_per_cand single \
--personal remove \
--outfile dodeca_inconsistent_out \
--save_steps
The output will be two csv files: dodeca_inconsistent_out.csv and dodeca_inconsistent_out.steps.csv. The first contains the Q² score for each input example, without the NLI-based answer-span comparison (i.e., based on the token-level F1 comparison). The second contains the intermediate Q² steps, i.e., the generated questions and answers.
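Both files are plain csv and can be inspected with pandas. Since this README does not fix their exact column names, the sketch below (a hedged example, not part of the pipeline) simply lists them:

import pandas as pd

scores = pd.read_csv("dodeca_inconsistent_out.csv")        # per-example Q² scores (token-level F1 variant)
steps = pd.read_csv("dodeca_inconsistent_out.steps.csv")   # intermediate generated questions and answers

# Column names may differ between versions, so list them before relying on any of them.
print(scores.columns.tolist())
print(steps.columns.tolist())
print(steps.head())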
Then, run run_nli.py with the steps file generated in the previous step (in the example above, dodeca_inconsistent_out.steps.csv). For example:
python run_nli.py \
--infile dodeca_inconsistent_out.steps.csv \
--outfile dodeca_inconsistent_scores.csv \
--task span_comparison
The output will be a csv file containing the Q² scores, with and without the NLI-based comparison.
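As a quick sanity check, you can aggregate these per-example scores yourself. The sketch below assumes the score columns are named 'Q2' and 'Q2_no_nli', matching the metrics_names values used later in this README; verify against the actual column names in your output:

import pandas as pd

scores = pd.read_csv("dodeca_inconsistent_scores.csv")

# Assumed column names (Q² with and without the NLI-based comparison); adjust if they differ.
for col in ["Q2", "Q2_no_nli"]:
    if col in scores.columns:
        print(col, scores[col].mean())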
The response-level evaluation includes measuring the Precision/Recall tradeoff of inconsistency and consistency detection at a single-example level, using various thresholds.
To plot the Precision-Recall curve vs. various thresholds, run precision_recall.py and specify the parameters.
python precision_recall.py \
--incons_dodeca_f dodeca_inconsistent_scores.csv \
--cons_dodeca_f dodeca_consistent_scores.csv \
--incons_memnet_f memnet_inconsistent_scores.csv \
--cons_memnet_f memnet_consistent_scores.csv \
--metrics_names 'Q2' 'Q2_no_nli'
Each input file should be obtained by running pipeline/run_pipeline.py followed by run_nli.py, as explained under Usage.
The output files include two plots for each input metric: grounded and ungrounded Precision and Recall vs. various thresholds. If more than one metric is provided as input, the output includes two additional plots comparing the grounded and ungrounded Precision-Recall trade-off across all input metrics. In addition to the plots, the script prints the accuracy at a given threshold, as well as the grounded and ungrounded Precision and Recall.
metrics_names should be one or more space-separated names of the tested metrics.
For the specific threshold computation, use the thresholds flag, which takes one or more space-separated threshold values, one for each specified metric name. If no thresholds are specified, a default threshold of 0.5 is used.
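To make the role of the threshold concrete, here is a minimal sketch (not the repository's precision_recall.py) showing how a single threshold turns Q² scores into binary consistency decisions, using scikit-learn and the assumed 'Q2' column name from above:

import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Scores for the manually-labelled consistent and inconsistent responses of one system.
cons = pd.read_csv("dodeca_consistent_scores.csv")["Q2"]
incons = pd.read_csv("dodeca_inconsistent_scores.csv")["Q2"]

y_true = [1] * len(cons) + [0] * len(incons)   # 1 = consistent, 0 = inconsistent
scores = pd.concat([cons, incons])

threshold = 0.5                                # the default threshold mentioned above
y_pred = (scores >= threshold).astype(int)     # predict "consistent" when Q² clears the threshold

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # precision of consistency detection
print("recall:   ", recall_score(y_true, y_pred))      # recall of consistency detection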
To add baseline methods to the Precision-Recall computation, specify the add_baselines flag. Specifying the flag adds all baselines other than the end-to-end NLI. To add end-to-end NLI, see the End-to-end NLI section.
To compare new metrics to Q², add a column containing the new metric's scores to each of the above csv files, and add the name of this column to the names passed via the metrics_names flag. Note that scores should be normalized to [0,1].
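For example, a hedged sketch of adding such a column; the metric itself (a simple token-overlap placeholder) and the assumption that the scores file is row-aligned with the original data file are illustrative, not part of Q²:

import pandas as pd

def my_metric(knowledge, response):
    # Placeholder metric: token overlap between response and knowledge, already in [0, 1].
    k, r = set(str(knowledge).lower().split()), set(str(response).lower().split())
    return len(k & r) / max(len(r), 1)

data = pd.read_csv("third_party/data/dodeca_inconsistent.csv")   # original annotated responses
scores = pd.read_csv("dodeca_inconsistent_scores.csv")           # assumed row-aligned with the data file

raw = pd.Series([my_metric(k, r) for k, r in zip(data["knowledge"], data["response"])])

# Min-max normalization to [0, 1], as required; then pass 'my_metric' via --metrics_names.
scores["my_metric"] = (raw - raw.min()) / (raw.max() - raw.min())
scores.to_csv("dodeca_inconsistent_scores.csv", index=False)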
For system-level evaluation, first run pipeline/prep_sys_experiment.py and specify the parameters. The infile should be the file containing the extended annotations for both the dodeca and memnet systems.
python pipeline/prep_sys_experiment.py \
--infile third_party/data/cross_annotation.csv \
--outfile cross_annotation_out
This will create two output files, one for each system: cross_annotation_out_dodeca.csv and cross_annotation_out_memnet.csv.
Then, run run_nli.py for each of the two files, using the for_systems_simulation flag. For example:
python run_nli.py \
--infile cross_annotation_out_dodeca.csv \
--outfile cross_annotation_dodeca_scores.csv \
--task span_comparison \
--for_systems_simulation
Finally, run system_level.py with the two files generated in the previous step:
python system_level.py \
--dodeca_path cross_annotation_dodeca_scores.csv \
--memnet_path cross_annotation_memnet_scores.csv \
--metrics_names 'Q2' 'Q2_no_nli'
To add baseline methods to the Precision-Recall computation, specify the add_baselines flag.
As in the response-level evaluation, to compare new metrics to Q², add a column containing the new metric's scores to each of the above csv files, and add the name of this column to the names passed via the metrics_names flag.
In the end-to-end NLI method, the knowledge is used as the hypothesis and the response as the premise. To add the end-to-end NLI method, run
python run_nli.py \
--infile dodeca_inconsistent_scores.csv \
--outfile dodeca_inconsistent_with_e2e_baseline.csv \
--task e2e
Repeat the same command for the other data files. The output file will be the input file with an additional column containing the end-to-end NLI-based factual consistency scores.
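For reference, a rough sketch of the premise/hypothesis direction described above, using an AllenNLP textual-entailment predictor. The model archive path, the registering import, and the label ordering in the output are assumptions and may differ across allennlp-models versions; this is not the repository's implementation:

from allennlp.predictors.predictor import Predictor
import allennlp_models.pair_classification  # assumption: registers the textual-entailment models/predictors

# Assumption: path or URL of any AllenNLP textual-entailment model archive.
predictor = Predictor.from_path("path/to/textual_entailment_model.tar.gz")

# Example response/knowledge pair taken from the data preview above.
response = "it aired in the us , canada and latin america"
knowledge = "The show aired from 1983 to 1994 on PBS."

# Response as premise, knowledge as hypothesis, as described above.
result = predictor.predict(premise=response, hypothesis=knowledge)
print(result["label_probs"])  # assumption: class probabilities (e.g., entailment/contradiction/neutral); check your model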
@inproceedings{honovich-etal-2021-evaluating,
title = "Q²: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering",
author = "Honovich, Or and
Choshen, Leshem and
Aharoni, Roee and
Neeman, Ella and
Szpektor, Idan and
Abend, Omri",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2021",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2104.08202",
}