This repository contains code for running hallucination detection from the following paper:
Detecting Hallucinated Content in Conditional Neural Sequence Generation
Chunting Zhou, Graham Neubig, Jiatao Gu, Mona Diab, Paco Guzman, Luke Zettlemoyer, Marjan Ghazvininejad
Findings of ACL 2021
- This repo is based on fairseq (tag v0.9.0); please follow the instructions in the fairseq repo for the required apex, Python, and PyTorch versions.
Under your anaconda environment, please install fairseq from source locally with:
```
python setup.py build_ext --inplace
```
Below we explain how to train a hallucination detection model on your own bi-text dataset and make predictions with it.
We used the large multi-domain dataset collected by Wang et al. (2020), which covers four domains (law, news, patent, TV subtitles). Since it includes data from LDC, we cannot release it.
We provide two benchmark datasets, for MT and summarization (XSum) respectively, in this repo (`./eval_data/`).
We train two MT systems (a standard Transformer and a finetuned MBART) on the simulated low-resource (patent-domain) training data, and evaluate on the patent domain.
We asked bilingual speakers to annotate, at the token level, whether the machine translations contain hallucinations on 150 sentences from the patent test set.
Under `./eval_data/mt/`, `*.source` are raw source sentences, `*.target` are model outputs, `*.ref` are references, and `*.label` are the annotated labels of `*.target`, where `1` indicates a hallucinated word and `0` indicates a faithful translation word. `./eval_data/mt/trans2s.*` are annotations for the standard Transformer outputs, and `./eval_data/mt/mbart.*` are annotations for the finetuned MBART outputs.
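As a quick sanity check, the following minimal Python sketch (not part of the repo) pairs each MBART output token with its annotated label. It assumes whitespace tokenization and one space-separated 0/1 sequence per line in the `*.label` file, aligned with the corresponding line of `*.target`:

```python
# Minimal sketch: pair each MBART output token with its hallucination label.
# Assumption: labels are space-separated 0/1 flags, one per whitespace token of the target line.
with open("eval_data/mt/mbart.target") as ftgt, open("eval_data/mt/mbart.label") as flab:
    for target_line, label_line in zip(ftgt, flab):
        tokens = target_line.split()
        labels = [int(x) for x in label_line.split()]
        assert len(tokens) == len(labels), "each target token should have exactly one label"
        hallucinated = [tok for tok, lab in zip(tokens, labels) if lab == 1]
        print(hallucinated)
```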
We processed the annotations released by Google by aggregating, for each word, the labels from 3 annotators with majority voting. The aggregated results for four models (BERTSeq2Seq, Pointer-Generator, Topic-Aware Convolutional Network, and standard Transformer Seq2Seq) are under `./eval_data/xsum/`.
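For reference, the aggregation described above is plain per-word majority voting over the three annotators. Here is a minimal sketch of that idea (the function name and list-of-lists input format are illustrative, not the repo's actual preprocessing code):

```python
def majority_vote(annotations):
    """Aggregate per-word 0/1 labels from several annotators by majority vote.

    `annotations` is a list of label sequences, one per annotator, all the same
    length (one 0/1 label per word). Returns the aggregated label sequence.
    """
    num_annotators = len(annotations)
    aggregated = []
    for word_labels in zip(*annotations):
        # A word is marked hallucinated (1) if more than half of the annotators said so.
        aggregated.append(1 if sum(word_labels) * 2 > num_annotators else 0)
    return aggregated

# Example with 3 annotators labeling a 4-word summary:
print(majority_vote([[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0]]))  # -> [0, 1, 1, 0]
```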
To train a hallucination detection model on your own bi-text dataset, the first step is to create the synthetic labeled data. This is decomposed into the following two sub-steps.
- Generate synthetic target data with BART: you can tune the hyperparameters for generating noised data at the top of `./util_scripts/run_gen_synthetic_data_with_bart.sh`; this set of noise hyperparameters will be used to name the output, namely `config`. Please first download the BART (for English, here) or MBART (for other languages, here; we noticed that the MBART model released in fairseq is broken) model, then specify the paths to the model and the BPE dictionary in Line 33-45 of `./util_scripts/gen_bart_batch.py`, and run the following command:

  ```
  bash ./util_scripts/run_gen_synthetic_data_with_bart.sh path/to/the/target/file path/to/the/valid/file
  ```

  e.g.,

  ```
  bash util_scripts/run_gen_synthetic_data_with_bart.sh toy_data/train.en toy_data/valid.en
  ```

  With the default setting, the noise `config=mask_0.0_0.6_random_0.0_0.3_insert_0.2_wholeword_0`. After this, a new directory `bart_gen` is created under the directory of your input, and you will see the output under `bart_gen`.
- Create pseudo labels and binarize datasets: the example scripts `./util_scripts/make_synthetic_data_mt.sh` and `./util_scripts/make_synthetic_data_xsum.sh` are used for pseudo-label creation and dataset binarization for machine translation and summarization respectively (see the sketch after this list for an illustration of the pseudo-labeling idea). You need to download the model you will later finetune, together with its dictionaries, before the following steps. To predict hallucinations for a cross-lingual conditional sequence generation task, e.g. MT, you could use XLM-Roberta; to predict hallucinations for a monolingual conditional sequence generation task, e.g. summarization, you could use Roberta. These models come along with their dictionaries and subword models (sentencepiece for XLM-R, and GPT-2 BPE for Roberta). The following is an example processing command when finetuning the XLM-R model:

  ```
  bash ./util_scripts/make_synthetic_data_mt.sh config directory/of/target/data path/to/sentencepiece/model path/to/dictionary
  ```

  e.g.,

  ```
  bash util_scripts/make_synthetic_data_mt.sh mask_0.0_0.6_random_0.0_0.3_insert_0.2_wholeword_0 toy_data path/to/xlmr.large/sentencepiece.bpe.model path/to/xlmr.large/dict.txt
  ```

  Similarly, you can run the Roberta version with the example script `./util_scripts/make_synthetic_data_xsum.sh`. Please see the scripts for more details. After this step, you will see the binarized datasets with source, target, reference, and labels under a new directory `data` inside `directory/of/target/data`.
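To give intuition for what the pseudo labels encode, here is a minimal, simplified sketch of one way to derive them: align the BART-noised sentence to the original target with an edit-distance (LCS-style) alignment and mark tokens with no counterpart in the original as hallucinated (`1`). This illustrates the general idea only; it is not the exact implementation inside `make_synthetic_data_mt.sh`:

```python
from difflib import SequenceMatcher

def pseudo_labels(noised_tokens, original_tokens):
    """Label each noised token 1 (hallucinated) if it is not matched to the
    original target under a longest-common-subsequence style alignment."""
    labels = [1] * len(noised_tokens)
    matcher = SequenceMatcher(a=noised_tokens, b=original_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = 0  # matched tokens are considered faithful
    return labels

noised = "the patent covers a purple chemical process".split()
original = "the patent covers a chemical process".split()
print(pseudo_labels(noised, original))  # -> [0, 0, 0, 0, 1, 0, 0]
```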
You can finetune XLM-R or Roberta on the binarized data created above. We provide batch scripts to run this for MT and abstractive summarization respectively:

```
sbatch ./train_exps/example_finetune_mt.sh path/to/the/binarized/data
```

or

```
sbatch ./train_exps/example_finetune_xsum.sh path/to/the/binarized/data
```

You may want to tune the hyperparameters inside the scripts for better performance, such as `--dropout-ref` (dropping out reference words to prevent the model from learning edit distance), `--max-update`, etc.
We provide evaluation scripts for the benchmark datasets under `./eval_data`. To evaluate on these datasets, use the Python scripts `./util_scripts/eval_predict_hallucination_mt.py` and `./util_scripts/eval_predict_hallucination_xsum.py` for MT and summarization respectively (they differ only slightly). First, specify the path to the saved detection model directory and the training data path in Line 12-13, then run them.

You can download our trained models for these benchmark datasets (zhen-MT and XSum) and evaluate them with the above scripts by first setting `models` to `['path/to/the/unzipped/folder']` and `datapath` to the folder of data inside the unzipped file.
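For example, after unzipping a downloaded model, those two settings might look roughly like this (the paths are placeholders, and the `data` subfolder name is an assumption about the archive's layout, so point `datapath` at wherever the binarized data actually sits):

```python
# Line 12-13 of ./util_scripts/eval_predict_hallucination_*.py (illustrative values):
models = ['path/to/the/unzipped/folder']        # saved detection model directory
datapath = 'path/to/the/unzipped/folder/data'   # folder of binarized data inside the unzipped file
```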
To simply use a trained model to predict hallucinations for your own input, we provide an example script `./util_scripts/predict_hallucination_mt.py` that predicts labels for a hypothesis file conditioned on its source file. Again, please specify the paths to your input files, the trained model, the training data, and the output directory in Line 12-23, and then run it.
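Once the script has run, the predicted labels can be consumed like the benchmark annotations above. The snippet below is purely illustrative: it assumes the predictions end up as one space-separated 0/1 sequence per hypothesis line (the output file name here is made up, so check the script for the actual files it writes):

```python
# Illustration only: the predicted-label file name and format are assumptions;
# see predict_hallucination_mt.py for the actual output it produces.
with open("path/to/your/hypothesis/file") as fh, open("output_dir/predicted.label") as fl:
    for hypo, labels in zip(fh, fl):
        marked = [
            f"[{tok}]" if lab == "1" else tok  # bracket tokens predicted as hallucinated
            for tok, lab in zip(hypo.split(), labels.split())
        ]
        print(" ".join(marked))
```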
The directory `word_level_qe/` contains scripts for both supervised and unsupervised experiments on word-level quality estimation from the WMT18 shared task (Task 2 of QE).
```
@inproceedings{zhou21aclfindings,
    title = {Detecting Hallucinated Content in Conditional Neural Sequence Generation},
    author = {Chunting Zhou and Graham Neubig and Jiatao Gu and Mona Diab and Francisco Guzmán and Luke Zettlemoyer and Marjan Ghazvininejad},
    booktitle = {Findings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP Findings)},
    address = {Virtual},
    month = {August},
    url = {https://arxiv.org/abs/2011.02593},
    year = {2021}
}
```