Authors: Liam van der Poel, Ryan Cotterell, Clara Meister
- To install fairseq on the Euler server
- Experiments with the batch system
- Preparing the XSum dataset
- Train a summarization model
- Evaluate a summarization model
- Train a language model
- Evaluate a language model
- Hyperparameter selection
- Modifying fairseq
- XSum Hallucination Annotations
- Entropy threshold MMI decoding
Ensure any previously installed versions of fairseq are removed with python -m pip uninstall fairseq. Then fork the repository. The instructions below are based on the general fairseq instructions and the Euler-specific instructions.
git clone https://github.com/VanderpoelLiam/fairseq
module load gcc/6.3.0 python_gpu/3.8.5 hdf5 eth_proxy
python -m venv fair_env
source fair_env/bin/activate
cd fairseq
PYTHONPATH=$(which python)
$PYTHONPATH -m pip install --upgrade pip
$PYTHONPATH -m pip install --editable ./
Each time you log in to the server you need to run:
module load gcc/6.3.0 python_gpu/3.8.5 hdf5 eth_proxy
source fair_env/bin/activate
PYTHONPATH=$(which python)
I also tried to install apex, but it failed because the CUDA versions did not match:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
--global-option="--deprecated_fused_adam" --global-option="--xentropy" \
--global-option="--fast_multihead_attn" ./
To run Euler in interactive mode with sufficient specs for most tasks:
bsub -I -n 2 -R "rusage[mem=2048]" -R "rusage[ngpus_excl_p=1]"
Download the dataset of 237,018 articles:
wget http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz
Then extract the bbc-summary-data folder:
tar -xzf XSUM-EMNLP18-Summary-Data-Original.tar.gz
See here and here for more information.
wget https://github.com/EdinburghNLP/XSum/blob/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json
Ensure that the Moses tokeniser is installed with:
git clone https://github.com/moses-smt/mosesdecoder.git
See here for dependencies.
Follow the instructions here to install fastBPE.
Based on files2rouge, do the following:
$PYTHONPATH -m pip install -U git+https://github.com/pltrdy/pyrouge
git clone https://github.com/pltrdy/files2rouge.git
cd files2rouge
$PYTHONPATH setup_rouge.py
$PYTHONPATH setup.py install
Run python src/preprocess.py with the desired arguments to preprocess the data. See python src/preprocess.py -h for more information on the possible datasets. Sample usage is python src/preprocess.py --all to run the preprocessing for all the datasets. The resulting data is stored in the data directory.
The sample dataset consists of 4/2/2 examples in train/valid/test respectively. It is useful for debugging, as it avoids working with the full XSum dataset.
Log into the server with sftp. Ensure the local and remote directories are the same, then run put -r . to move over all files. Do this for the following folders in the data directory: xsum-summarizer, xsum-lang and xsum-lang-full.
Based on this.
Use the src/train_command.py script to generate the fairseq-train commands and copy them to the clipboard. Paste the result on Euler to run. Basic usage is python src/train_command.py 1 to run experiment 1 and log to logs/experiments/train_1.log. Run python src/train_command.py -h for more information on the other parameters.
An alternative approach using a pre-trained BART model is explained here and here.
To download the pretrained model:
cd checkpoints/summarization_model/
wget https://dl.fbaipublicfiles.com/fairseq/models/bart.large.xsum.tar.gz
tar -xzvf bart.large.xsum.tar.gz
rm bart.large.xsum.tar.gz
cd ../..
Ensure the BART model preprocessing was run:
python src/preprocessing/preprocess.py --bart --full --sample
To generate with the pretrained model:
bsub -J generate_bart \
-o logs/experiments/generate_bart.log \
-W 60 -n 4 -R "rusage[mem=2048]" \
-R "rusage[ngpus_excl_p=1]" \
fairseq-generate \
data/xsum-summarizer-bart \
--path checkpoints/summarization_model/bart.large.xsum/model.pt \
--batch-size 8 --beam 5 \
--truncate-source \
--skip-invalid-size-inputs-valid-test \
--bpe gpt2
Similarly, use the src/generate_command.py script to generate the fairseq-generate commands. Basic usage is python src/generate_command.py 1 --wait_train to run experiment 1 and log to logs/experiments/generate_1.log. Run python src/generate_command.py -h for more information on the other parameters.
Running src/score_generate.sh 1 extracts the target/hypothesis sentences from logs/experiments/generate_1.log to the logs/rouge/generate_1 directory and removes tokenization and BPE. The full ROUGE scores are saved to the logs/rouge/generate_1/score file and the F1 scores are output to the terminal.
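For reference, the extraction step can be sketched as follows. This is a rough illustration rather than the actual score_generate.sh, assuming the standard fairseq-generate log format (T-<id> and H-<id> lines, as shown in the sample outputs later in this document) and "@@ "-style BPE markers; the output file names are hypothetical.

```python
import sys

def extract(log_path, target_out, hypo_out):
    """Pull reference/hypothesis sentences out of a fairseq-generate log and strip BPE."""
    targets, hypos = {}, {}
    with open(log_path) as f:
        for line in f:
            if line.startswith("T-"):
                tag, sent = line.rstrip("\n").split(None, 1)
                targets[int(tag[2:])] = sent
            elif line.startswith("H-"):
                tag, _score, sent = line.rstrip("\n").split(None, 2)
                hypos[int(tag[2:])] = sent
    with open(target_out, "w") as t, open(hypo_out, "w") as h:
        for i in sorted(targets.keys() & hypos.keys()):
            # Undo BPE: "Mex@@ ic@@ ana" -> "Mexicana"
            print(targets[i].replace("@@ ", ""), file=t)
            print(hypos[i].replace("@@ ", ""), file=h)

if __name__ == "__main__":
    extract(sys.argv[1], "target.txt", "hypo.txt")
```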
Based on this.
Use the src/train_language_command.py script to generate the fairseq-train commands and copy them to the clipboard. Paste the result on Euler to run. Sample usage is:
python src/train_language_command.py \
--dataset standard \
--full \
--update_freq 32 \
--lr 0.0002 \
--restore
See src/train_language_command.py -h for more details on the parameters.
Similarly, use the src/eval_language_command.py script to generate the language model evaluation commands. Sample usage is:
python src/eval_language_command.py \
--dataset standard \
--full \
--update_freq 32 \
--lr 0.0002 \
--wait
See src/eval_language_command.py -h for more details on the parameters.
$\lambda$ should take a value in $[0, 1]$. I found the model would generate nonsense for values close to 1, so my prior is shifted towards zero. I therefore pick 10 values on a log scale in $(0, 1)$ (i.e. $2^{-1}, 2^{-2}, \ldots, 2^{-9}, 2^{-10}$).
$\gamma$ should take a value in $[0, N_t]$, where $N_t$ is the length of hypothesis $t$. But this would lead to a dependence on $t$. Instead I estimate $M$, the average value of $N_t$ over all hypotheses generated by our summarizer (counting tokens, not words), and pick 10 values evenly spaced in $(0, M)$. $M$ was computed to be $26.90$, so we select $\gamma$ from $[2, 4, 7, 9, 12, 14, 17, 19, 22, 24]$.
$\mu$ scales the influence of $\log(N_t)$, which is on average $\log(26.90) = 3.292$. Our prior is that all terms should be around the same order of magnitude. The average value of $\log p(y \mid x)$ is $-1.102$, so picking $\mu$ to be one of $(2^{-1}, 2^{-2}, \ldots, 2^{-9}, 2^{-10})$ feels appropriate.
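As an illustration of how these candidate grids could be produced, here is a minimal sketch assuming numpy (not the project's own scripts); truncating the evenly spaced values to integers reproduces the $\gamma$ list above.

```python
import numpy as np

# lambda (and mu) candidates: 10 values on a log scale in (0, 1), i.e. 2^-1 ... 2^-10
lambdas = [2.0 ** -i for i in range(1, 11)]
mus = list(lambdas)

# gamma candidates: 10 values evenly spaced in (0, M), where M = 26.90 is the average
# hypothesis length in tokens; drop the endpoints 0 and M and truncate to integers.
M = 26.90
gammas = [int(g) for g in np.linspace(0, M, 12)[1:-1]]

print(gammas)  # [2, 4, 7, 9, 12, 14, 17, 19, 22, 24]
```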
ROUGE-1 RL scores are used as the validation metric.
Use the src/lambda_gridsearch_command.py script to generate the batch command, in the same way as for train and generate. For example, python src/lambda_gridsearch_command.py 8 --lang_full runs model number 8 with the full language model. See python src/lambda_gridsearch_command.py -h for help.
The logs are saved to logs/hyperparameters/lang_full/8. In the log directory, we generate batch.log, which contains the batch command logs. Then, for each lambda value x, we create two files: x.log, the output of fairseq-generate, and score_x, the corresponding ROUGE scores.
By default we generate on the test set with fairseq-generate. However, for hyperparameter selection we need to generate on the validation set. This is done by adding the parameter --gen-subset "valid" to fairseq-generate.
I created a new branch called liam to make modifications to fairseq. I use the lstm and lstm_lm architectures as well as the xsum-summarizer-samples and xsum-lang-samples datasets. The architectures and datasets are chosen to be fast to train and generate with. We also create the test-project directory with the following structure:
test-project
├── checkpoints
│ ├── summarization_dummy/
│ └── lang_dummy/
└── data
├── xsum-summarizer-samples/[...]
└── xsum-lang-samples/[...]
where the xsum-* directories contain all the data copied over from master-thesis/.
Then to initialize the models, we train them for one iteration. For the summarization model run:
bsub -I -W 10 -n 4 -R "rusage[mem=4096]" \
fairseq-train data/xsum-summarizer-samples \
--arch lstm \
--save-dir checkpoints/summarization_dummy \
--optimizer adam --lr 0.005 --lr-shrink 0.5 \
--max-tokens 4096 \
--max-update 1 \
--max-epoch 1
and for the language model run:
bsub -I -W 10 -n 4 -R "rusage[mem=4096]" \
fairseq-train data/xsum-lang-samples \
--task language_modeling \
--arch lstm_lm \
--save-dir checkpoints/lang_dummy \
--optimizer adam --lr 0.005 --lr-shrink 0.5 \
--max-tokens 4096 \
--max-update 1 \
--max-epoch 1
Then, to generate with $\lambda = 1$, run:
bsub -I -W 10 -n 4 -R "rusage[mem=4096]" -R "rusage[ngpus_excl_p=1]" \
fairseq-generate \
data/xsum-summarizer-samples \
--gen-subset "valid" \
--path checkpoints/summarization_dummy/checkpoint_best.pt \
--batch-size 16 --beam 5 --truncate-source \
--skip-invalid-size-inputs-valid-test \
--lm-path checkpoints/lang_dummy/checkpoint_best.pt \
--lm-weight -1
With an unmodified fairseq-generate, this produces for a single article an output of the form (I add [...] for brevity):
S-1 Transport Minister Juan Mol@@ in@@ ar said [...]
T-1 One of Mexico 's biggest airlines , Mex@@ ic@@ ana de [...]
H-1 5.0183587074279785 roadside Venezuel@@ released Venezuel@@ [...]
D-1 5.0183587074279785 roadside Venezuel@@ released Venezuel@@ [...]
P-1 2.4488 2.6601 2.9603 3.0198 3.1490 3.3097 3.4029 3.4311 [...]
The MMI decoding objective has the form $\log p(y \mid x) - \lambda \log p(y)$. P-1 is an array where the $i$'th entry P-1[i] corresponds to $\log p(y_i \mid x, y_{<i}) - \lambda \log p(y_i \mid y_{<i})$. The sum of these two probability distributions is then normalized, hence it is not the case that P-1 = P_SM-1 + $\lambda$ P_LM-1.
Our modifications produce two additional arrays, P_SM and P_LM, which correspond to $\log p(y \mid x)$ and $\log p(y)$ respectively. Therefore, P_SM-1[i] corresponds to $\log p(y_i \mid x, y_{<i})$ and P_LM-1[i] corresponds to $\log p(y_i \mid y_{<i})$.
This looks like:
[...]
P-1 2.4488 2.6601 2.9603 3.0198 3.1490 3.3097 3.4029 3.4311 [...]
P_SM-1 -8.3416 -8.3461 -7.9928 -7.9528 -7.8088 -7.7310 -7.6555 [...]
P_LM-1 -10.7904 -11.0063 -10.9531 -10.9726 -10.9579 -11.0407 [...]
As a sanity check, we can see that for $i = 0$ we have P-1[0] = P_SM-1[0] - 1 * P_LM-1[0] = -8.3416 - 1 * (-10.7904) = 2.4488.
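This check can be automated over a whole log. The sketch below assumes the modified output format shown above (P-, P_SM- and P_LM- lines sharing a sentence id) and simply reports how far P is from P_SM + lm_weight * P_LM, since, as noted above, exact equality is not guaranteed in general.

```python
def compare_scores(log_path, lm_weight=-1.0):
    """Recompute P_SM + lm_weight * P_LM per position and report how far it is from P."""
    scores = {}  # sentence id -> {"P": [...], "P_SM": [...], "P_LM": [...]}
    with open(log_path) as f:
        for line in f:
            for prefix in ("P_SM-", "P_LM-", "P-"):
                if line.startswith(prefix):
                    tag, values = line.split(None, 1)
                    sent = scores.setdefault(tag[len(prefix):], {})
                    sent[prefix[:-1]] = [float(v) for v in values.split()]
                    break
    for sent_id, s in sorted(scores.items()):
        diffs = [abs(p - (p_sm + lm_weight * p_lm))
                 for p, p_sm, p_lm in zip(s["P"], s["P_SM"], s["P_LM"])]
        print(f"sentence {sent_id}: max |P - (P_SM + lm_weight * P_LM)| = {max(diffs):.4f}")
```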
The sequence_scorer.py and generate.py files in fairseq were modified to compute token-level entropy values on the reference summaries. Running:
fairseq-generate \
data/xsum-summarizer-samples \
--gen-subset "valid" \
--path checkpoints/summarization_model/11/checkpoint_best.pt \
--batch-size 16 --beam 5 \
--score-reference \
--truncate-source \
--lm-path checkpoints/lang_full/checkpoint_best.pt \
--lm-weight -1
should result in an output of the form:
[...]
T-1 One of Mexico 's biggest airlines , Mex@@ ic@@ ana de Avi@@ ac@@ [...]
H-1 -13.054492950439453 One of Mexico 's biggest airlines , Mex@@ [...]
P-1 -16.4506 -1.5589 -8.1095 -1.8594 -8.5545 -13.3892 -7.5062 -17.5240 [...]
P_SM-1 -6.4942 -0.1495 -0.7367 -0.0593 -1.9511 -1.6154 -1.9322 -4.4989 [...]
P_LM-1 -11.6483 -19.5079 -18.5996 -14.2160 -12.6121 -10.2308 -10.8821 [...]
ENT_LANG-1 3.3440 1.9895 0.0965 5.3574 6.1091 2.9450 7.1573 0.2616 [...]
ENT-1 5.9865 1.0240 2.1275 0.4982 3.9551 3.9361 2.8885 7.6781 2.3818 [...]
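For reference, the ENT and ENT_LANG lines report token-level entropies of the next-token distributions (ENT_LANG presumably under the language model). A minimal sketch of the quantity being computed, assuming log-probabilities in natural-log units (this is not the modified fairseq code itself):

```python
import torch

def token_entropies(log_probs: torch.Tensor) -> torch.Tensor:
    """log_probs: (seq_len, vocab_size) log-probabilities over the vocabulary at each position.
    Returns the Shannon entropy -sum_v p(v) * log p(v) for every position."""
    return -(log_probs.exp() * log_probs).sum(dim=-1)
```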
For the generated summaries, running:
fairseq-generate \
data/xsum-summarizer-samples \
--gen-subset "valid" \
--path checkpoints/summarization_model/11/checkpoint_best.pt \
--batch-size 16 --beam 5 \
--truncate-source \
--lm-path checkpoints/lang_full/checkpoint_best.pt \
--lm-weight -1
should result in an output of the form:
[...]
H-1 16.198698043823242 an@@ L@@ El@@ P@@ de ana airline Mex@@ W@@ Dor@@[...]
P-1 -42.1291 -14.4789 -19.7052 -12.0393 -25.7978 -12.9992 -19.6963 [...]
P_SM-1 -12.4465 -10.1111 -8.1167 -9.6915 -11.2029 -8.1613 -9.7413 [...]
P_LM-1 -2.0259 -19.8412 -19.6330 -20.9502 -4.7277 -13.5809 -12.2417 [...]
ENT_LANG-1 6.1471 8.4263 7.7026 8.7397 4.8474 4.7859 4.6337 0.8963 [...]
ENT-1 8.0262 7.6571 6.7409 7.1623 5.3124 3.4494 7.7405 6.1100 6.6523 [...]
At each position in the sequence we select a particular token to output, not necessarily the token with the highest probability. The ranking of a token is the ranking of its probability with respect to the distribution at our current position. E.g. if we pick the 3rd most probable token then it has rank 3.
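To make the definition concrete, here is a minimal sketch (not the fairseq implementation) of the rank of a chosen token under a vector of log-probabilities:

```python
import torch

def token_rank(log_probs: torch.Tensor, chosen: int) -> int:
    """Rank of the chosen token under a (vocab_size,) vector of log-probabilities.
    Rank 1 means the chosen token was the most probable one at this position."""
    # Count how many tokens have strictly higher log-probability than the chosen one.
    return int((log_probs > log_probs[chosen]).sum().item()) + 1
```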
We modify the sequence_scorer.py and generate.py files in fairseq to compute token-level rankings:
fairseq-generate \
data/xsum-summarizer-samples \
--gen-subset "valid" \
--path checkpoints/summarization_model/standard/checkpoint_best.pt \
--batch-size 16 --beam 5 \
--score-reference \
--truncate-source
The output should now contain a line of the form:
[...]
T-1 One of Mexico 's biggest airlines , Mex@@ ic@@ ana de Avi@@ ac@@ [...]
H-1 -3.9946937561035156 One of Mexico 's biggest airlines , Mex@@ [...]
P-1 -6.4942 -0.1495 -0.7367 -0.0593 -1.9511 -1.6154 -1.9322 -4.4989 [...]
RANK-1 7882 3780 13751 111 3154 6732 38 9051 13328 2859 10583 6880 17216 [...]
As a sanity check, we shift the tokens by 1 in either direction to see that this leads to lower probabilities/worse rankings. For example, with the sequence [BOS, I, went, home, EOS] we calculate the probability/ranking of went using the 1st distribution over tokens. This sanity check looks at what happens when we use the 0th distribution and the 2nd distribution instead.
Shifting tokens by +1:
[...]
P-1 -21.2402 -16.3707 -11.4401 -16.5040 -18.6815 -8.1419 -13.4884 [...]
RANK-1 25465 807 21394 4734 210 15289 3960 7574 30588 3537 170 47495 [...]
Shifting tokens by -1:
P-1 -15.8094 -15.9375 -20.9515 -9.2937 -9.1373 -12.4901 -12.4808 -6.1468 [...]
RANK-1 307 578 30343 472 8405 42258 12578 29894 5758 26202 34352 26376 [...]
If we consider the token Mexico:

| token shift | -1 | 0 | +1 |
|---|---|---|---|
| Log probability | -20.9515 | -0.7367 | -11.4401 |
| Ranking | 30343 | 13751 | 21394 |
Based on the statistics for the first hallucinated token in a sequence, we can see that high entropy is correlated with the start of a hallucination. So the idea is to modify fairseq so that entropy above a threshold triggers MMI decoding.
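A minimal sketch of the intended scoring rule (an illustration of the idea, not the actual fairseq modification): the language-model term is only added at positions where the summarization model's entropy exceeds the threshold.

```python
import torch

def thresholded_mmi_scores(sm_log_probs, lm_log_probs, lm_weight, ent_threshold):
    """sm_log_probs, lm_log_probs: (vocab_size,) log-probabilities at the current position.
    Returns the scores used to rank candidate tokens at this position."""
    entropy = -(sm_log_probs.exp() * sm_log_probs).sum()
    if entropy > ent_threshold:
        # The summarizer is uncertain here, so MMI decoding is triggered
        # (lm_weight is negative, so the language model term is subtracted).
        return sm_log_probs + lm_weight * lm_log_probs
    return sm_log_probs
```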
For actual generation, we need to remove from the test set the 500 test examples that were used to pick the threshold value, to avoid information leakage. This is the xsum-summarizer-no-500 dataset.
The code snippet below checks that the thresholding behaves as expected, only triggering MMI decoding when the token entropy is higher than the threshold:
fairseq-generate \
data/xsum-summarizer-samples \
--path checkpoints/summarization_model/standard/checkpoint_best.pt \
--batch-size 16 --beam 5 \
--gen-subset "valid" \
--truncate-source \
--log-format none \
--lm-path checkpoints/language_model/standard/checkpoint_best.pt \
--lm-weight -100 \
--ent-threshold 4
Data from Maynez et al. here.
Zhou et al. already postprocessed the data from Maynez et al. here. This data is stored under data/xsum-hallucination-raw/:
data
└── xsum-hallucination
├── Gold.docid
├── Gold.label
├── Gold.ref
├── Gold.source
└── Gold.target
The labels in Gold.label are for the summaries in Gold.target. However, we want labels for the Gold.ref summaries after tokenisation and BPE have been applied. This is done as follows (the base directory for the scripts is src/hallucination_labelling/):
- Create data/Xsum-hallucination-split.json containing all the ids from data/xsum-hallucination-raw/Gold.docid in the test split and nothing in the train/val splits.
- Run the preprocessing on this split to get all the test.* files and the two dict.* files in data/xsum-hallucination/. We want labels for the sentences in test.bpe.target.
- Next, run align_labels.py to get the labels for most of test.bpe.target and save them to test.label. These labels are extracted from Gold.label and aligned with Gold.target (a sketch of this word-to-BPE label copying is given after this list).
- Missing labels are indicated by a ? and these cases are processed manually with a helper in the same script.
- I found that cases where 1 0 1 appeared were often mistakes, so I also process these cases manually.
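A sketch of the word-to-BPE label copying mentioned above, assuming one 0/1 label per whitespace token of Gold.target and "@@" continuation markers in test.bpe.target; the real align_labels.py handles more cases, which is why some positions still end up as ?.

```python
def align_bpe_labels(word_labels, bpe_tokens):
    """word_labels: one hallucination label per word of the original summary.
    bpe_tokens: the same summary as BPE tokens, with '@@' marking word continuations.
    Returns one label per BPE token, copying each word's label to all of its pieces."""
    bpe_labels, word_idx = [], 0
    for tok in bpe_tokens:
        bpe_labels.append(word_labels[word_idx] if word_idx < len(word_labels) else "?")
        if not tok.endswith("@@"):  # last piece of the current word
            word_idx += 1
    if word_idx != len(word_labels):
        # The word counts do not line up, so the alignment is unreliable.
        bpe_labels = ["?"] * len(bpe_tokens)
    return bpe_labels
```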
For the BART dataset in data/xsum-hallucination-bart, the issue is that test.bpe.target is encoded and hence looks like:
3198 2776 837 838 11514 3869 28057 764 [...]
So we need to decode these numbers using the encoding in data/bart.encoder.json. This is done by running the following in Python:
import json
import os

# Load the BART BPE encoder (token string -> id) and build lookup lists.
with open('data/bpe/bart/encoder.json') as f:
    encoder = json.load(f)
encoder_keys = list(encoder.keys())
encoder_values = list(encoder.values())

base_dir = "data/xsum-hallucination-bart/"
filename = base_dir + "test.bpe.target"

# Read the encoded file, then keep a backup of it under a new name.
with open(filename) as f:
    lines = [line.rstrip() for line in f]
new_filename = base_dir + "test.bpe.target.encoded"
os.rename(filename, new_filename)

# Map each id back to its token string and strip the GPT-2 space marker "\u0120".
new_lines = []
for line in lines:
    tokens = list(map(int, line.split()))
    inds = list(map(encoder_values.index, tokens))
    decoded_toks = [encoder_keys[i] for i in inds]
    cleaned_toks = []
    for tok in decoded_toks:
        if tok == "\u0120":
            cleaned_toks.append(tok)
        else:
            cleaned_toks.append(tok.replace("\u0120", ""))
    new_lines.append(" ".join(cleaned_toks))

# Write the decoded lines back to the original filename.
with open(filename, 'w') as f:
    for new_line in new_lines:
        print(new_line, file=f)
Then follow the same instructions as above for alignment.
For all the following scripts we need to specify which dataset we are working with ([standard, bart]) using the --dataset parameter; by default we use standard.
Run python src/hallucination_labelling/compute_token_level_entropy_command.py and paste the results on Euler to compute the token-level entropy values and log them to logs/experiments/token_level_entropy_standard.
Then run python src/hallucination_labelling/align_data.py in order to extract the entropy scores to test.entropy.sm and test.entropy.lm, and the probability scores to test.prob.sm and test.prob.lm, in the directory data/xsum-hallucination/. This data is now aligned with the test.label hallucination labels in the same directory.
Given the data in test.label, test.entropy.sm and test.entropy.lm in data/xsum-hallucination/, we want to generate statistics on this data. Run python src/hallucination_labelling/entropy_stats.py in order to get statistics and distribution information comparing the entropy values for the various token labellings.
Likewise, we want statistics on test.prob.sm and test.prob.lm. Run src/hallucination_labelling/probability_stats.py in order to get statistics and distribution information.
Run src/hallucination_labelling/token_stats.py to get information on the distribution of tokens with the initial hallucinated label.
We assume that our trained models in checkpoints have the following structure:
checkpoints
├── language_model
│ ├── standard
│ │ └── checkpoint_best.pt
│ └── bart
│ └── checkpoint_best.pt
└── summarization_model
├── standard
│ └── checkpoint_best.pt
└── bart
└── checkpoint_best.pt
The two hyperparameters we want to select are lambda and ent_threshold: the influence of the language model and the entropy threshold for the summarization model at which we trigger MMI decoding, respectively. The goal is to pick the optimal parameter combination such that we minimize the average log probability of Initial Hallucinated tokens (i.e. P = P_SM + lambda * P_LM).
The source directory for the following scripts is src/hallucination_labelling/. The lambda candidates are generated by generate_lambdas.py, and the ent_threshold candidates by generate_thresholds.py. For each parameter combination we do the following:
- Run fairseq-generate as seen in the token-level entropy section to score the references and get the average log probability of Initial Hallucinated tokens.
- Run fairseq-generate but without the --score-reference parameter to generate hypothesis sentences, and compute the ROUGE score.
- This gives two values, a log probability and a ROUGE score. Add this point to a plot. Our goal is to pick parameters that maximize ROUGE and minimize log probability (a small sketch of this trade-off selection follows this list).
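A small sketch of the trade-off selection mentioned above (an illustration, not plot_param_search.py): keep the parameter settings that are not dominated, i.e. for which no other setting has both higher ROUGE and lower average log probability.

```python
def pareto_front(points):
    """points: list of (params, rouge, avg_log_prob) tuples.
    Keep settings for which no other setting is at least as good on both criteria
    and strictly better on one (higher ROUGE, lower average log probability)."""
    front = []
    for params, rouge, logp in points:
        dominated = any(r >= rouge and lp <= logp and (r > rouge or lp < logp)
                        for _, r, lp in points)
        if not dominated:
            front.append((params, rouge, logp))
    return front
```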
The script param_search_command.py, run with and without the --score_ref parameter, generates the commands to run the above fairseq-generate experiments on the server. Then, due to limitations of the server, we have to run separate postprocessing on the raw results on both the server and client side:
- Run process_param_search.py on the server.
- Run process_param_search.py --local locally.
- Run plot_param_search.py to get the resulting plots. (Additional parameters can be provided to param_search_command.py; the same parameters should also be passed to process_param_search.py and plot_param_search.py.)
The test set has the 500 labelled hallucination test examples removed. For a given lambda/threshold parameter pair, run the following (with appropriate additional parameters) to both generate summaries and score the reference summaries:
- src/generate_no_500_command.py lambda threshold
- src/generate_no_500_command.py lambda threshold --score_ref
We use the automatic factuality detection method described in the paper, which is implemented in the associated repository. To install the package, I first fork the repository and then run the following:
git clone https://github.com/VanderpoelLiam/fairseq-detect-hallucination.git
python -m venv detect_hall
source detect_hall/bin/activate
python -m pip install --upgrade pip
cd fairseq-detect-hallucination/
pip install wheel
pip install -U git+https://github.com/ddkang/loss_dropper.git
pip install --editable ./
Then to check everything is correctly installed, we run a hallucination evaluation script using the trained XSum model:
mkdir models
cd models/
<!-- This takes a while -->
wget https://dl.fbaipublicfiles.com/detect-hallucination/xsum.roberta.tar.gz
Then we need to modify lines 12-13 of util_scripts/eval_predict_hallucination_xsum.py
to:
models = ["models/xsum.roberta.tar.gz"]
datapath = "models/xsum.roberta.tar.gz/data"
and can then run the evaluation with:
python util_scripts/eval_predict_hallucination_xsum.py
This should produce the following output:
models/xsum.roberta.tar.gz
Loaded the model!
use ref = 0
TranS2S
Processed 100 lines!
Processed 200 lines!
Processed 300 lines!
Processed 400 lines!
Processed 500 lines!
Percentage of hallucination tokens = 0.5571630588491243, gold = 0.4674208637006418
Sentence-level F1: 0.9594172736732571
Sentence-level hallucination percentage (gold) = 0.924
Spearman-corr by token: 0.32839376258434905
Spearman-corr by probs: 0.327952862600306
0.5642327215931277 0.6725622527344659 0.6136532540609407 0.922 0.32839376258434905 0.327952862600306 0.6041553355814206
use ref = 1
TranS2S
Processed 100 lines!
Processed 200 lines!
Processed 300 lines!
Processed 400 lines!
Processed 500 lines!
Percentage of hallucination tokens = 0.5867507886435331, gold = 0.4674208637006418
Sentence-level F1: 0.9604989604989606
Sentence-level hallucination percentage (gold) = 0.924
Spearman-corr by token: 0.3156002998185347
Spearman-corr by probs: 0.32043934981785943
0.5587690025954765 0.7014195950663253 0.6220204313280364 0.924 0.3156002998185347 0.32043934981785943 0.6028499945610791
We want to obtain hallucination labels for the reference summaries for a particular preprocessing of the xsum-no-500 dataset (e.g. standard or bart). The reference labelling is the same for any lambda/ent_threshold pair; we just need to specify one of the generated standard_ref_no_500_* log files.
Take the example of the bart preprocessed dataset, where we generated bart_ref_no_500_6.5602E-02_3.5987E+00 (i.e. lambda = 6.5602E-02 and ent_threshold = 3.5987E+00). To get the desired labelling label_processed_bart_ref:
- Run src/preprocessing/preprocess --detect_hall to get the source file data/xsum-detect-hall/source. It is the same for both the standard and bart preprocessed datasets.
- Run src/detect_hallucination/detect_hypothesis_hallucinations_command.py 6.5602E-02 3.5987E+00 --score_ref --dataset bart to generate the batch command, then paste this on the server. It is not always possible to process the labels successfully. However, the failure percentage is quite low (TODO%) for this particular example, and failed results are indicated by FAIL at that line in the label_processed_bart_ref file.
The data directory is data/xsum-detect-hall/. The labelling requires 3 files:
- source: the raw source text.
- hypo_bart_6.5602E-02_3.5987E+00: the hypothesis sentences with tokenization and BPE removed.
- hypo_processed_bart_6.5602E-02_3.5987E+00: the hypothesis sentences as produced by the model.
and the resulting file is:
- label_processed_bart_ref: the predicted hallucination labelling of hypo_processed_bart_6.5602E-02_3.5987E+00 (this would be the same for any choice of lambda/ent_threshold, as the reference targets are the same for any parameter pair).
Assume we are working with the standard preprocessed dataset with lambda = 1.3120E-01 and ent_threshold = 3.5618E+00. In previous steps we did the following:
- Determined the optimal hyperparameters (e.g. standard, lambda = 1.3120E-01 and ent_threshold = 3.5618E+00).
- Ran src/generate_no_500_command.py 1.3120E-01 3.5618E+00 and src/generate_no_500_command.py 1.3120E-01 3.5618E+00 --score_ref (with appropriate additional parameters) in order to generate logs/experiments/standard_no_500_1.3120E-01_3.5618E+00 and logs/experiments/standard_no_500_ref_1.3120E-01_3.5618E+00. These log files contain information about token-level entropy, probability and ranking under both the summarization and language models, for generated text and reference summaries respectively.
- Ran the hallucination prediction scripts as described in the previous section to generate label_processed_standard_ref, the token-level hallucination labels for the reference summaries.
What remains is to evaluate the performance of our decoding method compared to the default (lambda = ent_threshold = 0). The metrics we extract are:
- ROUGE scores and BERTScores
- Average log probability by token label
- Average ranking by token label
We are looking to see that our decoding method does not substantially decrease ROUGE scores for the generated text, and that both the average log probability and ranking are lower for initial hallucinated tokens from the reference summaries.
Run python src/evaluate_decoding.py 1.3120E-01 3.5618E+00 (with appropriate additional parameters) to generate the various metrics.
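As an illustration of the "average by token label" metrics, here is a rough sketch (not src/evaluate_decoding.py), assuming hypothetical aligned files with one whitespace-separated sequence of labels and scores per line:

```python
from collections import defaultdict

def average_by_label(label_path, score_path):
    """Average a per-token score (e.g. log probability or rank) separately per token label."""
    totals, counts = defaultdict(float), defaultdict(int)
    with open(label_path) as lf, open(score_path) as sf:
        for label_line, score_line in zip(lf, sf):
            for label, score in zip(label_line.split(), score_line.split()):
                if label in ("?", "FAIL"):
                    continue  # skip unresolved or failed labellings
                totals[label] += float(score)
                counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}
```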
The general instructions for preparing the datasets are given here.
Useful links:
- https://github.com/facebookresearch/fairseq/blob/fcca32258c8e8bcc9f9890bf4714fa2f96b6b3e1/examples/bart/README.summarization.md
- https://github.com/abisee/cnn-dailymail
Download the stories files from here to the data/ directory, under data/cnn and data/dailymail.
We download make_datafiles.py and the url_lists to preprocess this raw data for summarization tasks.
make_datafiles.py was then modified to work with our existing preprocessing code. Then run:
python src/preprocessing/make_datafiles.py data/cnn/stories data/dailymail/stories
We create a smaller version of the same dataset consisting of only 10 samples using:
for SPLIT in train test valid; do
for LANG in source target; do
head -10 data/cnn-dm-summarizer/$SPLIT.$LANG.txt >> data/cnn-dm-summarizer-samples/$SPLIT.$LANG.txt;
done;
done
This gives us the full dataset in data/cnn-dm-summarizer/ and the 10 samples in data/cnn-dm-summarizer-samples/.
As before, run python src/preprocess.py --cnn with the desired additional arguments to preprocess the data. See python src/preprocess.py -h for more information on the possible datasets.
We download the associated BART model finetuned on CNN-DM to checkpoints/summarization_model/bart.large.cnn from here.