This is the official code repository for Improving Factual Consistency in Summarization with Compression-Based Post-Editing by Alexander R. Fabbri, Prafulla Choubey, Jesse Vig, Chien-Sheng Wu, and Caiming Xiong.
We present code and instructions for running inference from the models introduced in the paper.
Post-editing is a model-agnostic approach to improving summary quality. In our paper, we propose a compression-based method that post-edits summaries to improve factual consistency.
Our model takes as input the source document along with an initial summary in which entities not found in the source, according to named-entity recognition, are marked with special tokens.
The model produces a compressed output with these entities removed, improving entity precision with respect to the input by up to 25% while retaining informativeness.
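To make the expected input concrete, here is a minimal sketch of how such an input could be constructed. It assumes spaCy for NER and uses the `##` entity markers and `</s>` separator shown in the `run.py` example below; the string-matching heuristic and the helper name `mark_unsupported_entities` are ours, not the repository's code.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def mark_unsupported_entities(source: str, summary: str) -> str:
    """Wrap summary entities absent from the source in ## markers (illustrative)."""
    marked = summary
    for ent in nlp(summary).ents:
        # Simple exact-match check; a stand-in for the repository's entity matching.
        if ent.text not in source:
            marked = marked.replace(ent.text, f"## {ent.text} ##")
    return f"{source} </s> {marked}"
```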
For training the perturber model, we use the data from the paper "Overcoming the Lack of Parallel Data in Sentence Compression", found here. We provide the script `./preprocess_perturber.py` for processing this data into a format suitable for model training.
Examples are filtered so the compressed sentence length is at least 75% that of the uncompressed sentence.
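As a rough illustration of this filter (a sketch only; `preprocess_perturber.py` implements the actual logic, and whether length is measured in tokens or characters is an assumption here):

```python
def keep_example(uncompressed: str, compressed: str) -> bool:
    # Keep pairs where the compressed sentence is at least 75% as long
    # as the uncompressed one (token-level length assumed).
    return len(compressed.split()) >= 0.75 * len(uncompressed.split())
```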
We provide the script `./train.sh` for model training, which makes use of the files `train.py` and `select_checkpoints.py`.
Our trained perturber checkpoint can be found here.
We provide the script `./preprocess_posteditor.py`, which takes in a folder containing `{subset}.source` and `{subset}.target` dataset files and prepares the input on which the perturber is applied.
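The `{subset}.source`/`{subset}.target` files follow the usual one-example-per-line convention; here is a minimal loader sketch (the helper below is hypothetical, not part of the repository):

```python
from pathlib import Path

def load_split(folder: str, subset: str):
    # Read aligned documents and summaries, one example per line.
    sources = Path(folder, f"{subset}.source").read_text().splitlines()
    targets = Path(folder, f"{subset}.target").read_text().splitlines()
    assert len(sources) == len(targets), "source/target files must be aligned"
    return list(zip(sources, targets))
```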
The data should first be filtered according to entity precision (see below).
Then, run `./generate.py` on the output of this preprocessing step to produce training data for the post-editor.
The script `truncate_csv.py` should be run on the resulting CSV file to ensure that the model summaries we learn to post-edit are not truncated.
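The exact truncation criterion lives in `truncate_csv.py`; purely as an illustration, a check of this kind might drop rows whose summary is cut off mid-sentence (the column name and the end-of-sentence heuristic below are assumptions):

```python
import csv

def ends_cleanly(text: str) -> bool:
    # Assumed heuristic: a non-truncated summary ends with sentence-final punctuation.
    return text.rstrip().endswith((".", "!", "?", '"'))

def drop_truncated(in_path: str, out_path: str, col: str = "summary") -> None:
    with open(in_path, newline="") as f_in, open(out_path, "w", newline="") as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
        writer.writeheader()
        writer.writerows(row for row in reader if ends_cleanly(row[col]))
```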
The above `./train.sh` can be used for training the post-editor on the output of the above scripts. Our trained post-editor checkpoint for XSum can be found here and for CNN/DM here.
The code below, from `run.py`, shows how to use the pretrained models:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from generate import get_gen

# Perturber input: a sentence, "||", then the entities to insert separated by "|".
sentence_to_perturb = "It was the season in which Chelsea played by their own record book. || 18 | the Champions League"
SRC = "The Blues topped the table while, at the other end, Sunderland could not escape the drop after 110 straight days at the bottom. Records tumbled through to the last day of the campaign, when we saw 33 goalscorers, more than ever before in a single day of a 38-game season. Goals scored from outside the penalty area fell to a Premier League low of 11.6% so, if you like a goalmouth scramble, this was your year."
# Post-editor input: the source document, a </s> separator, and the summary with
# unsupported entities wrapped in "##" special tokens.
sentence_to_postedit = f"{SRC} </s> It was the ## 18 ## th consecutive season in which Chelsea played by their own record book in ## the Champions League ##."
perturb_model_name = "PATH_TO_PERTURBER"
posteditor_model_name = "PATH_TO_POSTEDITOR"
gen_args = {"max_enc_len": 1024, "device": "cuda", "length_penalty": 1.0, "num_beams": 6, "min_gen_len": 10, "max_gen_len": 60}
perturber_tok = AutoTokenizer.from_pretrained(perturb_model_name)
perturber_model = AutoModelForSeq2SeqLM.from_pretrained(perturb_model_name).to(gen_args["device"])
posteditor_tok = AutoTokenizer.from_pretrained(posteditor_model_name)
posteditor_model = AutoModelForSeq2SeqLM.from_pretrained(posteditor_model_name).to(gen_args["device"])
perturbed_output = get_gen([sentence_to_perturb], perturber_model, perturber_tok, gen_args)[0]
print(perturbed_output)
# It was the 18th season in which Chelsea played by their own record book in the Champions League.

postedited_output = get_gen([sentence_to_postedit], posteditor_model, posteditor_tok, gen_args)[0]
print(postedited_output)
# It was the season in which Chelsea played by their own record book.
```
See `./entity_score.py` for entity precision and recall calculations.
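For reference, these metrics reduce to set overlap between named entities; below is a hedged sketch assuming spaCy NER (`entity_score.py` contains the actual implementation, which may match or normalize entities differently, and the recall definition here is an assumption):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def ents(text: str) -> set:
    return {e.text.lower() for e in nlp(text).ents}

def entity_precision(summary: str, source: str) -> float:
    # Fraction of summary entities that also appear among the source's entities.
    s = ents(summary)
    return len(s & ents(source)) / len(s) if s else 1.0

def entity_recall(summary: str, reference: str) -> float:
    # Fraction of reference entities recovered by the summary (assumed definition).
    r = ents(reference)
    return len(ents(summary) & r) / len(r) if r else 1.0
```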
When referencing this repository, please cite this paper:
```bibtex
@misc{fabbri-etal-2022-improving,
      title={Improving Factual Consistency in Summarization with Compression-Based Post-Editing},
      author={Alexander R. Fabbri and Prafulla Choubey and Jesse Vig and Chien-Sheng Wu and Caiming Xiong},
      year={2022},
      eprint={2211.06196},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2211.06196}
}
```
This repository is released under the BSD-3 License.