Code for PrefScore: Pairwise Preference Learning for Reference-free Summarization Quality Assessment
- TensorFlow and TensorFlow Datasets
- PyTorch
- Hugging Face Transformers
- spaCy
- SciPy
- SummEval (optional)
- NLTK (optional)
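As an optional sanity check (a minimal sketch; no specific versions are pinned here), you can verify that the main dependencies are importable:

```python
# Minimal sanity check that the main dependencies are importable.
# This is only an illustration; it does not check for specific versions.
import tensorflow
import tensorflow_datasets
import torch
import transformers
import spacy
import scipy

print(tensorflow.__version__, torch.__version__, transformers.__version__)
```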
- pre/: code for negative sampling
- human/: code for human evaluation
- config.py: config file for folder and training settings
- model.py: script for training the models
- evaluate.py: script for evaluating the trained models on target datasets
Code for generating negative samples is in the pre/ folder:

```bash
cd pre
python3 ordered_generation.py
```

Edit pre/sentence_conf.py to change the negative sampling settings.
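For intuition only: the paper generates inferior summaries by corrupting base summaries (see the abstract in the citation below). The sketch below is not the repository's implementation in pre/, and every name in it is made up for illustration; it only shows the general idea of corrupting a summary.

```python
import random

def corrupt_summary(summary: str, drop_prob: float = 0.3, seed: int = 0) -> str:
    """Illustrative corruption: randomly drop and shuffle words to produce an
    inferior version of a base summary. The scripts in pre/ implement their
    own, more careful corruption strategies."""
    rng = random.Random(seed)
    words = summary.split()
    kept = [w for w in words if rng.random() > drop_prob]
    rng.shuffle(kept)
    return " ".join(kept) if kept else summary

base = "The committee approved the bill after a lengthy debate on its budget impact."
print(corrupt_summary(base))  # a corrupted, lower-quality variant of the base summary
```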
Run python3 model.py -h for the full command-line arguments.

Example (training on the preprocessed BillSum dataset):

```bash
python3 model.py --dataset billsum
```
To evaluate the trained model on Newsroom, REALSumm, or TAC2010, see the human/ folder for detailed instructions on obtaining the processed files:

- human/newsroom/newsroom-human-eval.csv
- human/realsumm/realsumm_100.tsv
- human/tac/TAC2010_all.json
Run python3 evaluate.py -h for the full command-line arguments.

Example (evaluating the model trained on BillSum, on Newsroom):

```bash
python3 evaluate.py --dataset billsum --target newsroom
```
Code for computing the correlation between our models' predictions and human ratings from the three datasets is in the human/ folder.
- To evaluate on a custom dataset, format the dataset as a TSV file in which each line starts with a document, followed by several summaries of that document, all separated by '\t'. See example.tsv for an example. (A sketch for reading and scoring such a file appears after the Python example below.)
- Example of using the metric in a script:
```python
import torch

import config as CFG
from model import Scorer
from evaluate import evaluate

# CKPT_PATH is the path to a trained .pth model file
scorer = Scorer()
scorer.load_state_dict(torch.load(CKPT_PATH, map_location=CFG.DEVICE))
scorer.to(CFG.DEVICE)
scorer.eval()

# Test example
docs = ["This is a document.", "This is another document."]
sums = ["This is summary1", "This is summary2."]
results = evaluate(docs, sums, scorer)
```
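Building on the example above, the sketch below reads a custom TSV file in the format described earlier (a document followed by its summaries on each line) and optionally computes the Spearman correlation between the predicted scores and human ratings with SciPy. The file name my_dataset.tsv, the human_ratings list, and the assumption that evaluate returns one score per (document, summary) pair in input order are illustrative assumptions, not guarantees about this repository's API.

```python
import csv

from scipy.stats import spearmanr

# Read a custom TSV: each line is a document followed by one or more summaries.
# "my_dataset.tsv" is a placeholder file name.
docs, sums = [], []
with open("my_dataset.tsv", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        document, summaries = row[0], row[1:]
        for summary in summaries:
            docs.append(document)
            sums.append(summary)

# Score every (document, summary) pair with the scorer loaded above.
# Assumption: evaluate() returns one score per pair, in input order.
scores = list(evaluate(docs, sums, scorer))
print(scores)

# Optional: correlate the predicted scores with human ratings.
# human_ratings is hypothetical; supply one rating per pair, in the same order.
human_ratings = [0.8, 0.3, 0.9]  # replace with your own ratings
if len(human_ratings) == len(scores):
    rho, p_value = spearmanr(scores, human_ratings)
    print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```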
```bibtex
@inproceedings{luo-etal-2022-prefscore,
    title = "{P}ref{S}core: Pairwise Preference Learning for Reference-free Summarization Quality Assessment",
    author = "Luo, Ge  and
      Li, Hebi  and
      He, Youbiao  and
      Bao, Forrest Sheng",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.515",
    pages = "5896--5903",
    abstract = "Evaluating machine-generated summaries without a human-written reference summary has been a need for a long time. Inspired by preference labeling in existing work of summarization evaluation, we propose to judge summary quality by learning the preference rank of summaries using the Bradley-Terry power ranking model from inferior summaries generated by corrupting base summaries. Extensive experiments on several datasets show that our weakly supervised scheme can produce scores highly correlated with human ratings.",
}
```