Code for PrefScore: Pairwise Preference Learning for Reference-free Summarization Quality Assessment
- TensorFlow and TensorFlow Datasets
- PyTorch
- Hugging Face Transformers
- spaCy
- SciPy
- SummEval (optional)
- NLTK (optional)
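As an optional sanity check (a minimal sketch; no specific versions are pinned here), you can verify that the main dependencies are importable:

```python
# Minimal sanity check that the main dependencies are importable.
# This is only an illustration; it does not check for specific versions.
import tensorflow
import tensorflow_datasets
import torch
import transformers
import spacy
import scipy

print(tensorflow.__version__, torch.__version__, transformers.__version__)
```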
- pre/: code for negative sampling
- human/: code for human evaluation
- config.py: config file for folder and training settings
- model.py: script for training the models
- evaluate.py: script for evaluating the trained models on target datasets
Code for generating negative samples is in the pre/ folder:

```bash
cd pre
python3 ordered_generation.py
```

Edit pre/sentence_conf.py to change the negative sampling settings.
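For intuition only: the paper generates inferior summaries by corrupting base summaries (see the abstract in the citation below). The sketch below is not the repository's implementation in pre/, and every name in it is made up for illustration; it only shows the general idea of corrupting a summary.

```python
import random

def corrupt_summary(summary: str, drop_prob: float = 0.3, seed: int = 0) -> str:
    """Illustrative corruption: randomly drop and shuffle words to produce an
    inferior version of a base summary. The scripts in pre/ implement their
    own, more careful corruption strategies."""
    rng = random.Random(seed)
    words = summary.split()
    kept = [w for w in words if rng.random() > drop_prob]
    rng.shuffle(kept)
    return " ".join(kept) if kept else summary

base = "The committee approved the bill after a lengthy debate on its budget impact."
print(corrupt_summary(base))  # a corrupted, lower-quality variant of the base summary
```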
Run python3 model.py -h for the full command-line arguments.

Example (training on the preprocessed BillSum dataset):

```bash
python3 model.py --dataset billsum
```
To evaluate the trained model on Newsroom, REALSumm, or TAC2010, see the human/ folder for detailed instructions on obtaining the processed files:

- human/newsroom/newsroom-human-eval.csv
- human/realsumm/realsumm_100.tsv
- human/tac/TAC2010_all.json
Run python3 evaluate.py -h for the full command-line arguments.

Example (evaluating the model trained on BillSum, on Newsroom):

```bash
python3 evaluate.py --dataset billsum --target newsroom
```
Code for computing the correlation between our models' predictions and human ratings from the three datasets is in the human/ folder.
- To evaluate on a custom dataset, format the dataset as a TSV file in which each line starts with a document, followed by several summaries of that document, all separated by '\t'. See example.tsv for an example. (A sketch for reading and scoring such a file appears after the Python example below.)
- Example of using the metric in a script:
```python
import torch

import config as CFG
from model import Scorer
from evaluate import evaluate

# CKPT_PATH is the path to a trained .pth model file
scorer = Scorer()
scorer.load_state_dict(torch.load(CKPT_PATH, map_location=CFG.DEVICE))
scorer.to(CFG.DEVICE)
scorer.eval()

# Test example
docs = ["This is a document.", "This is another document."]
sums = ["This is summary1", "This is summary2."]
results = evaluate(docs, sums, scorer)
```
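Building on the example above, the sketch below reads a custom TSV file in the format described earlier (a document followed by its summaries on each line) and optionally computes the Spearman correlation between the predicted scores and human ratings with SciPy. The file name my_dataset.tsv, the human_ratings list, and the assumption that evaluate returns one score per (document, summary) pair in input order are illustrative assumptions, not guarantees about this repository's API.

```python
import csv

from scipy.stats import spearmanr

# Read a custom TSV: each line is a document followed by one or more summaries.
# "my_dataset.tsv" is a placeholder file name.
docs, sums = [], []
with open("my_dataset.tsv", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        document, summaries = row[0], row[1:]
        for summary in summaries:
            docs.append(document)
            sums.append(summary)

# Score every (document, summary) pair with the scorer loaded above.
# Assumption: evaluate() returns one score per pair, in input order.
scores = list(evaluate(docs, sums, scorer))
print(scores)

# Optional: correlate the predicted scores with human ratings.
# human_ratings is hypothetical; supply one rating per pair, in the same order.
human_ratings = [0.8, 0.3, 0.9]  # replace with your own ratings
if len(human_ratings) == len(scores):
    rho, p_value = spearmanr(scores, human_ratings)
    print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```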
```bibtex
@inproceedings{luo-etal-2022-prefscore,
    title = "{P}ref{S}core: Pairwise Preference Learning for Reference-free Summarization Quality Assessment",
    author = "Luo, Ge  and
      Li, Hebi  and
      He, Youbiao  and
      Bao, Forrest Sheng",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.515",
    pages = "5896--5903",
    abstract = "Evaluating machine-generated summaries without a human-written reference summary has been a need for a long time. Inspired by preference labeling in existing work of summarization evaluation, we propose to judge summary quality by learning the preference rank of summaries using the Bradley-Terry power ranking model from inferior summaries generated by corrupting base summaries. Extensive experiments on several datasets show that our weakly supervised scheme can produce scores highly correlated with human ratings.",
}
```