This directory contains the trained models, diagnostic test sets, and augmented training data for the paper *Factuality Checker is not Faithful: Adversarial Meta-evaluation of Factuality in Summarization*.
The six representative factuality checkers covered in the paper are:
- FactCC: the code and the original FactCC model can be downloaded from https://github.com/salesforce/factCC. The four FactCC variants trained with subsampling and augmented data can be downloaded here.
- Dae: the code and trained model can be downloaded from https://github.com/tagoyal/dae-factuality.
- BertMnli, RobertaMnli, ElectraMnli: the code is included in `baseline`, and the trained models can be downloaded here.
- Feqa: the code and trained model can be downloaded from https://github.com/esdurmus/feqa.
The table below lists the six factuality metrics along with their model types and training data.
Models | Type | Train data |
---|---|---|
MnliBert | NLI-S | MNLI |
MnliRoberta | NLI-S | MNLI |
MnliElectra | NLI-S | MNLI |
Dae | NLI-A | PARANMT-G |
FactCC | NLI-S | CNNDM-G |
Feqa | QA | QA2D,SQuAD |
Model types and training data of the factuality metrics. NLI-A and NLI-S denote NLI-based metrics that define facts as dependency arcs and spans, respectively. PARANMT-G and CNNDM-G denote training data automatically generated from PARANMT and CNN/DailyMail.
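To make the NLI-based setup concrete, here is a minimal sketch of how entailment probabilities for a (document, claim) pair might be turned into a factuality decision. The min-aggregation over spans and the 0.5 threshold are illustrative assumptions, not the exact recipe of any checker above.

```python
from typing import List, Tuple

def claim_factuality(span_entail_probs: List[float],
                     threshold: float = 0.5) -> Tuple[float, bool]:
    """Aggregate span-level entailment probabilities into a claim-level
    factuality score. Min-aggregation treats a claim as factual only if
    every span is entailed by the document (illustrative assumption)."""
    score = min(span_entail_probs)
    return score, score >= threshold
```

For a span-based (NLI-S) metric, each span of the claim would be scored against the source document by the NLI model; an arc-based (NLI-A) metric such as Dae would score dependency arcs instead.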
The code for the adversarial transformations is in the `adversarial_transformation` directory. To apply the transformations, run:

    CUDA_VISIBLE_DEVICES=0 python ./adversarial_transformation/main.py -path DATA_PATH -save_dir SAVE_DIR -trans_type all

Replace DATA_PATH and SAVE_DIR with your own data path and save directory.
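If you prefer to generate each diagnostic set separately rather than passing `-trans_type all`, a small helper like the one below can assemble the per-type commands. The individual type names (`AntoSub`, `NumEdit`, `EntRep`, `SynPrun`) are assumptions taken from the statistics table; check the argument parser in `main.py` for the values it actually accepts.

```python
import shlex

# Hypothetical per-transformation type names; verify against main.py's
# argument parser before use.
TRANS_TYPES = ["AntoSub", "NumEdit", "EntRep", "SynPrun"]

def build_command(data_path: str, save_dir: str, trans_type: str = "all") -> str:
    """Assemble one invocation of the transformation script."""
    return (
        "python ./adversarial_transformation/main.py"
        f" -path {shlex.quote(data_path)}"
        f" -save_dir {shlex.quote(save_dir)}"
        f" -trans_type {trans_type}"
    )

# One command per transformation type.
commands = [build_command("DATA_PATH", "SAVE_DIR", t) for t in TRANS_TYPES]
```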
Six base evaluation datasets and four adversarial transformations are included in the paper.
- Base evaluation datasets
  - DocAsClaim: document sentences used as claims.
  - RefAsClaim: reference-summary sentences used as claims.
  - FaccTe: human-annotated evaluation set from *Evaluating the Factual Consistency of Abstractive Text Summarization*.
  - QagsC: human-annotated evaluation set from *Asking and Answering Questions to Evaluate the Factual Consistency of Summaries*.
  - RankTe: human-annotated evaluation set from *Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference*.
  - FaithFact: human-annotated evaluation set from *On Faithfulness and Factuality in Abstractive Summarization*.
- Adversarial transformations
- Antonym Substitution
- Numerical Editing
- Entity Replacement
- Syntactic Pruning
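As an illustration of the simplest of these transformations, a toy antonym substitution might look like the sketch below. The hand-written antonym list and the capitalization/punctuation handling are placeholders; the implementation in `adversarial_transformation` may instead rely on a lexical resource such as WordNet.

```python
from typing import Optional

# Tiny illustrative antonym dictionary (a placeholder, not the paper's resource).
ANTONYMS = {
    "increase": "decrease",
    "win": "lose",
    "alive": "dead",
    "before": "after",
}

def antonym_substitution(claim: str) -> Optional[str]:
    """Return a corrupted claim with the first matched word replaced by its
    antonym, or None if no substitution applies."""
    tokens = claim.split()
    for i, tok in enumerate(tokens):
        key = tok.lower().strip(".,")
        if key in ANTONYMS:
            sub = ANTONYMS[key]
            # Preserve the original token's capitalization.
            if tok[0].isupper():
                sub = sub.capitalize()
            # Re-attach any trailing punctuation stripped from the token.
            trailing = tok[len(tok.rstrip(".,")):]
            tokens[i] = sub + trailing
            return " ".join(tokens)
    return None
```

A claim corrupted this way should be labeled non-factual with respect to its source document, which is what lets the transformation probe a checker's sensitivity to such edits.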
Each adversarial transformation can be applied to each of the six base evaluation datasets, yielding 24 diagnostic evaluation sets. All base evaluation datasets and diagnostic evaluation sets can be found here. Detailed information on the 6 base test sets and 24 diagnostic sets is shown in the table below:
Base Test Set | Origin | Nov.(%) | #Sys. | #Sam. | AntoSub | NumEdit | EntRep | SynPrun |
---|---|---|---|---|---|---|---|---|
DocAsClaim | CNNDM | 0.0 | 0 | 11490 | 26487 | 25283 | 6816 | 9533 |
RefAsClaim | CNNDM | 77.7 | 0 | 10000 | 14131 | 11621 | 28758 | 4572 |
FaccTe | CNNDM | 54.0 | 10 | 503 | 670 | 515 | 440 | 245 |
QagsC | CNNDM | 28.6 | 1 | 504 | 711 | 615 | 539 | 351 |
RankTe | CNNDM | 52.5 | 3 | 1072 | 1646 | 1310 | 767 | 540 |
FaithFact | XSum | 99.2 | 5 | 2332 | 363 | 94 | 114 | 118 |
Detailed statistics of the base (left columns) and diagnostic (right columns) test sets. For the base test sets, Origin indicates the dataset the source documents and summaries come from; CNNDM is the CNN/DailyMail dataset. Nov.(%) is the proportion of trigrams in the claims that do not appear in the source documents. #Sys. and #Sam. are the number of summarization systems the summaries come from and the test-set size, respectively. For the diagnostic test sets (AntoSub, NumEdit, EntRep, SynPrun), each cell gives the sample size of the set.
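The Nov.(%) column can be reproduced, up to tokenization details (lower-cased whitespace tokenization is an assumption here), with a short trigram-novelty computation:

```python
from typing import List, Tuple

def novel_trigram_ratio(document: str, claim: str) -> float:
    """Percentage of claim trigrams that do not appear in the source
    document (the Nov.(%) statistic, under assumed tokenization)."""
    def trigrams(text: str) -> List[Tuple[str, ...]]:
        toks = text.lower().split()
        return [tuple(toks[i:i + 3]) for i in range(len(toks) - 2)]

    doc_tris = set(trigrams(document))
    claim_tris = trigrams(claim)
    if not claim_tris:
        return 0.0
    novel = sum(1 for t in claim_tris if t not in doc_tris)
    return 100.0 * novel / len(claim_tris)
```

This also explains the extremes in the table: DocAsClaim copies sentences verbatim from the document (Nov. = 0.0), while FaithFact's XSum summaries are highly abstractive (Nov. = 99.2).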
The 140 samples misclassified by FactCC are in the `data` directory.
The augmented training data can be downloaded here.