(Yiran Chen*, Pengfei Liu*, Ming Zhong, Zi-Yi Dou, Danqing Wang, Xipeng Qiu, Xuanjing Huang)
Many works evaluate summarization systems in an in-domain setting (the model is trained and tested on the same dataset). In this work we try to understand model performance from different perspectives in a cross-dataset setting. The picture below illustrates the main motivation: summarization systems receive different rankings when evaluated under different measures (abstractive models are shown in red, extractive ones in blue):
Q1: How do different neural architectures of summarizers influence cross-dataset generalization performance?
Q2: Do different generation approaches (extractive vs. abstractive) influence the cross-dataset generalization ability of summarizers?
- Datasets
- Semantic Equivalence (ROUGE); a short ROUGE sketch follows this list
- Factuality (FactCC)
- Dataset bias (a detailed explanation is given in our paper; the corresponding code is in Data-bias-metrics/, and a simplified sketch also follows this list)
- Coverage
- Copy length
- Novelty
- Repetition
- Sentence fusion score
- Stiffness: the metric score averaged over all train/test dataset pairs, where $U_{ij}$ denotes the metric score when the model is trained on dataset $i$ and tested on dataset $j$; it reflects the absolute level of cross-dataset performance.
- Stableness: the cross-dataset metric score normalized by the corresponding in-dataset score and averaged over all dataset pairs; it reflects how well performance holds up when the model is moved away from its training dataset. A sketch of both measures is given at the end of this section.
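As a quick reference, the snippet below shows one way to compute the semantic-equivalence scores with the `rouge-score` package. The repository's own evaluation scripts may use a different ROUGE implementation, so treat this as a minimal illustration rather than the exact pipeline.

```python
# Minimal ROUGE illustration using the rouge-score package (pip install rouge-score).
# This is not necessarily the exact implementation used in this repository.
from rouge_score import rouge_scorer

def rouge_f1(reference: str, candidate: str) -> dict:
    """Return ROUGE-1/2/L F1 scores for a single (reference, candidate) pair."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return {name: s.fmeasure for name, s in scores.items()}

print(rouge_f1("the cat sat on the mat", "a cat was sitting on the mat"))
```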
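The dataset-bias statistics listed above are defined precisely in the paper and implemented in Data-bias-metrics/. The sketch below only illustrates the general idea with simplified, assumption-laden variants (unigram coverage, bigram novelty and repetition, greedy longest-copy length); the exact formulas in the paper may differ.

```python
# Simplified, illustrative versions of the dataset-bias statistics.
# The authoritative definitions are in the paper and in Data-bias-metrics/.

def _ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage(source: str, summary: str) -> float:
    """Fraction of summary tokens that also appear in the source."""
    src = set(source.split())
    summ = summary.split()
    return sum(t in src for t in summ) / max(len(summ), 1)

def novelty(source: str, summary: str, n: int = 2) -> float:
    """Fraction of summary n-grams that never occur in the source."""
    src = set(_ngrams(source.split(), n))
    summ = _ngrams(summary.split(), n)
    return sum(g not in src for g in summ) / max(len(summ), 1)

def repetition(summary: str, n: int = 2) -> float:
    """Fraction of repeated n-grams within the summary itself."""
    summ = _ngrams(summary.split(), n)
    return 1.0 - len(set(summ)) / max(len(summ), 1)

def avg_copy_length(source: str, summary: str) -> float:
    """Average length of maximal token spans copied verbatim from the source."""
    src, summ = source.split(), summary.split()
    spans, i = [], 0
    while i < len(summ):
        best = 0
        for j in range(len(src)):
            k = 0
            while i + k < len(summ) and j + k < len(src) and summ[i + k] == src[j + k]:
                k += 1
            best = max(best, k)
        spans.append(best)
        i += max(best, 1)
    copied = [s for s in spans if s > 0]
    return sum(copied) / max(len(copied), 1)
```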
The stiffness and stableness of various summarizers are displayed below. For fine-grained results and a comprehensive analysis, please refer to the paper.
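As a rough sketch of how the two aggregate measures can be computed from a cross-dataset score matrix (rows index the training dataset, columns the test dataset): stiffness is taken here as the plain average of all entries, and stableness as the average of each cross-dataset score normalized by the in-dataset score on the same test set. The normalization choice is our reading of the definitions; the paper's exact formulas are authoritative.

```python
# Illustrative computation of stiffness and stableness from a matrix U where
# U[i, j] is the metric score of a system trained on dataset i and tested on
# dataset j. The stableness normalization (by the in-dataset score U[j, j])
# is an assumption here; see the paper for the exact definitions.
import numpy as np

def stiffness(U: np.ndarray) -> float:
    """Average metric score over all train/test dataset pairs."""
    return float(U.mean())

def stableness(U: np.ndarray) -> float:
    """Average of cross-dataset scores normalized by in-dataset scores."""
    diag = np.diag(U)           # U[j, j]: trained and tested on dataset j
    return float((U / diag[None, :]).mean())

# Toy 3x3 example (three hypothetical datasets).
U = np.array([[40.0, 20.0, 30.0],
              [25.0, 35.0, 28.0],
              [30.0, 22.0, 38.0]])
print(stiffness(U), stableness(U))
```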