This repository contains the corpus created for the publication 'Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data', the annotation tool developed for that purpose, and an example Amazon Mechanical Turk HIT.
Included in the Corpus folder is the following:
Included in the SourceDocuments folder are the .xml files of all source topics and a .txt file with the topic names.
Included in the AMTAllNuggets folder is a tab-delimited csv file with all annotations from Amazon Mechanical Turk in the format worker [tab] annotation. The turker IDs have been hashed in order to anonymize them.
Included in the Trees folder are the input documents for the tree annotation, the trees from three annotators, and the gold standard trees created from these trees.
Included in the AnnotationTool folder is the annotation tool as a Java archive as well as the source code and documentation of the tool.
Included in the HIT-Template folder is an example HIT along with its JavaScript and stylesheet.
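The worker [tab] annotation format of the AMTAllNuggets file can be read with a few lines of Python. The following is a minimal sketch, not part of the repository: the function name `read_nuggets` and the sample values are made up for illustration, and it assumes the file has no header row and is UTF-8 encoded.

```python
import csv
import io
from collections import defaultdict


def read_nuggets(handle):
    """Group annotations by hashed worker ID from a 'worker<TAB>annotation' stream."""
    by_worker = defaultdict(list)
    # QUOTE_NONE keeps quotation marks inside annotations untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # Re-join any extra tabs so they stay part of the annotation text.
        worker, annotation = row[0], "\t".join(row[1:])
        by_worker[worker].append(annotation)
    return dict(by_worker)


# Tiny in-memory sample standing in for the real file (made-up values).
sample = io.StringIO(
    "a1f3\tfirst nugget\n"
    "b7c9\tsecond nugget\n"
    "a1f3\tthird nugget\n"
)
nuggets = read_nuggets(sample)
for worker, annotations in nuggets.items():
    print(worker, len(annotations))
```

To read the actual file, pass a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` points to the csv file inside the AMTAllNuggets folder.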
If you find the corpus and/or annotation tool useful, please cite the following paper: Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data
@inproceedings{Tauchmann.et.al.2018.LREC,
author = {Tauchmann, Christopher and Arnold, Thomas and Hanselowski, Andreas and Meyer, Christian M. and Mieskes, Margot},
title = {Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data},
booktitle = {Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC)},
month = {May},
year = {2018},
pages = {3184--3191},
location = {Miyazaki, Japan},
url = {http://www.lrec-conf.org/proceedings/lrec2018/pdf/252.pdf}
}
Abstract: Automatic summarization has so far focused on datasets of ten to twenty rather short documents, typically news articles. But automatic systems could in theory analyze hundreds of documents from a wide range of sources and provide an overview to the interested reader. Such a summary would ideally present the most general issues of a given topic and allow for more in-depth information on specific aspects within said topic. In this paper, we present a new approach for creating hierarchical summarization corpora from large, heterogeneous document collections. We first extract relevant content using crowdsourcing and then ask trained annotators to order the relevant information hierarchically. This yields tree structures covering the specific facets discussed in a document collection. Our resulting corpus is freely available. It can be used to develop and evaluate hierarchical summarization systems.
Contact person: Christopher Tauchmann, tauchmann@ukp.informatik.tu-darmstadt.de
https://www.aiphes.tu-darmstadt.de/
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
- The annotations are licensed under the Creative Commons CC-BY 4.0 license.
- The original content from ClueWeb12 keeps its original license.
- The annotation tool is licensed under the GNU General Public License v3.0.