This repository contains the SubSumE dataset for subjective document summarization. See the paper and the talk for details on dataset creation. Also check out our work SuDocu on example-based document summarization.
Download the dataset from here.
The dataset contains:
- Simplified text from 48 Wikipedia pages of the states in the US. Additionally, all the sentences in these documents are put together in a single file, processed_state_sentences.csv, and each sentence is assigned a unique sentence id that is used in the summary json files.
- Intent-based summaries created by human annotators. Each datapoint file in the directory user_summary_jsons contains a json with summaries of the Wikipedia pages of eight states, with the following keys:
  - intent: Summarization intent provided to human annotators for generating the summary
  - summaries: List of summary jsons for the eight states assigned to the annotator. Each json in the list contains the following keys:
    - state_name: Name of the state
    - sentence_ids: Global ids of the sentences (with respect to processed_state_sentences.csv) present in the summary
    - sentences: List of sentences present in the summary
    - use_keywords: Keywords used by the annotator to search the document when creating summaries
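
As a rough illustration of the layout described above, the following Python sketch reads one annotator file and groups its summaries by state. It relies only on the keys listed above (intent, summaries, state_name, sentence_ids, sentences). The helper names are hypothetical, and the assumption that the datapoint files carry a .json extension is ours, not something the dataset description guarantees.

```python
import json
from pathlib import Path


def iter_user_files(directory="user_summary_jsons"):
    """Yield annotator files in sorted order (assumes a .json extension)."""
    yield from sorted(Path(directory).glob("*.json"))


def load_user_summaries(path):
    """Read one annotator file and return (intent, list of summary dicts)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["intent"], data["summaries"]


def summary_by_state(summaries):
    """Map each state name to the sentence ids and sentences of its summary."""
    return {
        s["state_name"]: {
            "sentence_ids": s["sentence_ids"],
            "sentences": s["sentences"],
        }
        for s in summaries
    }


if __name__ == "__main__":
    # Print the intent and the summary length per state for every annotator file.
    for path in iter_user_files():
        intent, summaries = load_user_summaries(path)
        print(path.name, "|", intent)
        for state, summary in summary_by_state(summaries).items():
            print("  ", state, len(summary["sentence_ids"]), "sentences")
```

The sentence ids returned here are the global ids from processed_state_sentences.csv, so they can be used to look up the same sentences in that file once its column layout has been inspected.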
This work was supported by the NSF under grants IIS-1453543, IIS-1943971, and CCF-1763423, and a Microsoft Research Dissertation Grant.