This repository is the PyTorch implementation of the paper
DeepChannel: Salience Estimation by Contrastive Learning for Extractive Document Summarization.
Jiaxin Shi*, Chen Liang*, Lei Hou, Juanzi Li, Zhiyuan Liu, Hanwang Zhang.
In this paper, we propose DeepChannel, a robust, data-efficient, and interpretable neural model for extractive document summarization. Given any document-summary pair, we estimate a salience score, which is modeled using an attention-based deep neural network, to represent the salience degree of the summary for yielding the document. We devise a contrastive training strategy to learn the salience estimation network, and then use the learned salience score as a guide and iteratively extract the most salient sentences from the document as our generated summary.
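The contrastive training strategy can be sketched as a margin-based hinge loss: the salience score of a true document-summary pair should exceed that of a corrupted (negative) pair by at least a margin. The snippet below is an illustration only, not the repository's actual training code; the function name and signature are hypothetical.

```python
def contrastive_loss(score_pos, score_neg, margin=1.0):
    """Hinge-style contrastive loss (illustrative sketch): penalize the
    model when the positive pair's salience score does not exceed the
    negative pair's score by at least `margin`."""
    return max(0.0, margin - score_pos + score_neg)
```

In the actual model the two scores come from the attention-based salience network; `--margin` in `train.py` controls the margin value.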
If you find this code useful in your research, please cite:

```
@inproceedings{Shi2018DeepChannel,
  title     = {DeepChannel: Salience Estimation by Contrastive Learning for Extractive Document Summarization},
  author    = {Shi, Jiaxin and Liang, Chen and Hou, Lei and Li, Juanzi and Liu, Zhiyuan and Zhang, Hanwang},
  booktitle = {AAAI},
  year      = {2019}
}
```
- python==3.6
- pytorch==1.0.0
- spacy
- nltk
- pyrouge & rouge
Before training the model, please follow the instructions below to prepare all the data needed for the experiments.
Please download the GloVe 300d pretrained vectors, which are used for word embedding initialization in all experiments.
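For reference, GloVe files store one word per line followed by its vector components. A minimal sketch of building an embedding matrix from such lines is shown below; the helper name and the random-initialization scheme for out-of-vocabulary words are illustrative assumptions, not the repository's actual preprocessing code.

```python
import numpy as np

def load_glove(lines, vocab, dim=300):
    """Parse GloVe-format lines ("word v1 v2 ... v_dim") and build an
    embedding matrix for `vocab`. Words missing from GloVe keep a
    small random initialization (an assumption for this sketch)."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    emb = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    for i, word in enumerate(vocab):
        if word in vectors:
            emb[i] = vectors[word]
    return emb
```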
- Download the CNN-Daily story corpus.
- Preprocess the original CNN-Daily story corpus and generate the data file:
```
cd dataset
python process.py --glove </path/to/the/pickle/file> --data cnn+dailymail --data-dir </path/to/the/corpus> --save-path </path/to/the/output/file> --max-word-num MAX_WORD_NUM
```
The output file will be used by the data loader during training and testing. To reproduce the state-of-the-art result, please use the 300d GloVe file and the default `--max-word-num`.
- Download the DUC2007 corpus.
- Preprocess the original DUC2007 corpus and generate the data file:
```
cd dataset
python process.py --glove </path/to/the/pickle/file> --data duc2007 --data-dir </path/to/the/corpus> --save-path </path/to/the/output/file>
```
The output file will be used in testing.
We modified the original Python wrapper of ROUGE-1.5.5, fixing some errors and rewriting its interfaces in a friendlier way. To accelerate training and avoid the frequent IO operations of ROUGE-1.5.5, we pre-calculate the ROUGE attention matrix of every document-summary pair. Please use the following command to accomplish this step:
```
python offline_pyrouge.py --data-path </path/to/the/processed/data> --save-path </path/to/the/output/file>
```
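To illustrate what this pre-calculation step produces, the sketch below computes a pairwise matrix of sentence-level ROUGE scores between a document and a summary. It uses a simplified ROUGE-1 F1 over token overlap for self-containment; the repository itself calls the official ROUGE-1.5.5 through the patched pyrouge wrapper, and both function names here are hypothetical.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Simplified ROUGE-1 F1 over unigram overlap (illustration only;
    the repo uses the official ROUGE-1.5.5 binary)."""
    c, r = Counter(candidate), Counter(reference)
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(candidate), overlap / len(reference)
    return 2 * prec * rec / (prec + rec)

def rouge_matrix(doc_sents, summ_sents):
    """Pre-compute pairwise ROUGE scores between every document
    sentence and every summary sentence, so training never has to
    invoke the slow ROUGE tool on the fly."""
    return [[rouge1_f(d, s) for s in summ_sents] for d in doc_sents]
```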
You can simply use the command below to train the DeepChannel model with the default hyperparameters:
```
python train.py --data-path </path/to/the/processed/data> --save-dir </path/to/save/the/model> --offline-pyrouge-index-json </path/to/the/offline/pyrouge/file>
```
You can also try training the model with different hyperparameters:
```
python train.py --data-path </path/to/the/processed/data> --save-dir </path/to/save/the/model> --offline-pyrouge-index-json </path/to/the/offline/pyrouge/file> --SE-type SE_TYPE --word-dim WORD_DIM --hidden-dim HIDDEN_DIM --dropout DROPOUT --margin MARGIN --lr LR --optimizer OPT ...
```
For detailed information about all the hyperparameters, please run the command:
```
python train.py --help
```
We implement three sentence embedding strategies: GRU, Bi-GRU, and average, which can be specified by the `--SE-type` argument. If you want to train the model on a reduced dataset, please specify the `--fraction` argument.
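As a point of reference, the `average` strategy is the simplest of the three: a sentence embedding is just the mean of its word embeddings, whereas the GRU and Bi-GRU strategies run a recurrent encoder over the word sequence. A minimal sketch of the averaging variant (function name is hypothetical, not the repo's API):

```python
import numpy as np

def average_sentence_embedding(word_vectors):
    """The `average` strategy: sentence embedding = mean of the
    sentence's word embeddings along the sequence axis."""
    return np.mean(np.asarray(word_vectors, dtype=np.float32), axis=0)
```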
You can run `summarize.py` to apply the greedy extraction procedure to the test set and evaluate the performance on it. Please ensure the hyperparameters in the test step are consistent with those in the training step. For comparison, we implement several different extraction strategies, which can be specified by the `--method` argument. Typically, you can directly run the following command for a basic evaluation:
```
python summarize.py --data-path </path/to/the/processed/data> --save-dir </path/to/the/saved/model>
```
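The greedy extraction procedure can be sketched as follows: at each step, append the document sentence that maximizes the salience score of the summary-so-far, until the desired number of sentences is reached. This is an illustration under stated assumptions, not the repository's actual `summarize.py`; `salience` here is a hypothetical stand-in for the learned salience network.

```python
def greedy_extract(doc_sents, salience, k=3):
    """Greedily extract up to k sentences. `salience(doc_sents, summary)`
    is assumed to score how well `summary` covers the document
    (a stand-in for the trained salience estimation network)."""
    summary, remaining = [], list(range(len(doc_sents)))
    for _ in range(min(k, len(doc_sents))):
        best = max(remaining,
                   key=lambda i: salience(doc_sents, summary + [doc_sents[i]]))
        summary.append(doc_sents[best])
        remaining.remove(best)
    return summary
```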
We refer to code from the following repositories:
- Preprocessing of the CNN-Daily dataset
- Tensorflow implementation of Pointer-Generator
- Pytorch implementation of Pointer-Generator
- Pytorch implementation of SummaRuNNer
- Tensorflow implementation of Refresh
- pyrouge
- rouge
We appreciate their great contributions!