Code for ACL 2020 paper: Extractive Summarization as Text Matching
- Python 3.7
- PyTorch 1.4.0
- fastNLP 0.5.0
- pyrouge 0.1.3
- You should fill your ROUGE path in metrics.py line 20 before running our code.
- rouge 1.0.0
- Used in the validation phase.
- transformers 2.5.1
All code only supports running on Linux.
We have already processed CNN/DailyMail dataset, you can download it through this link, unzip and move it to ./data
. It contains two versions (BERT/RoBERTa) of the dataset, a total of six files.
In addition, we have released five other processed datasets (WikiHow, PubMed, XSum, MultiNews, Reddit), which you can find here.
We use eight Tesla-V100-16G GPUs to train our model, the training time is about 30 hours. If you do not have enough video memory, you can reduce the batch_size or candidate_num in train_matching.py
, or you can adjust max_len in dataloader.py
.
You can choose BERT or RoBERTa as the encoder of MatchSum, for example, to train a RoBERTa model, you can run the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train_matching.py --mode=train --encoder=roberta --save_path=./roberta --gpus=0,1,2,3,4,5,6,7
After completing the training process, several best checkpoints are stored in a folder named after the training start time, for example, ./roberta/2020-04-12-09-24-51
. You can run the following command to get the results on test set (only one GPU is required for testing):
CUDA_VISIBLE_DEVICES=0 python train_matching.py --mode=test --encoder=roberta --save_path=./roberta/2020-04-12-09-24-51/ --gpus=0
The ROUGE score will be printed on the screen, and the output of the model will be stored in the folder ./roberta/result
.
Test set (the average of three runs)
Model | R-1 | R-2 | R-L |
---|---|---|---|
MatchSum (BERT-base) | 44.22 | 20.62 | 40.38 |
MatchSum (RoBERTa-base) | 44.41 | 20.86 | 40.55 |
The summaries generated by our models on the CNN/DM dataset can be found here. In the version we released, the result of MatchSum(BERT) is 44.26/20.58/40.40 (R-1/R-2/R-L), and the result of MatchSum(RoBERTa) is 44.45/20.88/40.60.
The summaries generated on other datasets can be found here.
Two versions of the pre-trained model on CNN/DM are available here. You can use them through torch.load
. For example,
model = torch.load('MatchSum_cnndm_bert.ckpt')
Besides, the pre-trained models on other datasets can be found here.
If you want to process your own data and get candidate summaries for each document, first you need to convert your dataset to the same jsonl format as ours, and make sure to include text and summary fields. Second, you should use BertExt or other methods to select some important sentences from each document and get an index.jsonl file (we provide an example in ./preprocess/test_cnndm.jsonl
).
Then you can run the following command:
python get_candidate.py --tokenizer=bert --data_path=/path/to/your_original_data.jsonl --index_path=/path/to/your_index.jsonl --write_path=/path/to/store/your_processed_data.jsonl
Please fill your ROUGE path in preprocess/get_candidate.py
line 22 before running this command. It is worth noting that you need to adjust the number of candidate summaries and the number of sentences in the candidate summaries according to your dataset. For details, see line 89-97 in preprocess/get_candidate.py
.
After processing the dataset, and before using our code to train your own model, please adjust candidate_num in train_matching.py
and max_len in dataloader.py
according to the number and the length of the candidate summaries in your dataset.
The code and data released here are used for the matching model. Before the matching stage, we use BertExt to prune meaningless candidate summaries, the implementation of BertExt can refer to PreSumm.