Code and materials for the paper "Abstractive Summarization Guided by Latent Hierarchical Document Structure". Part of our code is borrowed from the fairseq implementation of BART. We recommend first running the BART baseline to get familiar with the whole pipeline.
You first need to install fairseq:
```bash
cd fairseq
pip install --editable ./
```
You then need to download the official bart.large checkpoint, which serves as the backbone for HierGNN-BART:
```bash
wget https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz
tar -xzvf bart.large.tar.gz
rm bart.large.tar.gz
```
Please make sure you are using PyTorch==1.7.
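If you want to sanity-check your environment, a minimal check along these lines works (the exact 1.7.x point release is whichever you installed):

```python
# Minimal sanity check for the PyTorch version requirement stated above.
import torch

assert torch.__version__.startswith("1.7"), (
    f"Expected PyTorch 1.7.x, found {torch.__version__}"
)
```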
You can download the data we used from here.
Alternatively, you can first download the original data (without the source articles split into sentences) from here. We then use `sent_tokenize` from nltk to split each source article into sentences and insert `<cls>` between sentences, with the following command:
```bash
python3 ssplit.py <input-source-file> <output-processed-file>
```
For example,
```bash
python3 ssplit.py cnndm-raw/train.source cnndm-ssplit/train.source
```
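The released ssplit.py handles the details, but the sketch below captures the behaviour described above (split with `sent_tokenize`, join with `<cls>`); exact delimiter handling, e.g. whether a `<cls>` is also prepended to the article, may differ in the actual script:

```python
# Minimal sketch of the sentence-splitting step: split each article into
# sentences with nltk.sent_tokenize and join them with a <cls> token.
# This illustrates the described preprocessing, not the released ssplit.py.
import sys

from nltk import sent_tokenize


def split_file(src_path, out_path):
    with open(src_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            sentences = sent_tokenize(line.strip())
            fout.write(" <cls> ".join(sentences) + "\n")


if __name__ == "__main__":
    split_file(sys.argv[1], sys.argv[2])
```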
Then you can BPE all texts in cnndm-ssplit using hie_bpe.sh:
```bash
TASK=cnndm-ssplit
PROG=fairseq/examples/roberta/multiprocessing_bpe_encoder.py
for SPLIT in train val
do
  for LANG in source target
  do
    python $PROG \
      --encoder-json hie_encoder.json \
      --vocab-bpe vocab.bpe \
      --inputs "$TASK/$SPLIT.$LANG" \
      --outputs "$TASK/$SPLIT.bpe.$LANG" \
      --workers 60 \
      --keep-empty;
  done
done
```
Then binarize the dataset with hie_bin.sh, which produces the binarized data in cnndm-ssplit-bin:
```bash
TASK=cnndm-ssplit
DICT=checkpoints/dict.source.txt
fairseq-preprocess \
  --source-lang "source" \
  --target-lang "target" \
  --trainpref "${TASK}/train.bpe" \
  --validpref "${TASK}/val.bpe" \
  --destdir "${TASK}-bin/" \
  --workers 60 \
  --srcdict $DICT \
  --tgtdict $DICT;
```
The command for training is:
```bash
sh hie_train.sh
```
The command for inference is:
```bash
sh hie_test.sh
```
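hie_test.sh wraps the decoding step. For reference, the sketch below shows how decoding is commonly done with fairseq's BART API; the checkpoint directory, test file paths, batch size, and beam-search settings here are assumptions, and loading a HierGNN checkpoint requires this repo's model code to be importable:

```python
# Illustrative decoding loop using fairseq's BARTModel API (a sketch, not hie_test.sh).
from fairseq.models.bart import BARTModel

bart = BARTModel.from_pretrained(
    "checkpoints/",                       # directory containing the trained checkpoint (assumed)
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="cnndm-ssplit-bin",
)
bart.cuda()
bart.eval()
bart.half()

def decode_batch(batch, fout):
    # Beam-search hyperparameters below are typical CNN/DM values, not necessarily ours.
    hypos = bart.sample(batch, beam=4, lenpen=2.0,
                        max_len_b=140, min_len=55, no_repeat_ngram_size=3)
    fout.write("\n".join(hypos) + "\n")

with open("cnndm-ssplit/test.source") as fin, open("test.hypo", "w") as fout:
    batch = []
    for line in fin:
        batch.append(line.strip())
        if len(batch) == 32:
            decode_batch(batch, fout)
            batch = []
    if batch:
        decode_batch(batch, fout)
```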
For evaluation, we use the ROUGE implementation from google-research, with the following command:
```bash
sh hie_eval.sh
```
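hie_eval.sh wraps the scoring step; the snippet below is a minimal sketch of computing ROUGE with the google-research rouge-score package (pip install rouge-score). The hypothesis and reference file names are placeholders, and we illustrate rougeLsum for summary-level ROUGE-L; the exact variant used in hie_eval.sh may differ:

```python
# Minimal sketch of ROUGE scoring with google-research's rouge-score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

with open("test.hypo") as hyp_f, open("cnndm-ssplit/test.target") as ref_f:
    hyps = [line.strip() for line in hyp_f]
    refs = [line.strip() for line in ref_f]

# Average per-example F1 over the test set.
scores = [scorer.score(ref, hyp) for ref, hyp in zip(refs, hyps)]
for rouge_type in ["rouge1", "rouge2", "rougeLsum"]:
    avg_f = sum(s[rouge_type].fmeasure for s in scores) / len(scores)
    print(f"{rouge_type}: {100 * avg_f:.2f}")
```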
| Dataset | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Checkpoints | Outputs |
| --- | --- | --- | --- | --- | --- | --- |
| CNN/DailyMail | BART | | | | | |
| | HierGNN-BART | | | | | |
| XSum | BART | | | | | |
| | HierGNN-BART | | | | | |
| PubMed | BART | | | | | |
| | HierGNN-BART | | | | | |
```bibtex
@inproceedings{qiu2022hiergnn,
  title={Abstractive Summarization Guided by Latent Hierarchical Document Structure},
  author={Yifu Qiu and Shay Cohen},
  booktitle={The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)},
  year={2022}
}
```