
Abstractive Summarization Guided by Latent Hierarchical Document Structure

Code and materials for the paper "Abstractive Summarization Guided by Latent Hierarchical Document Structure". Part of our code is borrowed from the fairseq implementation of BART. You may want to run the baseline first to get familiar with the whole pipeline.

Basic installation

You first need to install fairseq:

cd fairseq
pip install --editable ./

You then need to download the official bart.large checkpoint, which serves as the backbone for HierGNN-BART:

wget https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz
tar -xzvf bart.large.tar.gz
rm bart.large.tar.gz

Please make sure you are using PyTorch==1.7.
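You can quickly confirm the installed version with a short check like the following (a minimal sketch, assuming a standard PyTorch install):

import torch
# HierGNN-BART was developed against PyTorch 1.7; other versions may break fairseq APIs.
assert torch.__version__.startswith("1.7"), f"Found PyTorch {torch.__version__}, expected 1.7.x"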

Data

Use our data

You can download the processed data we used from here.

Processing the data yourself (using CNN/DailyMail as the example)

Alternatively, you can first download the original data (without the source articles split into sentences) from here. We then use sent_tokenize from nltk to split each source article into sentences and add <cls> between sentences, with the following command:

python3 ssplit.py <input-source-file> <output-processed-file>

For example,

python3 ssplit.py cnndm-raw/train.source cnndm-ssplit/train.source
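For reference, the sentence-splitting step roughly corresponds to the sketch below (a minimal illustration of the logic described above; the actual implementation is in ssplit.py, and the nltk punkt model must be downloaded beforehand):

import sys
from nltk import sent_tokenize  # run nltk.download('punkt') once beforehand

def split_file(src_path, out_path):
    """Split each article (one per line) into sentences joined by <cls>."""
    with open(src_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            sents = sent_tokenize(line.strip())
            # Insert <cls> between sentences so the model can locate sentence boundaries.
            fout.write(" <cls> ".join(sents) + "\n")

if __name__ == "__main__":
    split_file(sys.argv[1], sys.argv[2])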

Then you can BPE-encode all texts in cnndm-ssplit using hie_bpe.sh:

  TASK=cnndm-ssplit
  PROG=fairseq/examples/roberta/multiprocessing_bpe_encoder.py

  for SPLIT in train val
  do
     for LANG in source target
     do
     python $PROG \
           --encoder-json hie_encoder.json \
           --vocab-bpe vocab.bpe \
           --inputs "$TASK/$SPLIT.$LANG" \
           --outputs "$TASK/$SPLIT.bpe.$LANG" \
           --workers 60 \
           --keep-empty;
     done
  done

Then binarize the dataset with hie_bin.sh to obtain the binarized data in cnndm-ssplit-bin:

  TASK=cnndm-ssplit
  DICT=checkpoints/dict.source.txt
  fairseq-preprocess \
     --source-lang "source" \
     --target-lang "target" \
     --trainpref "${TASK}/train.bpe" \
     --validpref "${TASK}/val.bpe" \
     --destdir "${TASK}-bin/" \
     --workers 60 \
     --srcdict $DICT \
     --tgtdict $DICT;

Train

The command for training is:

sh hie_train.sh

Valid/Test

The command for inference is:

sh hie_test.sh
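hie_test.sh wraps the decoding step. If you prefer to decode programmatically, a minimal sketch using fairseq's BART hub interface is shown below. The checkpoint path, data directory, file names, and beam settings are assumptions taken from the standard BART summarization recipe rather than the exact settings of hie_test.sh, and loading the HierGNN architecture requires the model code in this repo to be importable:

from fairseq.models.bart import BARTModel

# Placeholder paths: point these at your trained checkpoint and binarized data.
bart = BARTModel.from_pretrained(
    "checkpoints",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="cnndm-ssplit-bin",
)
bart.cuda()
bart.eval()

batch_size = 32
with open("cnndm-ssplit/test.source") as fin, open("test.hypo", "w") as fout:
    batch = []
    for line in fin:
        batch.append(line.strip())
        if len(batch) == batch_size:
            # Beam settings follow the usual BART CNN/DailyMail recipe.
            hypos = bart.sample(batch, beam=4, lenpen=2.0, max_len_b=140,
                                min_len=55, no_repeat_ngram_size=3)
            fout.write("\n".join(hypos) + "\n")
            batch = []
    if batch:
        hypos = bart.sample(batch, beam=4, lenpen=2.0, max_len_b=140,
                            min_len=55, no_repeat_ngram_size=3)
        fout.write("\n".join(hypos) + "\n")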

Evaluation

For evaluation, we use the ROUGE implementation from google-research, run with the following command:

sh hie_eval.sh
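hie_eval.sh relies on the rouge-score package from google-research. As a rough illustration, scoring a hypothesis file against gold targets can be sketched as follows (file names are placeholders, and hie_eval.sh may apply additional pre-processing such as sentence-splitting for summary-level ROUGE-L):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

# Placeholder file names: decoded hypotheses and gold reference summaries, one per line.
with open("test.hypo") as hyp_f, open("cnndm-ssplit/test.target") as ref_f:
    scores = [scorer.score(ref.strip(), hyp.strip())
              for hyp, ref in zip(hyp_f, ref_f)]

for metric in ("rouge1", "rouge2", "rougeL"):
    avg_f = sum(s[metric].fmeasure for s in scores) / len(scores)
    print(f"{metric}: {100 * avg_f:.2f}")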

Released Checkpoints and Outputs

Dataset         Model          ROUGE-1   ROUGE-2   ROUGE-L   Checkpoints   Outputs
CNN/DailyMail   BART
                HierGNN-BART
XSum            BART
                HierGNN-BART
PubMed          BART
                HierGNN-BART

Citation

@inproceedings{qiu2022hiergnn,
    title={Abstractive Summarization Guided by Latent Hierarchical Document Structure},
    author={Yifu Qiu and Shay Cohen},
    booktitle={The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)},
    year={2022}
}
