Code and materials for the paper "Abstractive Summarization Guided by Latent Hierarchical Document Structure". Part of our code is borrowed from the fairseq implementation of BART. We recommend first running the BART baseline to get familiar with the whole pipeline.
You first need to install fairseq:
```bash
cd fairseq
pip install --editable ./
```
You then need to download the official bart.large checkpoint, which serves as the backbone for HierGNN-BART:
```bash
wget https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz
tar -xzvf bart.large.tar.gz
rm bart.large.tar.gz
```
Please make sure you are using PyTorch==1.7.
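If you want to sanity-check your environment, a minimal check along these lines works (the exact 1.7.x point release is whichever you installed):

```python
# Minimal sanity check for the PyTorch version requirement stated above.
import torch

assert torch.__version__.startswith("1.7"), (
    f"Expected PyTorch 1.7.x, found {torch.__version__}"
)
```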
You can download the data we used from here.
Alternatively, you can first download the original data (without the source articles split into sentences) from here. We then use `sent_tokenize` from nltk to split each source article into sentences and insert `<cls>` between sentences, with the following command:
```bash
python3 ssplit.py <input-source-file> <output-processed-file>
```
For example,
```bash
python3 ssplit.py cnndm-raw/train.source cnndm-ssplit/train.source
```
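The released ssplit.py handles the details, but the sketch below captures the behaviour described above (split with `sent_tokenize`, join with `<cls>`); exact delimiter handling, e.g. whether a `<cls>` is also prepended to the article, may differ in the actual script:

```python
# Minimal sketch of the sentence-splitting step: split each article into
# sentences with nltk.sent_tokenize and join them with a <cls> token.
# This illustrates the described preprocessing, not the released ssplit.py.
import sys

from nltk import sent_tokenize


def split_file(src_path, out_path):
    with open(src_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            sentences = sent_tokenize(line.strip())
            fout.write(" <cls> ".join(sentences) + "\n")


if __name__ == "__main__":
    split_file(sys.argv[1], sys.argv[2])
```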
Then you can BPE all texts in cnndm-ssplit using hie_bpe.sh:
```bash
TASK=cnndm-ssplit
PROG=fairseq/examples/roberta/multiprocessing_bpe_encoder.py
for SPLIT in train val
do
  for LANG in source target
  do
    python $PROG \
      --encoder-json hie_encoder.json \
      --vocab-bpe vocab.bpe \
      --inputs "$TASK/$SPLIT.$LANG" \
      --outputs "$TASK/$SPLIT.bpe.$LANG" \
      --workers 60 \
      --keep-empty;
  done
done
```
Then binarize the dataset with hie_bin.sh, which produces the binarized data in cnndm-ssplit-bin:
```bash
TASK=cnndm-ssplit
DICT=checkpoints/dict.source.txt
fairseq-preprocess \
  --source-lang "source" \
  --target-lang "target" \
  --trainpref "${TASK}/train.bpe" \
  --validpref "${TASK}/val.bpe" \
  --destdir "${TASK}-bin/" \
  --workers 60 \
  --srcdict $DICT \
  --tgtdict $DICT;
```
The command for training is:
```bash
sh hie_train.sh
```
The command for inference is:
```bash
sh hie_test.sh
```
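hie_test.sh wraps the decoding step. For reference, the sketch below shows how decoding is commonly done with fairseq's BART API; the checkpoint directory, test file paths, batch size, and beam-search settings here are assumptions, and loading a HierGNN checkpoint requires this repo's model code to be importable:

```python
# Illustrative decoding loop using fairseq's BARTModel API (a sketch, not hie_test.sh).
from fairseq.models.bart import BARTModel

bart = BARTModel.from_pretrained(
    "checkpoints/",                       # directory containing the trained checkpoint (assumed)
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="cnndm-ssplit-bin",
)
bart.cuda()
bart.eval()
bart.half()

def decode_batch(batch, fout):
    # Beam-search hyperparameters below are typical CNN/DM values, not necessarily ours.
    hypos = bart.sample(batch, beam=4, lenpen=2.0,
                        max_len_b=140, min_len=55, no_repeat_ngram_size=3)
    fout.write("\n".join(hypos) + "\n")

with open("cnndm-ssplit/test.source") as fin, open("test.hypo", "w") as fout:
    batch = []
    for line in fin:
        batch.append(line.strip())
        if len(batch) == 32:
            decode_batch(batch, fout)
            batch = []
    if batch:
        decode_batch(batch, fout)
```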
For evaluation, we use the ROUGE implementation from google-research, with the following command:
```bash
sh hie_eval.sh
```
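hie_eval.sh wraps the scoring step; the snippet below is a minimal sketch of computing ROUGE with the google-research rouge-score package (pip install rouge-score). The hypothesis and reference file names are placeholders, and we illustrate rougeLsum for summary-level ROUGE-L; the exact variant used in hie_eval.sh may differ:

```python
# Minimal sketch of ROUGE scoring with google-research's rouge-score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

with open("test.hypo") as hyp_f, open("cnndm-ssplit/test.target") as ref_f:
    hyps = [line.strip() for line in hyp_f]
    refs = [line.strip() for line in ref_f]

# Average per-example F1 over the test set.
scores = [scorer.score(ref, hyp) for ref, hyp in zip(refs, hyps)]
for rouge_type in ["rouge1", "rouge2", "rougeLsum"]:
    avg_f = sum(s[rouge_type].fmeasure for s in scores) / len(scores)
    print(f"{rouge_type}: {100 * avg_f:.2f}")
```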
| Dataset | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Checkpoints | Outputs |
| --- | --- | --- | --- | --- | --- | --- |
| CNN/DailyMail | BART | | | | | |
| | HierGNN-BART | | | | | |
| XSum | BART | | | | | |
| | HierGNN-BART | | | | | |
| PubMed | BART | | | | | |
| | HierGNN-BART | | | | | |
```bibtex
@inproceedings{qiu2022hiergnn,
  title={Abstractive Summarization Guided by Latent Hierarchical Document Structure},
  author={Yifu Qiu and Shay Cohen},
  booktitle={The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)},
  year={2022}
}
```