New!! The dataset is now available at Hugging Face 🤗
Paper: ACL 2021, Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents
Before you start, there are a few things to familiarise yourself with:
Prerequisites:
- PyTorch
- Fairseq
- Download the pretrained BART-large model
- Get the Emerald dataset
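One quick way to confirm that the two Python prerequisites are installed is to check whether they can be found by the import system. This is a minimal sketch: the helper name `missing_packages` is hypothetical and not part of this repository, and it assumes the packages are importable under their usual names `torch` and `fairseq`.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `torch` and `fairseq` are the import names of PyTorch and Fairseq.
missing = missing_packages(["torch", "fairseq"])
if missing:
    print("Missing prerequisites:", ", ".join(missing))
else:
    print("PyTorch and Fairseq are both importable.")
```

This only checks that the packages are installed; it does not verify versions, the BART-large checkpoint, or the dataset.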
$> python preprocess_data.py
$> bash bpe.sh
$> bash binarize.sh
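The three preprocessing commands above need to run in that order, and a later step is not useful if an earlier one failed. As a rough sketch, they could be chained from Python so that the run stops at the first failure. The function `run_steps` is a hypothetical helper, not part of this repository, and the sketch assumes the scripts take no arguments and sit in the current working directory, as the commands are shown.

```python
import subprocess

# Commands copied from the steps above, in the same order.
STEPS = [
    ["python", "preprocess_data.py"],
    ["bash", "bpe.sh"],
    ["bash", "binarize.sh"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order and return the ones that completed."""
    completed = []
    for cmd in steps:
        # check=True makes subprocess.run raise CalledProcessError
        # on a non-zero exit code, so later steps are skipped.
        runner(cmd, check=True)
        completed.append(cmd)
    return completed
```

Calling `run_steps(STEPS)` from the directory that contains the scripts runs the whole pipeline. The `runner` argument exists only so the function can be exercised without launching real processes.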
For parameters, check the finetune.sh script.
Although we did not find major differences when changing the max_tokens parameter during BART finetuning, the code lets you change it (in the scripts/train.py file) in case you want to experiment.
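For context, and assuming max_tokens here corresponds to Fairseq's `--max-tokens` option, the parameter caps the number of tokens in a batch rather than the number of sentences. The sketch below is a simplified illustration of that idea, not Fairseq's actual batching code: the function `batch_by_token_budget` and the sample lengths are made up, and a batch's cost is counted as its longest sequence times the number of sequences, to account for padding.

```python
def batch_by_token_budget(lengths, max_tokens):
    """Group sample indices into batches whose padded size stays within max_tokens."""
    batches, current, longest = [], [], 0
    # Sorting by length keeps similarly sized samples together and reduces padding.
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i]):
        if lengths[idx] > max_tokens:
            raise ValueError(f"sample {idx} is longer than max_tokens")
        new_longest = max(longest, lengths[idx])
        # Padded cost if this sample joined the current batch.
        if current and new_longest * (len(current) + 1) > max_tokens:
            batches.append(current)
            current, new_longest = [], lengths[idx]
        current.append(idx)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


lengths = [900, 300, 1200, 250, 700]  # made-up token counts per sample
print(batch_by_token_budget(lengths, max_tokens=2048))
print(batch_by_token_budget(lengths, max_tokens=4096))
```

Raising the budget produces fewer, larger batches, which mainly affects memory use and the number of updates per epoch.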
Find more information in the fairseq BART repository.