This is the code for the paper Generating Summaries with Topic Guidance and Structured Convolutional Decoders by Laura Perez-Beltrachini, Yang Liu and Mirella Lapata.
In this repository we include a link to our WikiCatSum dataset and the code for our ConvS2D model. Our code extends an earlier copy of the Facebook AI Research Sequence-to-Sequence Toolkit with a sentence-aware Structured Convolutional Decoder.
Python 3.6.6, Torch 0.4.0
The WikiCatSum dataset is available in this repository and also on HuggingFace datasets (follow this link).
Related scripts are available in the wikicatsum/ directory.
Using the files in the downloaded datasets you can generate data and dictionaries with the following command. You will need to define the variables as convenient.
$TEXT should be the directory containing the source and target texts
$ANNOT is the directory containing the topic model
$SRC_L is the length at which the input sequence of paragraphs is truncated
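To make the truncation length concrete, here is a minimal sketch of one way an input made of several paragraphs could be cut to a fixed token budget. It is an illustration only: the function name `truncate_paragraphs` is hypothetical, and the assumption that truncation counts tokens across paragraphs in order is ours, not something this repository confirms.

```python
def truncate_paragraphs(paragraphs, max_len):
    """Keep paragraphs in order until max_len tokens are used.

    `paragraphs` is a list of token lists. The last kept paragraph is cut
    so that the total number of tokens never exceeds `max_len`.
    (Assumed behaviour for illustration, not the repository's code.)
    """
    kept = []
    remaining = max_len
    for tokens in paragraphs:
        if remaining <= 0:
            break
        kept.append(tokens[:remaining])
        remaining -= len(kept[-1])
    return kept


# Toy input: three short paragraphs and a budget of 4 tokens.
example = truncate_paragraphs([["a", "b", "c"], ["d", "e"], ["f"]], 4)
```

With a budget of 4 tokens, the first paragraph is kept whole, the second is cut to one token, and the third is dropped.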
Pre-process for the hierarchical decoder and topic labels:
python --source-lang src --target-lang tgt \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/$DSTDIR \
--nwordstgt 50000 --nwordssrc 50000 --L $SRC_L \
--addAnnotations $ANNOT/$DOMAIN'.'$NUMTOPICS'.TLDA' --numTopics $NUMTOPICS \
--src-chunk-length 200 --tgt-chunk-length $MAX_TGT_SENT_LEN \
1> data-bin/$DSTDIR/preprocess.log
Use the argument --singleSeq to create source and target as a single long sequence:
python --source-lang src --target-lang tgt \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/$DSTDIR \
--nwordstgt 50000 --nwordssrc 50000 \
--singleSeq --L $SRC_L \
1> data-bin/$DSTDIR/preprocess.log
After preprocessing the files, you can run the training procedures.
CUDA_VISIBLE_DEVICES=$GPUID python data-bin/$DATADIR --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 --arch fconv_wikicatsum --save-dir checkpoints/$MODELNAME --skip-invalid-size-inputs-valid-test --no-progress-bar --task translation --max-target-positions $MAX_TGT_SENT_LEN --max-source-positions $MAX_SRC_POSITIONS --outindices checkpoints/$IDXEXCLDIR/ignoredIndices.log --outindicesValid $OUTDIR$IDXEXCLDIR/valid_ignoredIndices.log 1> 'checkpoints/'$MODELNAME'/train.log'
--outindices and --outindicesValid should point to files that list the indices of excluded instances. You should define the other variables as convenient.
CUDA_VISIBLE_DEVICES=$GPUID python data-bin/$DATADIR --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 --arch fconv_fatte_nokey_wikicatsum --save-dir checkpoints/$MODELNAME --skip-invalid-size-inputs-valid-test --no-progress-bar --task wikicatsum --annotations --max-source-positions $MAX_SRC_POSITIONS --max-target-positions 15 --max-tgt-sentence-length $MAX_TGT_SENT_LEN --criterion cross_entropy --num-topics $NUMKEYS --flatenc --hidemb --normpos --flatdata data-bin/$FLATDATADIR 1> 'checkpoints/'$MODELNAME'/train.log'
--num-topics gives the number of topics in the dataset; it is not used by the encoder-decoder model, only by the data loader.
--flatdata gives the path to the binaries in which the tensors of ids form a single sequence.
CUDA_VISIBLE_DEVICES=$GPUID python data-bin/$DATADIR --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 --arch fconv_fatte_wikicatsum --save-dir checkpoints/$MODELNAME --skip-invalid-size-inputs-valid-test --no-progress-bar --task wikicatsum --annotations --max-source-positions $MAX_SRC_POSITIONS --max-target-positions 15 --max-tgt-sentence-length $MAX_TGT_SENT_LEN --criterion cross_entropy_kpred_1t --num-topics $NUMKEYS --outindices checkpoints/$IDXEXCLDIR/ignoredIndices.log --flatenc --flatdata data-bin/$FLATDATADIR --hidemb --normpos --lambda-keyloss 1 1> 'checkpoints/'$MODELNAME'/train.log'
Generating with obtained models.
CUDA_VISIBLE_DEVICES=2 python data-bin/$DATADIR --path checkpoints/$MODELNAME/ --beam 5 --skip-invalid-size-inputs-valid-test --decode-dir $DECODEDIR --reference-dir $REFDIR --outindices $IDXEXCLDIR/valid_ignoredIndices.log --max-target-positions $MAX_TGT_SENT_LEN --quiet --gen-subset valid 1> $DECODEDIR/generate.log
You can also select the best checkpoint based on ROUGE on the validation set:
export ARG_LIST="--beam 5 --skip-invalid-size-inputs-valid-test --reference-dir $REFDIR --outindices $IDXEXCLDIR/valid_ignoredIndices.log --max-target-positions $MAX_TGT_SENT_LEN --quiet "
--data-dir data-bin/$DATADIR \
--model-dir checkpoints/$MODELNAME \
--reference-dir $REFDIR \
CUDA_VISIBLE_DEVICES=$GPUID python data-bin/$DATADIR --keywords-embed-path data-bin/$DATADIR/train_keyEmbeddings.txt --path checkpoints/$MODELNAME/ --batch-size 5 --beam 5 --skip-invalid-size-inputs-valid-test --decode-dir $DECODEDIR --reference-dir $REFDIR --task wikicatsum --annotations --max-source-positions $MAX_SRC_POSITIONS --max-target-positions 15 --max-tgt-sentence-length $MAX_TGT_SENT_LEN --quiet --gen-subset valid --flatenc --target-raw-text --sepahypo --naive --ngram 3 1> $DECODEDIR'/generate.log'
--target-raw-text will generate the references formatted as needed for the ROUGE scripts. For this, you will need to place the file containing the summaries (e.g. valid.tgt) in the same directory as the binaries (e.g. data-bin/$DATADIR).
You can also select the best checkpoint based on ROUGE on the validation set:
export ARG_LIST="--keywords-embed-path data-bin/$DATADIR/train_keyEmbeddings.txt --batch-size 7 --beam 5 --skip-invalid-size-inputs-valid-test --reference-dir $REFDIR --task wikicatsum --annotations --max-source-positions $MAX_SRC_POSITIONS --max-target-positions 15 --max-tgt-sentence-length $MAX_TGT_SENT_LEN --quiet --gen-subset valid --flatenc --sepahypo --naive --ngram 3 "
--data-dir data-bin/$DATADIR \
--model-dir checkpoints/$MODELNAME \
--reference-dir $REFDIR
CUDA_VISIBLE_DEVICES=3 python data-bin/$DATADIR --keywords-embed-path data-bin/$DATADIR/train_keyEmbeddings.txt --path checkpoints/$MODELNAME/ --batch-size 5 --beam 5 --skip-invalid-size-inputs-valid-test --decode-dir $DECODEDIR --reference-dir $REFDIR --task wikicatsum --annotations --max-source-positions $MAX_SRC_POSITIONS --max-target-positions 15 --max-tgt-sentence-length $MAX_TGT_SENT_LEN --quiet --gen-subset valid --flatenc --sepahypo --naive --ngram 3 --keystop --num-topics 40 1> $DECODEDIR/generate.log
Checkpoint selection based on ROUGE is similar to that of ConvS2D.
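The idea behind ROUGE-based checkpoint selection can be sketched in a few lines: score each checkpoint on the validation set, then keep the one with the highest value of a chosen metric. The snippet below is a simplified illustration, not the selection script of this repository. The function name `select_best_checkpoint`, the checkpoint file names, the metric key and the scores are all hypothetical placeholders.

```python
def select_best_checkpoint(scores, metric="rouge_l_f_score"):
    """Return the name of the checkpoint with the highest `metric` value.

    `scores` maps a checkpoint name to a dict of ROUGE scores.
    """
    if not scores:
        raise ValueError("no checkpoint scores given")
    return max(scores, key=lambda name: scores[name][metric])


# Hypothetical validation scores, made up for illustration only.
example_scores = {
    "checkpoint1.pt": {"rouge_l_f_score": 0.31},
    "checkpoint2.pt": {"rouge_l_f_score": 0.35},
    "checkpoint3.pt": {"rouge_l_f_score": 0.33},
}
best = select_best_checkpoint(example_scores)
```

Choosing on the validation subset rather than the test subset keeps the test scores an unbiased estimate of the selected model's quality.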
Evaluation with ROUGE is based on the pyrouge package.
Rouge evaluation scripts are adapted from here.
will save files for ROUGE evaluation.
Install pyrouge (pip install pyrouge), clone its repository, and set the ROUGE environment variable to the script within your pyrouge directory.
If you get a WordNet db error, proceed as explained here.
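A missing or mistyped ROUGE variable is an easy mistake to make, so a quick check before launching the evaluation can save time. The following is a minimal sketch that only verifies the variable is set and points to an existing path. The helper name `check_rouge_home` is hypothetical and not part of this repository or of pyrouge.

```python
import os


def check_rouge_home(env=None):
    """Return a list of problems found with the ROUGE variable (empty if none)."""
    env = os.environ if env is None else env
    path = env.get("ROUGE")
    if not path:
        return ["ROUGE is not set"]
    if not os.path.exists(path):
        return ["ROUGE does not point to an existing path: " + path]
    return []


if __name__ == "__main__":
    # Print one warning per problem; print nothing when the setup looks fine.
    for problem in check_rouge_home():
        print("warning:", problem)
```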
You can run the following to get ROUGE scores on the models' outputs:
export ROUGE=$HOME/pyrouge/tools/ROUGE-1.5.5/
python evaluation/ --rouge --decode_dir $DECODEDIR
To compute the additional Abstract and Copy metrics on models' outputs use the following command:
python evaluation/ --decode-dir $DECODEDIR
is the directory that contains source (.src) and reference (.tgt) text files of the dataset.
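For intuition about what abstraction and copy measures capture, the sketch below computes the fraction of summary n-grams that do not occur in the source text. This is a generic illustration under our own assumptions: it is not the definition used by the evaluation script in this repository, and the function names `ngrams` and `novel_ngram_fraction` are hypothetical.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_fraction(source_tokens, summary_tokens, n=2):
    """Fraction of summary n-grams that never appear in the source.

    Higher values suggest a more abstractive summary; 1 minus this value
    gives the fraction of n-grams copied from the source.
    """
    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    source_set = set(ngrams(source_tokens, n))
    novel = sum(1 for gram in summary_ngrams if gram not in source_set)
    return novel / len(summary_ngrams)


# Toy example: two of the five summary bigrams are not in the source.
source = "the cat sat on the mat".split()
summary = "the cat lay on the mat".split()
fraction = novel_ngram_fraction(source, summary, n=2)
```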