This repository contains the code for generating pseudo summary data using the ExtraPhrase method as proposed in the paper:
ExtraPhrase: Efficient Data Augmentation for Abstractive Summarization
Mengsay Loem, Sho Takase, Masahiro Kaneko, and Naoaki Okazaki
In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, pages 16–24, Hybrid: Seattle, Washington + Online. Association for Computational Linguistics.
- Gigaword: data
- CNN/DailyMail: data
- Installation: Make sure you have Python 3.x and spaCy installed. You can install spaCy by running `pip install spacy`.
- Download Language Model: Download the English language model for spaCy by running `python -m spacy download en_core_web_sm`.
Here is an example script to run the extractive summarization:
```shell
python extractive_summarization.py \
    --input_file dummy_data/input.json \
    --depth_ratio 0.5 \
    --group_tokens \
    --output_file dummy_data/step1_output.json
```
- `--input_file`: Path to the input text file in JSON format.
- `--depth_ratio`: Ratio of tree depth to prune.
- `--group_tokens`: Optional flag to group some token nodes before pruning.
- `--output_file`: Path to the output file.
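To give an intuition for what `--depth_ratio` controls, here is a minimal, self-contained sketch of depth-based pruning over a dependency tree. It is illustrative only, not the repository's actual implementation: tokens are paired with a hypothetical `heads` list (the index of each token's syntactic head, as a spaCy parse would provide), and tokens whose depth exceeds `depth_ratio` times the tree's maximum depth are dropped.

```python
# Illustrative sketch of depth-ratio pruning (hypothetical helper,
# not the code in extractive_summarization.py).

def depth(heads, i):
    """Distance from token i to the root (the root's head is itself)."""
    d = 0
    while heads[i] != i:
        i = heads[i]
        d += 1
    return d

def prune_by_depth(tokens, heads, depth_ratio):
    """Keep only tokens whose tree depth is within depth_ratio * max depth."""
    depths = [depth(heads, i) for i in range(len(tokens))]
    cutoff = depth_ratio * max(depths)
    return [t for t, d in zip(tokens, depths) if d <= cutoff]

# Toy sentence with a hand-written head list (root is "arrested"):
tokens = ["police", "arrested", "two", "men", "on", "friday"]
heads = [1, 1, 3, 1, 1, 4]
print(prune_by_depth(tokens, heads, 0.5))
# → ['police', 'arrested', 'men', 'on']
```

A smaller `--depth_ratio` prunes more aggressively, keeping only tokens close to the root of the parse and yielding a shorter extractive summary.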
- Installation: Make sure you have the Hugging Face Transformers library installed. You can install it by running `pip install transformers`.
Here is an example script to run the paraphrasing:
```shell
python paraphrasing.py \
    --input_file dummy_data/step1_output.json \
    --src_tgt_model facebook/wmt19-en-de \
    --tgt_src_model facebook/wmt19-de-en \
    --output_file dummy_data/step2_output.json \
    --use_gpu
```
- `--input_file`: Path to the input text file generated from Step 1.
- `--src_tgt_model`: Path or name of the model for translation from source language to target language.
- `--tgt_src_model`: Path or name of the model for translation from target language back to source language.
- `--output_file`: Path to the output file.
- `--use_gpu`: Optional flag to enable GPU usage for inference.
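The paraphrasing step is round-trip (back-) translation: each sentence is translated into a pivot language with the `--src_tgt_model` and back with the `--tgt_src_model`. The control flow can be sketched as below; to stay runnable without downloading the WMT19 model weights, the two translators are hypothetical stand-in functions, not the actual Hugging Face models named in the command above.

```python
# Round-trip translation sketch. In the real pipeline, src_to_tgt and
# tgt_to_src would wrap translation models such as facebook/wmt19-en-de
# and facebook/wmt19-de-en; here they are dictionary-backed stand-ins.

def paraphrase(sentences, src_to_tgt, tgt_to_src):
    """Translate each sentence into the pivot language and back again."""
    pivot = [src_to_tgt(s) for s in sentences]
    return [tgt_to_src(s) for s in pivot]

# Hypothetical stand-ins for the two translation models:
fake_en_de = {"the cat sat": "die katze sass"}
fake_de_en = {"die katze sass": "the cat was sitting"}

result = paraphrase(
    ["the cat sat"],
    src_to_tgt=fake_en_de.get,
    tgt_to_src=fake_de_en.get,
)
print(result)
# → ['the cat was sitting']
```

Because translation is lossy in surface form, the round trip tends to return a sentence with the same meaning but different wording, which is exactly the diversity the pseudo summaries need.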