NAACL2022-REFLECT

Code for the paper: Improving Multi-Document Summarization through Referenced Flexible Extraction with Credit-Awareness

Authors: @Yun-Zhu Song, @Yi-Syuan Chen, Hong-Han Shuai

The preprocessed datasets and pretrained model will be released soon.

Referenced Environment Setup

pip install -r requirements.txt

Dataset Preparation

Option 1.

Steps: (1) download the dataset; (2) get the pseudo extraction oracle and ROUGE score for each document sentence; (3) generate summaries with the fine-tuned abstractor; (4) merge the generated summaries into the dataset. The dataset names can be found in src/data/build_datasets.py.

(1) download dataset

Multi-News

python ./data_download/output_dataset.py \
  --output_dir ../datasets/origin/multi_news \
  --dataset_name multi_news

Multi-XScience

python ./data_download/output_dataset.py \
  --output_dir ../datasets/origin/xscience \
  --dataset_name multi_x_science_sum

WikiCatSum (NOTE: The version of transformers is 4.12.5)

python ./data_download/output_dataset.py \
  --output_dir ../datasets/origin/wikicatsum/animal \
  --dataset_name GEM/wiki_cat_sum \
  --dataset_config animal

python ./data_download/output_dataset.py \
  --output_dir ../datasets/origin/wikicatsum/company \
  --dataset_name GEM/wiki_cat_sum \
  --dataset_config company

python ./data_download/output_dataset.py \
  --output_dir ../datasets/origin/wikicatsum/film \
  --dataset_name GEM/wiki_cat_sum \
  --dataset_config film
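
The five download commands above differ only in their arguments, so they can be generated from one table. The following is a minimal sketch of such a driver, not part of the repository: the helper names `build_download_command` and `all_download_commands` are hypothetical, while the script path, flags, dataset names, and output directories are copied from the commands above.

```python
import shlex
from typing import List, Optional

# (dataset_name, dataset_config, output_dir) triples taken from the commands above.
DOWNLOAD_JOBS = [
    ("multi_news", None, "../datasets/origin/multi_news"),
    ("multi_x_science_sum", None, "../datasets/origin/xscience"),
    ("GEM/wiki_cat_sum", "animal", "../datasets/origin/wikicatsum/animal"),
    ("GEM/wiki_cat_sum", "company", "../datasets/origin/wikicatsum/company"),
    ("GEM/wiki_cat_sum", "film", "../datasets/origin/wikicatsum/film"),
]


def build_download_command(
    dataset_name: str, output_dir: str, dataset_config: Optional[str] = None
) -> List[str]:
    """Build the argument list for one call to output_dataset.py."""
    cmd = [
        "python", "./data_download/output_dataset.py",
        "--output_dir", output_dir,
        "--dataset_name", dataset_name,
    ]
    # Only WikiCatSum needs a config (animal, company, or film).
    if dataset_config is not None:
        cmd += ["--dataset_config", dataset_config]
    return cmd


def all_download_commands() -> List[List[str]]:
    """Return one command per dataset listed in DOWNLOAD_JOBS."""
    return [build_download_command(name, out, cfg) for name, cfg, out in DOWNLOAD_JOBS]


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run(cmd, check=True) to execute.
    for cmd in all_download_commands():
        print(shlex.join(cmd))
```

Returning argument lists rather than shell strings avoids quoting problems if a path ever contains spaces.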

(2) get the pseudo extraction oracle (taking multi_news as an example)

./scripts/build_POR_dataset.sh
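
To illustrate what a pseudo extraction oracle with per-sentence scores can look like, here is a minimal sketch of the common greedy construction. It is an assumption about the general technique, not a description of what build_POR_dataset.sh does internally: `unigram_f1` is a simplified stand-in for ROUGE-1 F1, and `greedy_oracle` and `max_sents` are hypothetical names.

```python
from collections import Counter
from typing import List, Tuple


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified stand-in for ROUGE-1 F1 over lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(
    sentences: List[str], summary: str, max_sents: int = 3
) -> Tuple[List[int], List[float]]:
    """Return (oracle sentence indices, score of each sentence against the summary)."""
    sent_scores = [unigram_f1(s, summary) for s in sentences]
    selected: List[int] = []
    best = 0.0
    while len(selected) < max_sents:
        # Score every remaining sentence when appended to the current selection.
        candidates = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            text = " ".join(sentences[j] for j in selected + [i])
            candidates.append((unigram_f1(text, summary), i))
        if not candidates:
            break
        score, idx = max(candidates)
        # Stop as soon as no sentence improves the combined score.
        if score <= best:
            break
        selected.append(idx)
        best = score
    return sorted(selected), sent_scores
```

The per-sentence scores are returned alongside the hard oracle labels because step (2) above asks for both.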

(3) generate summaries with the fine-tuned abstractor. Remember to set $checkpoint_to_finetuned_abs and $dataset for each dataset.

./scripts/generate_SR.sh

(4) merge the generated summaries into the dataset. Remember to set $merged_data_dir, $data_dir, $path_to_generated_summary_train_file, $path_to_generated_summary_val_file, and $path_to_generated_summary_test_file for each dataset.

./scripts/build_SR_dataset.sh
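
Conceptually, the merge step attaches one generated summary to each example of a split. The sketch below shows that idea on in-memory records. It assumes the generated summaries are aligned one-to-one and in order with the dataset examples, which this README does not state; `merge_generated_summaries` is a hypothetical helper, and only the column name `summary_gen` comes from the argument description in this README.

```python
from typing import Dict, List


def merge_generated_summaries(
    records: List[Dict], generated: List[str], column: str = "summary_gen"
) -> List[Dict]:
    """Return copies of the records with the generated summary added as a new column."""
    if len(records) != len(generated):
        raise ValueError(
            f"{len(records)} records but {len(generated)} generated summaries"
        )
    # Build new dicts so the original records are left untouched.
    return [{**rec, column: summ.strip()} for rec, summ in zip(records, generated)]


if __name__ == "__main__":
    train = [{"document": "doc A", "summary": "ref A"}]
    print(merge_generated_summaries(train, ["generated A\n"]))
```

The length check matters in practice: a silent misalignment between a split and its generated file would pair every example with the wrong reference summary.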

Option 2. Download Our Processed Dataset

Please place the dataset at datasets/ext_oracle/ or change the dataset directory path in src/data/build_datasets.py.

Multi-News, Xscience, WikiCatSum

Trained Model

Dataset	Finetuned Abstractor	Pretrained (REFLECT-MLE)	Final (REFLECT)
Multi-News	Bart-Base-Oracle, Bart-Large-Oracle	download	download

Predictions

Dataset	BART-Large	REFLECT
WikiCatSum	Animal, Company, Film	Animal, Company, Film

Training

1. Abstractor Training

There are four configs for the abstractor.

Model Size	Input Type
BART Base	Oracle
BART Base	Article
BART Large	Oracle
BART Large	Article

How to switch between configs

dataset_name	Oracle Text Column	Article Text Column
multi_news_bl_own	summary_ext	document
xscience_bl_own	summary_ext	document

python main.py ./scripts/args/finetine_abs.json
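
Switching between the Oracle and Article input types comes down to choosing a text column from the table above. The following sketch shows one way to derive an argument dictionary per config. It is an illustration only: the key `text_column` and the helper `abstractor_args` are hypothetical, and whether `dataset_name` is an actual key of the JSON argument file is an assumption based on the table header. Check the argument file for the real key names.

```python
from typing import Dict

# Text column for each input type, copied from the table above.
TEXT_COLUMNS = {"oracle": "summary_ext", "article": "document"}
DATASET_NAMES = ("multi_news_bl_own", "xscience_bl_own")


def abstractor_args(base_args: Dict, dataset_name: str, input_type: str) -> Dict:
    """Return a copy of base_args adjusted for one dataset and input type."""
    if dataset_name not in DATASET_NAMES:
        raise ValueError(f"unknown dataset_name: {dataset_name}")
    if input_type not in TEXT_COLUMNS:
        raise ValueError(f"input_type must be one of {sorted(TEXT_COLUMNS)}")
    args = dict(base_args)
    args["dataset_name"] = dataset_name  # assumed key, taken from the table header
    args["text_column"] = TEXT_COLUMNS[input_type]  # hypothetical key name
    return args


if __name__ == "__main__":
    base = {"task_type": "seq2seq", "training_type": "mle"}
    print(abstractor_args(base, "multi_news_bl_own", "oracle"))
```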

2. Extractor Pretraining

python main.py ./scripts/args/train_ext_mle.json

3. Extractor Training

python main.py ./scripts/args/train_ext_rl.json

4. Model Evaluation

python main.py ./scripts/args/pred.json
python main.py ./scripts/args/eval.json

Argument Description

Arguments for switching between abstractor training and extractor training

"task_type": "seq2seq" for abstractor. "two_stage_extraction" for extractor.
"training_type": "mle" for abstractor finetuning. "ext_mle" for extractor pretraining. "ext_rl" for extractor training.
"data_preprocess": "doc_trun" for abstractor. "doc_trun_and_build_sent_index" for extractor.

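The three arguments above always change together, so each training stage corresponds to one fixed combination. As a minimal sketch, the combinations can be written down as a lookup table with a small check that a loaded argument dictionary matches exactly one stage. The stage labels and the helper `stage_of` are hypothetical; the keys and values are the ones listed above.

```python
from typing import Dict

# One entry per training stage, using the values documented above.
STAGE_ARGS: Dict[str, Dict[str, str]] = {
    "abstractor_finetuning": {
        "task_type": "seq2seq",
        "training_type": "mle",
        "data_preprocess": "doc_trun",
    },
    "extractor_pretraining": {
        "task_type": "two_stage_extraction",
        "training_type": "ext_mle",
        "data_preprocess": "doc_trun_and_build_sent_index",
    },
    "extractor_training": {
        "task_type": "two_stage_extraction",
        "training_type": "ext_rl",
        "data_preprocess": "doc_trun_and_build_sent_index",
    },
}


def stage_of(args: Dict) -> str:
    """Return the stage whose three settings all match args, or raise ValueError."""
    for stage, expected in STAGE_ARGS.items():
        if all(args.get(key) == value for key, value in expected.items()):
            return stage
    raise ValueError("task_type, training_type and data_preprocess are inconsistent")
```

A check like this catches mixed settings, for example "task_type": "seq2seq" combined with "training_type": "ext_rl", before a long training run starts.
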
Arguments for extractor only

"summary_ext_column": "summary_ext"

Arguments for training our extractor:

"ext_model_name_or_path": The model name or path that provides the extractor config. default: deepset/roberta-base-squad2.
"different_base_model_for_two_stage": Set to true when the extractor config and abstractor config differ. default: true.
"load_trained_abstractor_from": The model path of the finetuned abstractor.
"load_trained_extractor_from": The model path of the pretrained extractor.
"train_only": The name of the module to train. default: "extractor".

Arguments for model configuration:

"score_cls_weighting": Whether to adopt Pseudo Oracle Relaxation (POR), true or false.
"reference_extraction": Whether to adopt Summary Referencing (SR), true or false. If true, "reference_column" must be set to the column holding the pregenerated summary.
"reference_column": The dataset column holding the pregenerated summary. Only used when "reference_extraction" is true. default: "summary_gen".
"num_hierarchical_layer": Number of hierarchical layers in the extractor; 0 means a flat structure. Used in main.py to control loading of the pretrained model. default: 3.

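The extractor and model-configuration arguments above can be collected into one dictionary with their documented defaults. The sketch below does that and enforces the one dependency stated above, namely that "reference_extraction" needs a "reference_column". The helper `extractor_args` is hypothetical, and the check on negative layer counts is an added assumption.

```python
import json
from typing import Any, Dict

# Defaults as documented in the argument descriptions above.
EXTRACTOR_DEFAULTS: Dict[str, Any] = {
    "ext_model_name_or_path": "deepset/roberta-base-squad2",
    "different_base_model_for_two_stage": True,
    "train_only": "extractor",
    "reference_column": "summary_gen",
    "num_hierarchical_layer": 3,
}


def extractor_args(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the documented defaults and validate the result."""
    args = {**EXTRACTOR_DEFAULTS, **overrides}
    # Summary Referencing needs a column that holds the pregenerated summary.
    if args.get("reference_extraction") and not args.get("reference_column"):
        raise ValueError("reference_extraction is true but reference_column is empty")
    # Assumption: a negative layer count is never meaningful (0 means flat).
    if args["num_hierarchical_layer"] < 0:
        raise ValueError("num_hierarchical_layer must be >= 0")
    return args


if __name__ == "__main__":
    # Enable both POR and SR on top of the defaults and print the result as JSON.
    print(json.dumps(extractor_args(score_cls_weighting=True, reference_extraction=True), indent=2))
```
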
Arguments for reinforcement learning:

"use_mixer_loss": Whether to include the MLE loss alongside the RL loss. default: true.
"mixer_weight": The weight for mixing the MLE and RL losses. default: 0.1.
"update_full_action": Whether to update the full action sequence (true) or only the outputs whose sampled action differs from the greedy action (false). false for CASC, true for SC.

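To make the effect of "update_full_action" and "mixer_weight" concrete, here is a minimal pure-Python sketch of a self-critical loss with an optional mask on the steps where the sampled action differs from the greedy one. It rests on assumptions and is not the repository's implementation: the function names are hypothetical, the reward is a scalar per sequence, and the mixing formula `mixer_weight * mle + (1 - mixer_weight) * rl` is a guess at one common convention, since this README does not state it.

```python
from typing import Sequence


def self_critical_loss(
    log_probs: Sequence[float],
    sampled: Sequence[int],
    greedy: Sequence[int],
    sample_reward: float,
    greedy_reward: float,
    update_full_action: bool,
) -> float:
    """Policy-gradient loss with the greedy rollout as the baseline."""
    if not (len(log_probs) == len(sampled) == len(greedy)):
        raise ValueError("log_probs, sampled and greedy must have the same length")
    advantage = sample_reward - greedy_reward
    if update_full_action:
        # SC: every sampled step receives the gradient signal.
        mask = [1.0] * len(sampled)
    else:
        # CASC: only steps where the sample deviates from the greedy action.
        mask = [1.0 if s != g else 0.0 for s, g in zip(sampled, greedy)]
    return -advantage * sum(m * lp for m, lp in zip(mask, log_probs))


def mixed_loss(
    mle_loss: float, rl_loss: float, mixer_weight: float = 0.1, use_mixer_loss: bool = True
) -> float:
    """Combine MLE and RL losses (the mixing formula is an assumption)."""
    if not use_mixer_loss:
        return rl_loss
    return mixer_weight * mle_loss + (1.0 - mixer_weight) * rl_loss
```

With the mask applied, steps on which the sampled and greedy rollouts agree contribute nothing to the loss, so credit for a reward difference goes only to the actions that could have caused it.
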