Code for the paper: Improving Multi-Document Summarization through Referenced Flexible Extraction with Credit-Awareness
Author: @Yun-Zhu Song, @Yi-Syuan Chen,Hong-Han Shuai
The preprocessed datasets and pretrained model will be released soon.
pip install -r requirements.txt
Steps: (1) download the dataset; (2) get the pseudo extractio oracle and rouge score for each document sentence; (3) generate summary from the fine-tuned abstractor; (4) merge the generated summary to the dataset. The names of datasets can be found in src/data/build_datasets.py.
(1) download dataset
Multi-News
python ./data_download/output_dataset.py\
--output_dir ../datasets/origin/multi_news\
--dataset_name multi_news\
Milti-XScience
python ./data_download/output_dataset.py\
--output_dir ../datasets/origin/xscience\
--dataset_name multi_x_science_sum\
WikiCatSum (NOTE: The version of transformers is 4.12.5)
python ./data_download/output_dataset.py\
--output_dir ../datasets/origin/wikicatsum/animal\
--dataset_name GEM/wiki_cat_sum\
--dataset_config animal\
python ./data_download/output_dataset.py\
--output_dir ../datasets/origin/wikicatsum/company\
--dataset_name GEM/wiki_cat_sum\
--dataset_config company\
python ./data_download/output_dataset.py\
--output_dir ../datasets/origin/wikicatsum/film\
--dataset_name GEM/wiki_cat_sum\
--dataset_config film\
(2) get pseudo extraction (take multi_news as examples)
./scripts/build_POR_dataset.sh
(3) generate summary from fine-tuned abstractor, remember to assign the $checkpoint_to_finetuned_abs and $dataset according to different datasets
./scripts/generate_SR.sh
(4) combining the generated summary to dataset, remember to assign $merged_data_dir, $data_dir, $path_to_generated_summary_train_file, $path_to_generated_summary_val_file, $path_to_generated_summary_test_file according to different datasets
./scripts/build_SR_dataset.sh
Please place the dataset at datasets/ext_oracle/ or change the dataset directory path in src/data/build_datasets.py.
Multi-News, Xscience, WikiCatSum
Dataset | Finetuned Abstractor | Pretrained (REFLECT-MLE) | Final (REFLECT) |
---|---|---|---|
Multi-News | Bart-Base-Oracle, Bart-Large-Oracle | download | download |
Dataset | BART-Large | REFLECT |
---|---|---|
WikiCatSum | Animal, Company, Film | Animal, Company, Film |
There are 4 different configs for abstractor.
Model Size | Input Type |
---|---|
BART Base | Oracle |
BAET Base | Article |
BART Large | Oracle |
BAET Large | Article |
How to change to different configs
dataset_name | Oracle Text Column | Article Text Column |
---|---|---|
multi_news_bl_own | summary_ext | document |
xscience_bl_own | summary_ext | document |
python main.py ./scirpts/args/finetine_abs.json
python main.py ./scripts/args/train_ext_mle.json
python main.py ./scripts/args/train_ext_rl.json
python main.py ./scripts/args/pred.json
python main.py ./scripts/args/eval.json
Arguments for switching between abstractor training or extractor training
"task_type": "seq2seq" for abstractor. "two_stage_extraction" for extractor.
"training_type": "mle" for abstractor finetuning. "ext_mle" for extractor pretraining. "ext_rl" for extractor training.
"data_preprocess": "doc_trun" for abstractor. "doc_trun_and_build_sent_index" for extractor.
Arguments for extractor only
"summary_ext_column": "summary_ext"
Arguments for training our extractor:
"ext_model_name_or_path": Specify the model name or path to give the extractor config. default: deepset/roberta-base-squad2.
"different_base_model_for_two_stage": Specify true when the extractor config and abstractor config are different. default: true.
"load_trained_abstractor_from": Specify the model path for finetuned abstractor.
"load_trained_extractor_from": Specify the model path for pretrained extractor.
"train_only": Specify module name for training the module. default:"extractor"
Arguments for model configuration:
"score_cls_weighting": whether to adopt Peudo Oracle Relaxation (POR), true or false.
"reference_extraction": wether to adopt Summary Referencing (SR), true or false. If true, need to assign the "reference_column" to the column of pregenearted summary.
"reference_column": Assign the column of pregenerated summary in dataset. Only activate when \"reference_extraction\" is true. default: "summary_gen".
""num_hierarchical_layer"": Number of hierarchical layers in extractor, 0 means flat structure for controlling loading pretrained model. Used in main.py. default:3.
Arguments for reinforcement learning:
"use_mixer_loss": Whether to consider the MLE loss. dedault: true.
"mixer_weight": The weight for mixing the MLE and RL loss. default: 0.1.
"update_full_action": Wether to update the full action or only update the output with the sampled action that are different from the greedy action. false for CASC, true for SC.