Code and Model for "Proposition-Level Clustering for Multi-Document Summarization" paper
--This repository is still under construction--
supervised_oie_wrapper
directory is a wrapper over AllenNLP's (v0.9.0) pretrained Open IE model that was implemented by Gabriel Stanovsky. It was forked from here, and edited for our purpose.
You are welcome to try our demo. Look for the Multi-Doc Summary by Ernst et al
skill.
-
Download the trained models from here, and put them in 'models' directory.
-
Put your data in
data\<DATASET>\
directory. (For exampledata\DUC2004\
) -
Install
requirements.txt
(python 3.6) -
Create similarity matrix by SuperPAL (to be used for the clustering step):
a. Clone SuperPAL repository.
b. Move files from
SuperPAL
folder in this repository to the newSuperPAL
repository.c. Follow the steps that appear in SuperPAL repository under 'Alignment model' section.
Instead of step 2, run:
python main_predict_inDoc.py -data_path <DATA_PATH> -output_path <OUT_DIR_PATH> -alignment_model_path <ALIGNMENT_MODEL_PATH>
[Optional] 5. Follow this repository to install the official ROUGE measure.
- Extract all Open Information Extraction (OIE) spans from the source documents:
python extract_OIEs.py
- Prepare the data for the Salience model:
python DataGenSalientIU_DUC_allAlignments_CDLM.py
- Predict salience score for each OIE span:
cd transformers
python run_glue_highlighter.py --model_name_or_path <MODEL_PATH> --train_file <DATA_CSV_FILE_PATH> --validation_file <DATA_CSV_FILE_PATH> --do_predict --evaluation_strategy steps --eval_steps 250 --save_steps 250 --max_seq_length 4096 --gradient_accumulation_steps 3 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 1e-5 --num_train_epochs 3 --output_dir <OUTPUT_DIR>
[Optional] 3*. Cluster salient spans, rank clusters, and select the most salient span to represent each cluster: ("Salience_prop + Clustering" model in Sec 4.3 in the paper)
python deriveSummaryDUC.py
- Cluster salient spans and prepare data for the Fusion model:
python prepare_fusion_data.py --salience_pred_file <SALIENCE_MODEL_SCORES_FILE> --output_summ_dir <OUTPUT_SUMMARY_DIRECTORY_PATH> --data_path <RAW_DATA_PATH> --sim_mat_path <SIM_MAT_PATH>
- Generate a fused sentence from every cluster:
cd <PATH_TO_YOUR_TRANSFORMERS_DIR>\examples\seq2seq\
python finetune_trainer.py --model_name_or_path=<MODEL_PATH> --learning_rate=3e-5 --do_predict --num_train_epochs=4 --evaluation_strategy steps --predict_with_generate --eval_steps=50 --per_device_train_batch_size=10 --per_device_eval_batch_size=10 --max_source_length=265 --eval_beams=6 --max_target_length=30 --val_max_target_length=30 --test_max_target_length=30 --data_dir <DATA_CSV_FILE_PATH> --output_dir <OUTPUT_DIR>
- Concatinate the fused sentences, and calculate final ROUGE scores:
python deriveSummaryDUC_fusion_clusters.py