Data, Code and Model for the paper "Summary-Source Proposition-level Alignment: Task, Datasets and Supervised Baseline".
If you find the code useful, please cite the following paper.
@inproceedings{ernst-etal-2021-summary,
    title = "Summary-Source Proposition-level Alignment: Task, Datasets and Supervised Baseline",
    author = "Ernst, Ori and Shapira, Ori and Pasunuru, Ramakanth and Lepioshkin, Michael and Goldberger, Jacob and Bansal, Mohit and Dagan, Ido",
    booktitle = "Proceedings of the 25th Conference on Computational Natural Language Learning",
    month = nov,
    year = "2021",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.conll-1.25",
    pages = "310--322",
}
You can use our Hugging Face model or try our demo here.
The run_glue.py script was forked from Hugging Face Transformers v2.5.1 and edited for our purposes.
The supervised_oie_wrapper directory is a wrapper over AllenNLP's (v0.9.0) pretrained Open IE model, which was implemented by Gabriel Stanovsky. It was forked from here and edited for our purposes.
This repository was developed with Python 3.6. Please refer to environment_superPAL.yml for the other requirements.
All manual datasets are under the manual_datasets directory, including the crowdsourced dev and test sets and the Pyramid-based train set.
Because the DUC-based datasets are restricted by the LDC agreement, we provide here only the character indices of all propositions and sentences, not their text.
To restore the text alignments please use:
python manual_datasets/restore_alignments.py -indx_csv_path <PATH_TO_THE_CSV_WITH_ALIGNMENTS_INDEXES> -documents_path <PATH_TO_THE_DOCUMENTS_ARANGED_BY_TOPIC_DIRECTORIES> -summaries_path <SUMMARIES_PATH> -output_file <ALIGNMENTS_OUTPUT_FILE_PATH>
If you run into any issue when regenerating the DUC alignments, please contact us by email.
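The idea of releasing character indices instead of text can be illustrated with a minimal sketch. It is not the code in restore_alignments.py: the function name `restore_text`, the `(start, end)` span representation with an exclusive end, and the " ... " separator are all assumptions made for illustration, and the example text is invented. Use the script above for the released CSV files.

```python
from typing import List, Tuple


def restore_text(source: str, spans: List[Tuple[int, int]], sep: str = " ... ") -> str:
    """Rebuild a sentence or proposition from character offsets into `source`.

    Each span is a hypothetical (start, end) pair with an exclusive end.
    A proposition may consist of several non-contiguous spans, which are
    joined with `sep`.
    """
    pieces = []
    for start, end in spans:
        # Reject offsets that do not fit the text, e.g. a wrong document version.
        if not 0 <= start < end <= len(source):
            raise ValueError(f"span ({start}, {end}) is outside the source text")
        pieces.append(source[start:end])
    return sep.join(pieces)


# Invented document text standing in for a licensed DUC document.
document = "The storm hit the coast on Monday. Thousands lost power."

sentence = restore_text(document, [(0, 34)])
proposition = restore_text(document, [(0, 13), (27, 33)])

print(sentence)
print(proposition)
```

The bounds check matters in practice: offsets are only meaningful against the exact document files they were computed from, so a mismatch should fail loudly rather than return the wrong text.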
MultiNews alignments are released in full.
Predicted alignments of MultiNews and CNN/DailyMail train and val datasets can be found here.
To apply the alignment model to your own data, follow these steps:
- Download the trained model here.
- Run
python main_predict.py -data_path <DATA_PATH> -output_path <OUT_DIR_PATH> -alignment_model_path <ALIGNMENT_MODEL_PATH>
<DATA_PATH> should contain the following structure, where a summary and its related document directory share the same name:
- <DATA_PATH>
    - summaries
        - A.txt
        - B.txt
        - ...
    - A
        - doc_A1
        - doc_A2
        - ...
    - B
        - doc_B1
        - doc_B2
        - ...
- It will create two files in <OUT_DIR_PATH>:
    - 'dev.tsv' - contains all alignment candidate pairs.
    - a '.csv' file - contains all predicted aligned pairs with their classification score.
- To use the alignment model on your own data with different properties, you can inherit from the docSum2MRPC_Aligner class and override the relevant functions.
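Before running main_predict.py, it can help to confirm that <DATA_PATH> follows the layout described in the steps above. The sketch below is not part of this repository; the function name `pair_summaries_with_documents` is hypothetical, and it only assumes what the layout states: a `summaries` directory with one `.txt` file per topic, and a sibling directory with the same name holding that topic's documents.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def pair_summaries_with_documents(data_path: Path) -> Dict[str, List[Path]]:
    """Map each summary name (e.g. 'A') to the files in its same-named directory."""
    summaries_dir = data_path / "summaries"
    if not summaries_dir.is_dir():
        raise FileNotFoundError(f"missing directory: {summaries_dir}")

    pairs: Dict[str, List[Path]] = {}
    for summary_file in sorted(summaries_dir.glob("*.txt")):
        topic = summary_file.stem  # 'A.txt' -> 'A'
        doc_dir = data_path / topic
        if not doc_dir.is_dir():
            raise FileNotFoundError(f"no document directory for summary {summary_file.name}")
        documents = sorted(p for p in doc_dir.iterdir() if p.is_file())
        if not documents:
            raise ValueError(f"document directory is empty: {doc_dir}")
        pairs[topic] = documents
    return pairs


# Build a throwaway example layout and check it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "summaries").mkdir()
    for topic in ("A", "B"):
        (root / "summaries" / f"{topic}.txt").write_text("summary text")
        (root / topic).mkdir()
        for i in (1, 2):
            (root / topic / f"doc_{topic}{i}").write_text("document text")

    layout = {
        topic: [doc.name for doc in docs]
        for topic, docs in pair_summaries_with_documents(root).items()
    }

print(layout)
```

Failing early on a missing or empty document directory avoids starting a prediction run that would only fail, or silently skip a topic, later on.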