summarus


Abstractive and extractive summarization models, mostly for the Russian language. Built on top of AllenNLP.

You can also check out the MBART-based Russian summarization model on Hugging Face: mbart_ru_sum_gazeta

Based on the following papers:

Contacts

Prerequisites

pip install -r requirements.txt

Commands

train.sh

Script for training a model, based on the AllenNLP `train` command.

| Argument | Required | Description |
|:---------|:---------|:------------|
| -c | true | path to file with configuration |
| -s | true | path to directory where model will be saved |
| -t | true | path to train dataset |
| -v | true | path to val dataset |
| -r | false | recover from checkpoint |
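
As a usage illustration, the flags above can be assembled into a command line from Python. This is only a sketch: the paths are placeholders, `build_train_command` is a hypothetical helper that is not part of summarus, and it assumes `-r` is a bare switch that takes no value.

```python
import shlex
from typing import List


def build_train_command(config: str, model_dir: str, train: str, val: str,
                        recover: bool = False) -> List[str]:
    """Assemble an argv list for train.sh (hypothetical helper, not part of summarus)."""
    argv = ["./train.sh", "-c", config, "-s", model_dir, "-t", train, "-v", val]
    if recover:
        # Assumption: -r is a switch without a value.
        argv.append("-r")
    return argv


# Placeholder paths; substitute your own files.
command = build_train_command("config.json", "models/my_model", "train.jsonl", "val.jsonl")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root to start training.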

predict.sh

Script for model evaluation. The test dataset should have the same format as the train dataset.

| Argument | Required | Default | Description |
|:---------|:---------|:--------|:------------|
| -t | true | | path to test dataset |
| -m | true | | path to tar.gz archive with model |
| -p | true | | name of Predictor |
| -c | false | 0 | CUDA device |
| -L | true | | Language ("ru" or "en") |
| -b | false | 32 | size of a batch with test examples to run simultaneously |
| -M | false | | path to meteor.jar for Meteor metric |
| -T | false | | tokenize gold and predicted summaries before metrics calculation |
| -D | false | | save temporary files with gold and predicted summaries |

summarus.util.train_subword_model

Script for subword model training.

| Argument | Default | Description |
|:---------|:--------|:------------|
| --train-path | | path to train dataset |
| --model-path | | path to directory where generated subword model will be saved |
| --model-type | bpe | type of subword model, see sentencepiece |
| --vocab-size | 50000 | size of the resulting subword model vocabulary |
| --config-path | | path to file with configuration for DatasetReader (with parse_set) |
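
The `--model-type` and `--vocab-size` options correspond to standard sentencepiece trainer settings. The sketch below shows roughly what such training looks like when sentencepiece is called directly. It is not the code of `summarus.util.train_subword_model`, which reads the dataset through a DatasetReader; here the input is assumed to be a plain-text file with one document per line, and the helper names and the output file prefix are illustrative.

```python
import os
from typing import Any, Dict


def subword_trainer_options(text_path: str, model_dir: str,
                            model_type: str = "bpe",
                            vocab_size: int = 50000) -> Dict[str, Any]:
    """Collect keyword arguments for sentencepiece training (illustrative helper)."""
    return {
        "input": text_path,
        # sentencepiece writes <model_prefix>.model and <model_prefix>.vocab
        "model_prefix": os.path.join(model_dir, model_type),
        "model_type": model_type,
        "vocab_size": vocab_size,
    }


def train_subword_model(text_path: str, model_dir: str, **overrides: Any) -> None:
    """Train a subword model by calling sentencepiece directly."""
    # Imported lazily; requires `pip install sentencepiece`.
    import sentencepiece as spm

    os.makedirs(model_dir, exist_ok=True)
    options = subword_trainer_options(text_path, model_dir, **overrides)
    spm.SentencePieceTrainer.train(**options)
```

Calling `train_subword_model("train.txt", "subword_model")` would then use the same defaults as the table above (`bpe`, 50000).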

Headline generation

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m ria_pgn_24kk.tar.gz -p subwords_summary -L ru 

Results:

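Train dataset: RIA, test dataset: RIA

| Model | R-1-f | R-2-f | R-L-f | BLEU |
|:------|:------|:------|:------|:-----|
| ria_copynet_10kk | 40.0 | 23.3 | 37.5 | - |
| ria_pgn_24kk | 42.3 | 25.1 | 39.6 | - |
| ria_mbart | 42.8 | 25.5 | 39.9 | - |
| First Sentence | 24.1 | 10.6 | 16.7 | - |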
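
For readers unfamiliar with the column names: R-1-f and R-2-f appear to be the F-measures of ROUGE-1 and ROUGE-2, reported on a 0-100 scale. The sketch below computes a simplified ROUGE-N F-measure on lowercased, whitespace-split tokens. It is not the evaluation code behind the numbers in this README, and `ngrams` and `rouge_n_f` are illustrative names.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f(reference: str, prediction: str, n: int = 1) -> float:
    """Simplified ROUGE-N F-measure: clipped n-gram overlap, no stemming."""
    ref = ngrams(reference.lower().split(), n)
    pred = ngrams(prediction.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge_n_f("the cat sat on the mat", "the cat lay on the mat")
print(f"ROUGE-1 F: {100 * score:.1f}")  # prints "ROUGE-1 F: 83.3"
```

Five of the six unigrams match in both directions, so precision, recall and F are all 5/6.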

Train dataset: RIA, eval dataset: Lenta

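| Model | R-1-f | R-2-f | R-L-f | BLEU |
|:------|:------|:------|:------|:-----|
| ria_copynet_10kk | 25.6 | 12.3 | 23.0 | - |
| ria_pgn_24kk | 26.4 | 12.3 | 24.0 | - |
| ria_mbart | 30.3 | 14.5 | 27.1 | - |
| First Sentence | 25.5 | 11.2 | 19.2 | - |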
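
The "First Sentence" row is presumably a baseline that uses the opening sentence of the article as the headline. A minimal sketch of that idea follows, assuming a naive punctuation-based splitter; real news text, especially Russian text with abbreviations and initials, would likely need a dedicated sentence segmenter. `first_sentence` is an illustrative name, not a summarus function.

```python
import re


def first_sentence(text: str) -> str:
    """Return the text up to the first sentence-ending punctuation mark."""
    # Naive rule: split once after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)
    return parts[0]


article = "Prices rose again in May. Analysts expect the trend to continue."
print(first_sentence(article))  # prints "Prices rose again in May."
```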

Summarization - CNN/DailyMail

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m cnndm_pgn_25kk.tar.gz -p words_summary -L en -R

Results:

| Model | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
|:------|:------|:------|:------|:-------|:-----|
| cnndm_pgn_25kk | 38.5 | 16.5 | 33.4 | 17.6 | - |

Summarization - Gazeta, Russian news dataset

Models:

Prediction scripts:

./predict.sh -t <path_to_test_dataset> -m gazeta_pgn_7kk.tar.gz -p subwords_summary -L ru -T
./predict.sh -t <path_to_test_dataset> -m gazeta_summarunner_3kk.tar.gz -p subwords_summary_sentences -L ru -T

External models:

Results:

| Model | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
|:------|:------|:------|:------|:-------|:-----|
| gazeta_pgn_7kk | 29.4 | 12.7 | 24.6 | 21.2 | 9.0 |
| gazeta_pgn_7kk_cov | 29.8 | 12.8 | 25.4 | 22.1 | 10.1 |
| gazeta_pgn_25kk | 29.6 | 12.8 | 24.6 | 21.5 | 9.3 |
| gazeta_pgn_words_13kk | 29.4 | 12.6 | 24.4 | 20.9 | 8.9 |
| gazeta_summarunner_3kk | 31.6 | 13.7 | 27.1 | 26.0 | 11.5 |
| gazeta_mbart | 32.6 | 14.6 | 28.2 | 25.7 | 12.4 |
| gazeta_mbart_lower | 32.7 | 14.7 | 28.3 | 25.8 | 12.5 |

Demo

python demo/server.py --include-package summarus --model-dir <model_dir> --host <host> --port <port>

Citations

Headline generation (PGN):

@article{Gusev2019headlines,
    author={Gusev, I.O.},
    title={Importance of copying mechanism for news headline generation},
    journal={Komp'juternaja Lingvistika i Intellektual'nye Tehnologii},
    year={2019},
    volume={2019-May},
    number={18},
    pages={229--236}
}

Headline generation (transformers):

@InProceedings{Bukhtiyarov2020headlines,
    author={Bukhtiyarov, Alexey and Gusev, Ilya},
    title="Advances of Transformer-Based Models for News Headline Generation",
    booktitle="Artificial Intelligence and Natural Language",
    year="2020",
    publisher="Springer International Publishing",
    address="Cham",
    pages={54--61},
    isbn="978-3-030-59082-6",
    doi={10.1007/978-3-030-59082-6_4}
}

Summarization:

@InProceedings{Gusev2020gazeta,
    author="Gusev, Ilya",
    title="Dataset for Automatic Summarization of Russian News",
    booktitle="Artificial Intelligence and Natural Language",
    year="2020",
    publisher="Springer International Publishing",
    address="Cham",
    pages="122--134",
    isbn="978-3-030-59082-6",
    doi={10.1007/978-3-030-59082-6_9}
}