This is a model that uses BERT (see the paper "Pretraining-Based Natural Language Generation for Text Summarization") for abstractive text summarization, one of the NLP (Natural Language Processing) tasks.
- Python 3.6.5+
- PyTorch 0.4.1+
- TensorFlow
- Pandas
- tqdm
- NumPy
- MeCab
- TensorBoardX and others...
All packages used here can be installed with pip as follows:

```
pip install -r requirement.txt
```
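For reference, a requirement.txt covering the packages above might look like the following. This is a sketch only: the package name mecab-python3 and the lack of version pins are assumptions, and the file shipped with the repository is authoritative.

```
# torch 0.4.1 is provided by the Docker image described below
pandas
tqdm
numpy
tensorflow
tensorboardX
mecab-python3
```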
If you train the model on a GPU, it is easiest to use a PyTorch Docker image from Docker Hub.
In this work, pytorch/pytorch:0.4.1-cuda9-cudnn7-devel (2.62 GB) was used.
When you use it, start a container as sketched below and then follow the steps after that.
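A minimal sketch of pulling and starting the image (the `--runtime=nvidia` flag and the `/workspace` mount point are assumptions and depend on your Docker/NVIDIA setup):

```
# Pull the PyTorch image used in this work
docker pull pytorch/pytorch:0.4.1-cuda9-cudnn7-devel

# Start a container with GPU access, mounting the current repository at /workspace
docker run --runtime=nvidia -it -v $(pwd):/workspace pytorch/pytorch:0.4.1-cuda9-cudnn7-devel /bin/bash
```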
- Make a directory named /data/checkpoint under the root directory, and put the BERT model file, the vocabulary file, and the config file for BERT there. These files can be downloaded here. (A sketch of how these files are typically loaded is shown after the directory structure below.)
- Put the data file for training and validation under /workspace/data/. The format is as follows:
```python
import torch

data = {
    'settings': opt,                # preprocessing options
    'dict': {
        'src': text2token,          # source vocabulary (token -> index)
        'tgt': text2token},         # target vocabulary (shared with source)
    'train': {
        'src': content[:100000],    # article bodies for training
        'tgt': summary[:100000]},   # reference summaries for training
    'valid': {
        'src': content[100000:],    # article bodies for validation
        'tgt': summary[100000:]}}   # reference summaries for validation
torch.save(data, opt.save_data)
```
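For a quick sanity check, the saved file can be loaded back and inspected. A minimal sketch, assuming the file is saved as data/preprocessed_data.data (the path used in the directory structure and training command below):

```python
import torch

# Load the preprocessed data produced by the snippet above
data = torch.load('data/preprocessed_data.data')

print(len(data['dict']['src']))       # vocabulary size
print(len(data['train']['src']))      # number of training examples
print(data['valid']['src'][0][:20])   # first tokens of the first validation article
```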
The overall directory structure is as follows:
```
`-- data                          # under /workspace
    |-- checkpoint
    |   |-- bert_config.json      # BERT config file
    |   |-- pytorch_model.bin     # BERT model file
    |   `-- vocab.txt             # vocabulary file
    `-- preprocessed_data.data    # train and valid data file
```
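The three files under data/checkpoint follow the layout expected by the pytorch-pretrained-bert package (an assumption based on the PyTorch 0.4.1 requirement; the loading code in this repository may differ). A minimal sketch of loading them:

```python
# Sketch assuming the pytorch-pretrained-bert package (pip install pytorch-pretrained-bert);
# the repository's own loading code may differ.
from pytorch_pretrained_bert import BertModel, BertTokenizer

bert_dir = 'data/checkpoint/'

# Reads vocab.txt from the checkpoint directory
tokenizer = BertTokenizer.from_pretrained(bert_dir)

# Reads bert_config.json and pytorch_model.bin from the checkpoint directory
bert = BertModel.from_pretrained(bert_dir)

tokens = tokenizer.tokenize('A short example sentence.')
ids = tokenizer.convert_tokens_to_ids(tokens)
```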
The model and training settings are as follows:

Hyperparameter | Value |
---|---|
Encoder | BERT |
Decoder | Transformer (decoder only) |
Embed dimension | 768 |
Hidden dimension | 3072 |
Encoder layers | 12 |
Decoder layers | 8 |
Optimizer | Adam |
Learning rate | init=0.0001 |
Warmup steps | 4000 |
Input max length | 512 |
Batch size | 4 |
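The warmup entry suggests a learning-rate schedule in the style of "Attention Is All You Need". Below is a minimal sketch of Adam with such a warmup; the placeholder model, the beta values, and the exact scaling formula are assumptions, and the repository's own optimizer wrapper may differ:

```python
import torch
import torch.optim as optim

d_model = 768          # embedding dimension (see table above)
n_warmup_steps = 4000  # warmup steps (see table above)
init_lr = 1e-4         # initial learning rate (see table above)

model = torch.nn.Linear(d_model, d_model)  # placeholder for the actual summarization model
optimizer = optim.Adam(model.parameters(), lr=init_lr, betas=(0.9, 0.98), eps=1e-9)

def update_learning_rate(step):
    """Linear warmup for n_warmup_steps, then inverse square-root decay."""
    scale = (d_model ** -0.5) * min(step ** -0.5, step * n_warmup_steps ** -1.5)
    for group in optimizer.param_groups:
        group['lr'] = init_lr * scale

# Call once per training step, e.g. update_learning_rate(step=1), update_learning_rate(step=2), ...
```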
Train the model with:

```
python train.py -data data/preprocessed_data.data -bert_path data/checkpoint/ -proj_share_weight -label_smoothing -batch_size 4 -epoch 10 -save_model trained -save_mode best
```
Generate summaries with the trained model:

```
python summarize.py -model data/checkpoint/trained/trained.chkpt -src data/preprocessed_data.data -vocab data/checkpoint/vocab.txt -output pred.txt
```
- Evaluate the model with metrics such as ROUGE-N (a sketch is given after this list)
- Add some examples
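For reference, ROUGE scores could be computed with the `rouge` pip package, comparing pred.txt against the reference summaries. This is an assumption, not part of the repository, and the reference file name below is hypothetical:

```python
# Sketch using the `rouge` pip package (pip install rouge); not part of this repository.
from rouge import Rouge

with open('pred.txt') as f:
    hypotheses = [line.strip() for line in f]

with open('references.txt') as f:   # hypothetical file of reference summaries, one per line
    references = [line.strip() for line in f]

scores = Rouge().get_scores(hypotheses, references, avg=True)
print(scores['rouge-1'], scores['rouge-2'], scores['rouge-l'])
```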
- This repository structure and much of the code are borrowed from jadore801120/attention-is-all-you-need-pytorch.