PyTorch code for a Paraphrase Question Generator. This codebase is built upon the paper Learning Semantic Sentence Embeddings using Sequential Pair-wise Discriminator.
If you want the exact code used for the experiments, head over to the orig-code
branch. This codebase provides the main models from the above paper in a more readable and usable form.
- Install an Anaconda or Miniconda distribution based on Python 3+ from their downloads page.
- Clone this repository and create an environment:
git clone https://www.github.com/dev-chauhan/PQG-pytorch
cd PQG-pytorch
conda env create -f environment.yml
# activate the environment
conda activate PQG
- After that, install tensorboardX for logging:
pip install tensorboardX
You can either use the following files directly by downloading them into the `data` folder, or generate them by following the process shown below.
Download all the data files from here.
- quora_data_prepro.h5
- quora_data_prepro.json
- quora_raw_train.json
- quora_raw_val.json
- quora_raw_test.json
We referred to neuraltalk2 and Text-to-Image Synthesis while preparing this codebase. The first thing you need to do is download the Quora Question Pairs dataset from the Quora Question Pairs website and put it in the `data` folder.
If you want to train from scratch, continue reading. If you just want to evaluate using a pretrained model, head over to the Datafiles section, download the data files (put them all in the `data` folder), and run `score.py` to evaluate the pretrained model.
Now we need to do some preprocessing. Head over to the `prepro` folder and run:
$ cd prepro
$ python quora_prepro.py
Note: The above command generates JSON files with 100K question pairs for the train set, 5K for validation, and 30K for the test set.
If you instead want to use only 50K question pairs for training, keeping the rest the same, you need to make some minor changes in the above file. After this step, it will generate `quora_raw_train.json`, `quora_raw_val.json`, and `quora_raw_test.json` under the `data` folder.
$ python prepro_quora.py --input_train_json ../data/quora_raw_train.json --input_test_json ../data/quora_raw_test.json
This will generate two files in the `data/` folder: `quora_data_prepro.h5` and `quora_data_prepro.json`.
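To sanity-check the preprocessing output, you can inspect both files. Here is a minimal sketch (assuming `h5py` is available in the environment, and making no assumptions about the exact dataset names inside the files):

```python
import json
import h5py

# List every top-level entry in the preprocessed HDF5 file with its shape
# (assuming a flat file of datasets).
with h5py.File('data/quora_data_prepro.h5', 'r') as f:
    for name, dset in f.items():
        print(name, dset.shape, dset.dtype)

# The JSON file holds metadata; print its top-level keys.
with open('data/quora_data_prepro.json') as f:
    meta = json.load(f)
print(list(meta.keys()))
```

To train a model, run: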
$ ./train.sh <name of model> --n_epoch <number of epochs>
You can change the training and validation dataset lengths by adding the arguments `--train_dataset_len` and `--val_dataset_len`, which default to 100000 and 30000 respectively; these defaults are also the maximum values.
There are other arguments to experiment with as well, such as `--batch_size`, `--learning_rate`, `--drop_prob_lm`, etc. You can resume training using the `--start_from` argument, which takes the path of a saved model; example commands follow.
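For example, a full training command combining these options might look like this (the model name `EDLPS` is taken from the results below; the flag values are illustrative, not tuned recommendations):

$ ./train.sh EDLPS --n_epoch 20 --train_dataset_len 50000 --batch_size 32 --learning_rate 0.0002

and to resume a previous run from a saved checkpoint:

$ ./train.sh EDLPS --n_epoch 30 --start_from save/<unique name>/<epoch number>.tar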
Before training, you have to create the empty directories `save`, `samples`, and `logs`.
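For example, from the repository root:

$ mkdir save samples logs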
For each training run there will be a directory with a unique name in `save`. The model is saved at the end of each epoch as a `.tar` file named `<epoch number>` in that directory.
The `samples` directory contains a subdirectory with the same unique name as above, holding one `.txt` file per epoch, `<epoch number>_train.txt` or `<epoch number>_val.txt`, with the paraphrases generated by the model at the end of that epoch on the training or validation set respectively.
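If you want to inspect a saved checkpoint outside the training scripts, here is a minimal sketch (the path is a placeholder, and the exact contents of the checkpoint depend on the training code):

```python
import torch

# Load the checkpoint on CPU; replace the placeholders with a real run name and epoch.
checkpoint = torch.load('save/<unique name>/<epoch number>.tar', map_location='cpu')

# Checkpoints are typically dictionaries of tensors and metadata; list what was saved.
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))
else:
    print(type(checkpoint))
```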
Logs for training and evaluation are stored in the `logs` directory, which you can view with `tensorboard` by running the following command:
tensorboard --logdir <path of logs directory>
This command will tell you where to view the logs in your browser, commonly `localhost:6006`, but you can change the port using the `--port` argument of the above command.
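For example, to serve this repository's `logs` directory on port 6007:

tensorboard --logdir logs --port 6007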
Following are the results for the 100K Quora question pairs dataset for some models.
Name of model | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | ROUGE_L | METEOR | CIDEr |
---|---|---|---|---|---|---|---|
EDL | 0.4162 | 0.2578 | 0.1724 | 0.1219 | 0.4191 | 0.3244 | 0.6189 |
EDLPS | 0.4754 | 0.3160 | 0.2249 | 0.1672 | 0.4781 | 0.3488 | 1.0949 |
Following are the results for the 50K Quora question pairs dataset for some models.
Name of model | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | ROUGE_L | METEOR | CIDEr |
---|---|---|---|---|---|---|---|
EDL | 0.3877 | 0.2336 | 0.1532 | 0.1067 | 0.3913 | 0.3133 | 0.4550 |
EDLPS | 0.4553 | 0.2981 | 0.2105 | 0.1560 | 0.4583 | 0.3421 | 0.9690 |
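For reference, `Bleu_n` is an n-gram precision metric. The sketch below shows how such scores can be computed with `nltk`; it is only an illustration of the metric, not the evaluation code used in `score.py`:

```python
from nltk.translate.bleu_score import corpus_bleu

# One (reference paraphrase, generated paraphrase) pair, pre-tokenized.
references = [[['what', 'is', 'the', 'best', 'way', 'to', 'learn', 'python']]]
hypotheses = [['what', 'is', 'the', 'best', 'way', 'to', 'study', 'python']]

# Bleu_n averages log n-gram precisions uniformly over 1..n-grams.
for n in range(1, 5):
    weights = (1.0 / n,) * n
    print('Bleu_%d: %.4f' % (n, corpus_bleu(references, hypotheses, weights=weights)))
```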
If you use this code as part of any published research, please acknowledge the following papers:
@inproceedings{patro2018learning,
title={Learning Semantic Sentence Embeddings using Sequential Pair-wise Discriminator},
author={Patro, Badri Narayana and Kurmi, Vinod Kumar and Kumar, Sandeep and Namboodiri, Vinay},
booktitle={Proceedings of the 27th International Conference on Computational Linguistics},
pages={2715--2729},
year={2018}
}
@article{PATRO2021149,
title = {Revisiting paraphrase question generator using pairwise discriminator},
author = {Badri N. Patro and Dev Chauhan and Vinod K. Kurmi and Vinay P. Namboodiri},
journal = {Neurocomputing},
volume = {420},
pages = {149--161},
year = {2021},
issn = {0925-2312},
doi = {https://doi.org/10.1016/j.neucom.2020.08.022},
url = {https://www.sciencedirect.com/science/article/pii/S0925231220312820}
}