PyTorch code for a Paraphrase Question Generator. This codebase is built upon the paper Learning Semantic Sentence Embeddings using Sequential Pair-wise Discriminator.
If you want the exact code used for the experiments, head over to the orig-code
branch. This codebase provides the main models from the above paper in a more readable and usable form.
- Install an Anaconda or Miniconda distribution based on Python 3+ from their downloads page.
- Clone this repository and create an environment:
git clone https://www.github.com/dev-chauhan/PQG-pytorch
cd PQG-pytorch
conda env create -f environment.yml
# activate the environment
conda activate PQG
- After that, install tensorboardX for logging:
pip install tensorboardX
You can either use the following files directly by downloading them into the `data` folder, or generate them by following the process shown below.
Download all the data files from here.
- quora_data_prepro.h5
- quora_data_prepro.json
- quora_raw_train.json
- quora_raw_val.json
- quora_raw_test.json
We referred to neuraltalk2 and Text-to-Image Synthesis while preparing this codebase. The first thing you need to do is download the Quora Question Pairs dataset from the Quora Question Pairs website and put it in the `data` folder.
If you want to train from scratch, continue reading. If you just want to evaluate using a pretrained model, head over to the Datafiles section, download the data files (put them all in the `data` folder), and run `score.py` to evaluate the pretrained model.
Now we need to do some preprocessing. Head over to the `prepro` folder and run:
$ cd prepro
$ python quora_prepro.py
Note: The above command generates JSON files with 100K question pairs for the train set, 5K for validation, and 30K for the test set.
If you instead want to use only 50K question pairs for training, keeping the rest the same, you need to make some minor changes in the above file. After this step, it will generate `quora_raw_train.json`, `quora_raw_val.json`, and `quora_raw_test.json` under the `data` folder.
$ python prepro_quora.py --input_train_json ../data/quora_raw_train.json --input_test_json ../data/quora_raw_test.json
This will generate two files in the `data/` folder: `quora_data_prepro.h5` and `quora_data_prepro.json`.
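To sanity-check the preprocessing output, you can inspect both files. Here is a minimal sketch (assuming `h5py` is available in the environment, and making no assumptions about the exact dataset names inside the files):

```python
import json
import h5py

# List every top-level entry in the preprocessed HDF5 file with its shape
# (assuming a flat file of datasets).
with h5py.File('data/quora_data_prepro.h5', 'r') as f:
    for name, dset in f.items():
        print(name, dset.shape, dset.dtype)

# The JSON file holds metadata; print its top-level keys.
with open('data/quora_data_prepro.json') as f:
    meta = json.load(f)
print(list(meta.keys()))
```

To train a model, run: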
$ ./train.sh <name of model> --n_epoch <number of epochs>
You can change the training and validation dataset lengths by adding the arguments `--train_dataset_len` and `--val_dataset_len`, which default to 100000 and 30000 respectively; these defaults are also the maximum values.
There are other arguments to experiment with as well, such as `--batch_size`, `--learning_rate`, `--drop_prob_lm`, etc. You can resume training using the `--start_from` argument, which takes the path of a saved model; example commands follow.
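For example, a full training command combining these options might look like this (the model name `EDLPS` is taken from the results below; the flag values are illustrative, not tuned recommendations):

$ ./train.sh EDLPS --n_epoch 20 --train_dataset_len 50000 --batch_size 32 --learning_rate 0.0002

and to resume a previous run from a saved checkpoint:

$ ./train.sh EDLPS --n_epoch 30 --start_from save/<unique name>/<epoch number>.tar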
Before training, you have to create the empty directories `save`, `samples`, and `logs`.
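For example, from the repository root:

$ mkdir save samples logs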
For each training run there will be a directory with a unique name in `save`. The model is saved at the end of each epoch as a `.tar` file named `<epoch number>` in that directory.
The `samples` directory contains a subdirectory with the same unique name as above, holding one `.txt` file per epoch, `<epoch number>_train.txt` or `<epoch number>_val.txt`, with the paraphrases generated by the model at the end of that epoch on the training or validation set respectively.
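If you want to inspect a saved checkpoint outside the training scripts, here is a minimal sketch (the path is a placeholder, and the exact contents of the checkpoint depend on the training code):

```python
import torch

# Load the checkpoint on CPU; replace the placeholders with a real run name and epoch.
checkpoint = torch.load('save/<unique name>/<epoch number>.tar', map_location='cpu')

# Checkpoints are typically dictionaries of tensors and metadata; list what was saved.
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))
else:
    print(type(checkpoint))
```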
Logs for training and evaluation are stored in the `logs` directory, which you can view with `tensorboard` by running the following command:
tensorboard --logdir <path of logs directory>
This command will tell you where to view the logs in your browser, commonly `localhost:6006`, but you can change the port using the `--port` argument of the above command.
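For example, to serve this repository's `logs` directory on port 6007:

tensorboard --logdir logs --port 6007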
Following are the results for the 100K Quora question pairs dataset for some models.
Name of model | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | ROUGE_L | METEOR | CIDEr |
---|---|---|---|---|---|---|---|
EDL | 0.4162 | 0.2578 | 0.1724 | 0.1219 | 0.4191 | 0.3244 | 0.6189 |
EDLPS | 0.4754 | 0.3160 | 0.2249 | 0.1672 | 0.4781 | 0.3488 | 1.0949 |
Following are the results for the 50K Quora question pairs dataset for some models.
Name of model | Bleu_1 | Bleu_2 | Bleu_3 | Bleu_4 | ROUGE_L | METEOR | CIDEr |
---|---|---|---|---|---|---|---|
EDL | 0.3877 | 0.2336 | 0.1532 | 0.1067 | 0.3913 | 0.3133 | 0.4550 |
EDLPS | 0.4553 | 0.2981 | 0.2105 | 0.1560 | 0.4583 | 0.3421 | 0.9690 |
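For reference, `Bleu_n` is an n-gram precision metric. The sketch below shows how such scores can be computed with `nltk`; it is only an illustration of the metric, not the evaluation code used in `score.py`:

```python
from nltk.translate.bleu_score import corpus_bleu

# One (reference paraphrase, generated paraphrase) pair, pre-tokenized.
references = [[['what', 'is', 'the', 'best', 'way', 'to', 'learn', 'python']]]
hypotheses = [['what', 'is', 'the', 'best', 'way', 'to', 'study', 'python']]

# Bleu_n averages log n-gram precisions uniformly over 1..n-grams.
for n in range(1, 5):
    weights = (1.0 / n,) * n
    print('Bleu_%d: %.4f' % (n, corpus_bleu(references, hypotheses, weights=weights)))
```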
If you use this code as part of any published research, please acknowledge the following papers:
@inproceedings{patro2018learning,
title={Learning Semantic Sentence Embeddings using Sequential Pair-wise Discriminator},
author={Patro, Badri Narayana and Kurmi, Vinod Kumar and Kumar, Sandeep and Namboodiri, Vinay},
booktitle={Proceedings of the 27th International Conference on Computational Linguistics},
pages={2715--2729},
year={2018}
}
@article{PATRO2021149,
title = {Revisiting paraphrase question generator using pairwise discriminator},
author = {Badri N. Patro and Dev Chauhan and Vinod K. Kurmi and Vinay P. Namboodiri},
journal = {Neurocomputing},
volume = {420},
pages = {149--161},
year = {2021},
issn = {0925-2312},
doi = {https://doi.org/10.1016/j.neucom.2020.08.022},
url = {https://www.sciencedirect.com/science/article/pii/S0925231220312820}
}