Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

Source code for NAACL 2019 paper: Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

Overview of DiPS during decoding to generate k paraphrases. At each time step, a set of N sequences V^(t) is used to determine k < N sequences (X^∗) via submodular maximization . The above figure illustrates the motivation behind each submodular component. Please see Section 4 in the paper for details.

Also on GEM/NL-Augmenter 🦎 → 🐍

Please use/check diverse_paraphrase in NL-Augmenter for the transformer-model version. Diverse-Paraphrase: NL-Augmenter.

Dependencies

compatible with python 3.6
dependencies can be installed using requirements.txt

Dataset

Download the following datasets:

Quora
Twitter

Extract and place them in the data directory. Path : data/<dataset-folder-name>. A sample dataset folder might look like data/quora/<train/test/val>/<src.txt/tgt.txt>.

Download GoogleNews-vectors-negative300.bin.gz into the data directory. In case the above link doesn't work, find the zip file here

Setup:

To get the project's source code, clone the github repository:

$ git clone https://github.com/malllabiisc/DiPS

Install VirtualEnv using the following (optional):

$ [sudo] pip install virtualenv

Create and activate your virtual environment (optional):

$ virtualenv -p python3 venv
$ source venv/bin/activate

Install all the required packages:

$ pip install -r requirements.txt

Install the submodopt package by running the following command from the root directory of the repository:

$ cd ./packages/submodopt
$ python setup.py install
$ cd ../../

Training the sequence to sequence model

python -m src.main -mode train -gpu 0 -use_attn -bidirectional -dataset quora -run_name <run_name>

Create dictionary for submodular subset selection. Used for Semantic similarity (L₂)

To use trained embeddings -

python -m src.create_dict -model trained -run_name <run_name> -gpu 0

To use pretrained word2vec embeddings -

python -m src.create_dict -model pretrained -run_name <run_name> -gpu 0

This will generate the word2vec.pickle file in data/embeddings

Decoding using submodularity

python -m src.main -mode decode -selec submod -run_name <run_name> -beam_width 10 -gpu 0

Citation

Please cite the following paper if you find this work relevant to your application

@inproceedings{dips2019,
    title = "Submodular Optimization-based Diverse Paraphrasing and its Effectiveness in Data Augmentation",
    author = "Kumar, Ashutosh  and
      Bhattamishra, Satwik  and
      Bhandari, Manik  and
      Talukdar, Partha",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1363",
    pages = "3609--3619"
}

For any clarification, comments, or suggestions please create an issue or contact ashutosh@iisc.ac.in or Satwik Bhattamishra

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

Also on GEM/NL-Augmenter 🦎 → 🐍

Dependencies

Dataset

Setup:

Training the sequence to sequence model

Create dictionary for submodular subset selection. Used for Semantic similarity (L₂)

Decoding using submodularity

Citation

Files

README.md

Latest commit

History

README.md

File metadata and controls

Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

Also on GEM/NL-Augmenter 🦎 → 🐍

Dependencies

Dataset

Setup:

Training the sequence to sequence model

Create dictionary for submodular subset selection. Used for Semantic similarity (L2)

Decoding using submodularity

Citation

Create dictionary for submodular subset selection. Used for Semantic similarity (L₂)