LAUG is an open-source toolkit for Language understanding AUGmentation. It provides automatic methods to approximate natural perturbations to existing data. The augmented data can be used for black-box robustness testing or for enhanced training. [paper]
Requires Python 3.6.

Clone this repository:

```bash
git clone https://github.com/thu-coai/LAUG.git
```

Install via pip:

```bash
cd LAUG
pip install -e .
```
Download data and models:

The data used in our paper and the model parameters pre-trained by us are available at Link. Please download them and place them into the corresponding dirs. For model parameters released by others, please refer to the README.md under the dir of each augmentation method, such as `LAUG/aug/Speech_Recognition/README.md`.
Here are the 4 augmentation methods described in our paper. They are placed under the `LAUG/aug` dir:

- Word Perturbation (WP), at `Word_Perturbation/` dir.
- Text Paraphrasing (TP), at `Text_Paraphrasing/` dir.
- Speech Recognition (SR), at `Speech_Recognition/` dir.
- Speech Disfluency (SD), at `Speech_Disfluency/` dir.
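To give an intuition for what these perturbations look like, here is a minimal, self-contained sketch of two of them: word perturbation via dictionary-based synonym substitution, and speech disfluency via filler-word insertion. This is an illustration only, not LAUG's implementation — the synonym table and filler list below are hypothetical stand-ins for the resources and neural models the real methods use.

```python
import random

# Toy synonym table standing in for the slot-value/synonym resources
# a real word-perturbation method would draw on (hypothetical data).
SYNONYMS = {"cheap": "inexpensive", "centre": "center", "restaurant": "eatery"}

# Common filler words used to simulate spoken disfluencies.
FILLERS = ["uh", "um", "you know"]

def word_perturbation(utterance: str) -> str:
    """Replace known words with synonyms, keeping the meaning intact."""
    return " ".join(SYNONYMS.get(tok, tok) for tok in utterance.split())

def speech_disfluency(utterance: str, rng: random.Random) -> str:
    """Insert a filler word at a random position to mimic spontaneous speech."""
    toks = utterance.split()
    toks.insert(rng.randrange(len(toks) + 1), rng.choice(FILLERS))
    return " ".join(toks)

if __name__ == "__main__":
    utt = "i want a cheap restaurant in the centre"
    print(word_perturbation(utt))
    # -> "i want a inexpensive eatery in the center"
    print(speech_disfluency(utt, random.Random(0)))
```

Both transformations preserve the utterance's semantics (and hence its gold NLU labels) while changing its surface form, which is what makes them usable for robustness testing.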
Please see our paper and the README.md of each augmentation method for detailed information.

See `demo.py` for the usage of these augmentation methods:

```bash
python demo.py
```
Note that our augmentation methods contain several neural models, whose pre-trained parameters need to be downloaded before use. Parameters pre-trained by us are available at Link. For parameters released by others, please follow the instructions of each method.
The data used in our paper is available at Link. Please download it and place it in the `data/` dir.
Our data contains 2 datasets: MultiWOZ and Frames, along with their augmented copies.
- MultiWOZ
  - Original data
    - We use MultiWOZ 2.3 as the original data. We place it at the `data/multiwoz/` dir.
    - Train/val/test size: 8434/999/1000 dialogs.
    - LICENSE:
  - Augmented data
    - We have 4 augmented test sets:
      - WP (Word Perturbation), size: 1000, placed at `data/multiwoz/WP`.
      - TP (Text Paraphrasing), size: 1000, placed at `data/multiwoz/TP`.
      - SR (Speech Recognition), size: 1000, placed at `data/multiwoz/SR`.
      - SD (Speech Disfluency), size: 1000, placed at `data/multiwoz/SD`.
    - We have 1 augmented training set:
      - Size: 16868. Contains: 50% Original + (12.5% WP + 12.5% TP + 12.5% SR + 12.5% SD). Placed at `data/multiwoz/Enhanced`.
  - Real user evaluation data:
    - We collected 240 utterances from real users for our real user evaluation.
    - We place it at the `data/multiwoz/Real` dir.
    - Please see our paper for detailed information about the statistics and collection of the real data.
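The enhanced training set above can be assembled with a simple sampling scheme. The following is an illustrative sketch of one way to produce the 50% Original + 4 × 12.5% augmented mix (the function and its dialog representation are hypothetical, not the script used for the released data): keep every original dialog, then partition a shuffled copy of the dialog indices across the four augmented copies so each contributes 12.5% of the final, doubled-size set.

```python
import random

def build_enhanced_set(original, augmented, seed=0):
    """Mix 50% original dialogs with 12.5% from each augmented copy.

    `original` is a list of dialogs; `augmented` maps a method name
    ("WP", "TP", "SR", "SD") to a parallel list of augmented versions
    of the same dialogs. The result is twice the original size,
    matching e.g. MultiWOZ's 8434 train dialogs -> 16868.
    """
    rng = random.Random(seed)
    n = len(original)
    enhanced = list(original)          # the 50% original half
    ids = list(range(n))
    rng.shuffle(ids)
    quarter = n // 4                   # 12.5% of the final 2n-dialog set
    methods = ["WP", "TP", "SR", "SD"]
    # Give each method a disjoint quarter of the shuffled dialog ids.
    for k, method in enumerate(methods):
        chunk = ids[k * quarter:(k + 1) * quarter]
        enhanced.extend(augmented[method][i] for i in chunk)
    # When n is not divisible by 4, top up with the leftover ids
    # so the total is exactly 2n (assigned here to the last method).
    enhanced.extend(augmented[methods[-1]][i] for i in ids[4 * quarter:])
    return enhanced
```

Using disjoint index chunks means each original dialog appears exactly twice: once verbatim and once under exactly one augmentation method.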
- Frames
  - Original data
    - We process Frames into the same format as MultiWOZ and place it at the `data/Frames/` dir.
    - Train/val/test size: 1095/137/137 dialogs.
    - LICENSE:
  - Augmented data
    - We have 4 augmented test sets:
      - WP (Word Perturbation), size: 137, placed at `data/Frames/WP`.
      - TP (Text Paraphrasing), size: 137, placed at `data/Frames/TP`.
      - SR (Speech Recognition), size: 137, placed at `data/Frames/SR`.
      - SD (Speech Disfluency), size: 137, placed at `data/Frames/SD`.
    - We have 1 augmented training set:
      - Size: 2190. Contains: 50% Original + (12.5% WP + 12.5% TP + 12.5% SR + 12.5% SD). Placed at `data/Frames/Enhanced`.
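Since both datasets share a MultiWOZ-style layout, a small loader can summarize any of the copies above. The sketch below assumes the public MultiWOZ JSON format (a dict mapping dialog id to a record whose `"log"` field lists the turns, each with a `"text"` field); the exact schema of our processed files may differ, so treat the field names as assumptions.

```python
import json

def summarize(path):
    """Count dialogs and turns in a MultiWOZ-style data file.

    Assumes the public MultiWOZ layout (dialog id -> {"log": [turns]},
    each turn carrying a "text" field); field names are assumptions,
    not a guarantee about LAUG's processed files.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    n_dialogs = len(data)
    n_turns = sum(len(record["log"]) for record in data.values())
    return n_dialogs, n_turns
```

Running such a check against an original copy and its augmented counterpart is a quick way to confirm the two stay parallel (same dialog count, same turn count) after augmentation.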
We provide four base NLU models which are described in our paper:
- MILU
- BERT
- CopyNet
- GPT-2
These models are adapted from ConvLab-2. For more details, you can refer to the README.md under the `LAUG/nlu/$model/$dataset` dir, such as `LAUG/nlu/gpt/multiwoz/README.md`.
If you use LAUG in your research, please cite:

```
@inproceedings{liu2021robustness,
  title={Robustness Testing of Language Understanding in Task-Oriented Dialog},
  author={Liu, Jiexi and Takanobu, Ryuichi and Wen, Jiaxin and Wan, Dazhen and Li, Hongguang and Nie, Weiran and Li, Cheng and Peng, Wei and Huang, Minlie},
  year={2021},
  booktitle={Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics},
}
```
Apache License 2.0