Multi-view Subword Regularization

This repository contains the implementation for Multi-view Subword Regularization.

Multi-view Subword Regularization
Xinyi Wang, Sebastian Ruder, Graham Neubig
NAACL 2021

Our code is based on the XTREME benchmark.

Introduction

Multilingual pretrained models use a single subword segmentation model on data from hundreds of languages. This often leads to suboptimal subword segmentations, which hinder effective cross-lingual transfer. In this paper, we propose a simple and efficient subword regularization approach that is applied when fine-tuning pretrained models. It uses both a deterministic and a probabilistic segmentation of an input and enforces consistency between the model's predictions on the two.
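To make the two views concrete, here is a minimal toy sketch of a deterministic and a probabilistic segmentation of the same word. Everything in it is an assumption for illustration: the vocabulary `TOY_VOCAB`, its probabilities, the helper names, and the sampling scheme (sampling a segmentation in proportion to its unigram score raised to a smoothing exponent `alpha`). It is not the tokenizer used in this repository.

```python
import math
import random

# Hypothetical toy vocabulary with unigram log-probabilities (illustrative only).
TOY_VOCAB = {
    "un": math.log(0.10),
    "related": math.log(0.05),
    "unrelated": math.log(0.001),
    "rel": math.log(0.02),
    "ated": math.log(0.03),
    "u": math.log(0.01),
    "n": math.log(0.01),
}


def all_segmentations(word, vocab):
    """Enumerate every split of `word` into pieces that are all in `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in all_segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


def score(segmentation, vocab):
    """Log-probability of a segmentation under the toy unigram model."""
    return sum(vocab[piece] for piece in segmentation)


def deterministic_segment(word, vocab):
    """Deterministic view: always return the highest-scoring segmentation."""
    return max(all_segmentations(word, vocab), key=lambda s: score(s, vocab))


def sample_segment(word, vocab, alpha=0.5, rng=random):
    """Probabilistic view: sample a segmentation with weight exp(alpha * score).

    A smaller `alpha` flattens the distribution, so less likely
    segmentations are sampled more often.
    """
    candidates = all_segmentations(word, vocab)
    weights = [math.exp(alpha * score(s, vocab)) for s in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(deterministic_segment("unrelated", TOY_VOCAB))
    print(sample_segment("unrelated", TOY_VOCAB, rng=random.Random(0)))
```

The deterministic view always yields the same pieces for a given input, while repeated calls to `sample_segment` can yield different pieces that still concatenate to the original word.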

Main method implementation

We implement Multi-view Subword Regularization (MVR) for several different tasks. For example, the main logic of MVR for sequence tagging is here.
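As a rough illustration of the idea, the sketch below computes an MVR-style loss for a single classification example in plain Python. It is not the code from this repository. The function names, the equal weighting of the two cross-entropy terms, the use of a symmetric KL divergence as the consistency term, and the `consistency_weight` parameter are all assumptions made for this example. For sequence tagging, the same quantity would presumably be computed per token.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q)
    )


def mvr_loss(logits_det, logits_prob, label, consistency_weight=1.0):
    """Hypothetical MVR-style loss for one example.

    logits_det:  model logits for the deterministically segmented input
    logits_prob: model logits for the probabilistically segmented input
    label:       index of the gold class
    """
    p_det = softmax(logits_det)
    p_prob = softmax(logits_prob)

    # Supervised cross-entropy on each view (assumed to be weighted equally).
    ce_det = -math.log(p_det[label])
    ce_prob = -math.log(p_prob[label])

    # Consistency term: symmetric KL between the two predictions (assumption).
    consistency = kl_divergence(p_det, p_prob) + kl_divergence(p_prob, p_det)

    return 0.5 * (ce_det + ce_prob) + consistency_weight * consistency


if __name__ == "__main__":
    # The two views agree here, so the consistency term is zero.
    print(mvr_loss([2.0, 0.5], [2.0, 0.5], label=0))
    # The two views disagree here, so the consistency term adds a penalty.
    print(mvr_loss([2.0, 0.5], [0.5, 2.0], label=0))
```

When the two views produce the same prediction, the loss reduces to the ordinary cross-entropy. The further the predictions diverge, the larger the penalty.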

Download the data

We follow the data download instructions from the official XTREME repo.

To install the dependencies:

bash install_tools.sh

The next step is to download the data. To this end, first create a download folder with mkdir -p download in the root of this project. You then need to manually download panx_dataset (for NER) from here (note that it will download as AmazonPhotos.zip) to the download directory. Finally, run the following command to download the remaining datasets:

bash scripts/download_data.sh

Wikiann named entity recognition

For named entity recognition (NER), we use data from the Wikiann (panx) dataset. To fine-tune a pretrained multilingual model on English using Multi-view Subword Regularization:

bash mvr_scripts/train_mvr_panx.sh [MODEL]

PAWS-X sentence classification

For sentence classification, we use the Cross-lingual Paraphrase Adversaries from Word Scrambling (PAWS-X) dataset. You can fine-tune a pre-trained multilingual model on the English PAWS data with the following command:

bash mvr_scripts/train_mvr_pawsx.sh [MODEL]

XNLI sentence classification

The second sentence classification dataset is the Cross-lingual Natural Language Inference (XNLI) dataset. You can fine-tune a pre-trained multilingual model on the English MNLI data with the following command:

bash mvr_scripts/train_mvr_xnli.sh [MODEL]

XQuAD, MLQA question answering

For question answering, we use data from the XQuAD and MLQA datasets. For both, the model should be trained on the English SQuAD training set:

bash mvr_scripts/train_mvr_qa.sh [MODEL]

Paper

Please cite our paper \cite{wang2021multiview}.

@inproceedings{wang2021multiview,
      author    = {Xinyi Wang and Sebastian Ruder and Graham Neubig},
      title     = {Multi-view Subword Regularization},
      year      = {2021},
      booktitle = {NAACL}
}
