Model-Agnostic Meta Learning for Multilingual Dependency Parsing

Chung-I/maml-parsing

Meta-learning for Low-Resource Dependency Parsing

Data

All of the experiments were conducted on Universal Dependencies:

Setting up the environment

  1. Clone this repository:
git clone https://github.com/Chung-I/maml-parsing.git
  2. Set up conda environment:
conda create -n maml-parsing python=3.6
conda activate maml-parsing
  3. Install python package requirements:
pip install -r requirements.txt

Pre-training:

  • UD_GT: Root path of ground truth universal dependencies treebank files used for evaluation.
  • UD_ROOT: Root path of the treebank files used for training. To train on the ground-truth Universal Dependencies treebank files, set it to the same value as UD_GT. To train with input features from your own POS tagger, put all POS-tagged conllu files in a single folder and set UD_ROOT to that folder. For those who would like to compare their results with the paper, which uses tags predicted by the stanfordnlp POS taggers for training, we provide Universal Dependencies v2.2 preprocessed by stanfordnlp (stanfordnlp package).
  • CONFIG_NAME: json file storing training configuration such as dataset paths, model hyperparameter settings, training schedule, etc. See delexicalized parsing models and lexicalized parsing models for examples of configuration files to choose from.
  • Normal usage: Simply extract Universal Dependencies v2.2 to some folder, then set UD_GT="folder/**/" and UD_ROOT="folder/**/".
UD_GT="path/to/your/ud-treebanks-v2.2/**/" UD_ROOT="path/to/your/pos-tagged/conllu-files/" python -W ignore run.py train $CONFIG_NAME -s <serialization_dir> --include-package src
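The trailing `**/` in UD_GT and UD_ROOT reads like a recursive glob pattern. As a rough illustration only, assuming the paths are expanded glob-style and that treebank files follow the `<lang>*-<split>.conllu` naming used elsewhere in this README, the Python sketch below lists the files such a pattern would pick up. The helper name `find_treebank_files` is hypothetical and not part of this repository, whose dataset reader may resolve paths differently.

```python
import glob


def find_treebank_files(ud_path, lang, split):
    """List conllu files matched by e.g. 'folder/**/' + 'wo*-train.conllu'.

    Hypothetical helper: it only mimics a recursive glob expansion and is
    not taken from the repository's dataset reader.
    """
    pattern = f"{ud_path}{lang}*-{split}.conllu"
    # recursive=True lets '**' match zero or more directory levels.
    return sorted(glob.glob(pattern, recursive=True))
```

Calling `find_treebank_files("path/to/your/ud-treebanks-v2.2/**/", "wo", "train")` before launching training is a cheap way to check that the pattern matches at least one file.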

delexicalized parsing models:

lexicalized parsing models:

hyperparameters:

  • num_gradient_accumulation_steps: number of meta-learning inner-loop steps
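To make "inner steps" concrete, here is a toy sketch of a first-order MAML update on a scalar parameter, where `num_inner_steps` plays the role that num_gradient_accumulation_steps is described as playing above. It is not this repository's training code: the quadratic per-task loss, the function names, and the use of the same data for adaptation and meta-update are all simplifying assumptions made for the example.

```python
def loss_grad(theta, target):
    """Gradient of the toy per-task loss (theta - target) ** 2."""
    return 2.0 * (theta - target)


def inner_adapt(theta, target, inner_lr, num_inner_steps):
    """Take num_inner_steps gradient steps on one task, starting from theta."""
    adapted = theta
    for _ in range(num_inner_steps):
        adapted -= inner_lr * loss_grad(adapted, target)
    return adapted


def fomaml_step(theta, task_targets, inner_lr, outer_lr, num_inner_steps):
    """One first-order MAML meta-update averaged over a batch of tasks."""
    meta_grad = 0.0
    for target in task_targets:
        adapted = inner_adapt(theta, target, inner_lr, num_inner_steps)
        # First-order approximation: use the gradient at the adapted
        # parameters and ignore second-order terms through the inner loop.
        meta_grad += loss_grad(adapted, target)
    meta_grad /= len(task_targets)
    return theta - outer_lr * meta_grad
```

More inner steps move each task-specific copy further from the shared parameters before the meta-gradient is taken, which is the trade-off this hyperparameter controls in the toy setting.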

Zero-shot Transfer

  • UD_GT: Same as pre-training.
  • UD_ROOT: Root path of the treebank files used for testing. To feed ground-truth text segmentation and POS tags to the parser, set it to the same value as UD_GT. To compare results with CoNLL 2018 shared task submissions, which score the whole preprocessing pipeline before dependency parsing (tokenization, lemmatization, POS/morphological feature tagging, multi-word expansion) as well as parser accuracy, process the raw text with your own preprocessing pipeline, put all preprocessed conllu files in a single folder, and set UD_ROOT to that folder. The parser reads the test files there to generate system output. If you do not want to develop your own preprocessing pipeline but still want to compare with CoNLL 2018 submissions, we provide Universal Dependencies v2.2 preprocessed by the stanfordnlp pipeline (stanfordnlp package). We also provide Universal Dependencies v2.5 preprocessed by the stanza pipeline (stanza package) for users who would like to parse UD v2.5 treebanks and compare their results with stanza, Stanford's multilingual NLP system trained on UD v2.5.
  • EPOCH_NUM: Which pre-training epoch checkpoint to perform zero-shot transfer from.
  • ZS_LANG: Language code of the target transfer language (e.g. wo, te, cop).
  • SUFFIX: Suffix of folder names storing results.
  • <serialization_dir>: Directory of model to perform zero-shot transfer from. For example, if one would like to perform zero-shot transfer from the pos-only multi-task baseline model, simply extract pre-trained model multi-pos.tar.gz and set <serialization_dir> to that folder.
UD_GT="path/to/your/ud-treebanks-v2.x/**/" UD_ROOT="path/to/your/preprocessed/conllu-files/" bash zs-eval.sh <serialization_dir> $EPOCH_NUM $ZS_LANG 0 $SUFFIX

Results will be stored in log dir: <serialization_dir>_${EPOCH_NUM}_${ZS_LANG}_${SUFFIX}.
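When evaluating several target languages, it can help to assemble the command and the expected log directory programmatically. The sketch below mirrors the command line and naming scheme shown above. The function name `build_zs_eval_command` is hypothetical, and the literal "0" argument is copied from the command as-is because its meaning is not documented in this README.

```python
def build_zs_eval_command(serialization_dir, epoch_num, zs_lang, suffix,
                          ud_gt, ud_root):
    """Return (argv, env_overrides, log_dir) for one zs-eval.sh run.

    Hypothetical helper: it only assembles strings and runs nothing.
    """
    argv = [
        "bash", "zs-eval.sh",
        serialization_dir, str(epoch_num), zs_lang,
        "0",  # copied verbatim from the README command; meaning undocumented
        suffix,
    ]
    env_overrides = {"UD_GT": ud_gt, "UD_ROOT": ud_root}
    # Same pattern as <serialization_dir>_${EPOCH_NUM}_${ZS_LANG}_${SUFFIX}.
    log_dir = f"{serialization_dir}_{epoch_num}_{zs_lang}_{suffix}"
    return argv, env_overrides, log_dir
```

The returned pieces can be passed to `subprocess.run(argv, env={**os.environ, **env_overrides}, check=True)` inside a loop over language codes.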

Fine-tuning

  • UD_GT: Same as pre-training.
  • UD_ROOT: Same as zero-shot transfer.
  • EPOCH_NUM: Which pre-training epoch checkpoint to perform fine-tuning from.
  • FT_LANG: Language code of the target fine-tuning language (e.g. wo, te, cop).
  • NUM_EPOCHS: Number of epochs to fine-tune for.
  • SUFFIX: Suffix of folder names storing results.
  • <serialization_dir>: Directory of model to perform fine-tuning from. For example, if one would like to perform fine-tuning from the pos-only multi-task baseline model, simply extract pre-trained model multi-pos.tar.gz and set <serialization_dir> to that folder.
UD_GT="path/to/your/ud-treebanks-v2.x/**/" UD_ROOT="path/to/your/preprocessed/testset/" bash fine-tune.sh <serialization_dir> $EPOCH_NUM $FT_LANG $NUM_EPOCHS $SUFFIX

Results will be stored in log dir: <serialization_dir>_${EPOCH_NUM}_${FT_LANG}_${SUFFIX}.

Files in log directory

  • train-result.conllu: System prediction of training set ($UD_GT/$ZS_LANG*-train.conllu).
  • dev-result.conllu: System prediction of development set ($UD_GT/$ZS_LANG*-dev.conllu).
  • result.conllu: System prediction of testing set ($UD_ROOT/$ZS_LANG*-test.conllu).
  • result-gt.conllu: System prediction of testing set ($UD_GT/$ZS_LANG*-test.conllu).
  • result.txt: Performance (LAS, UAS, etc.) of result.conllu computed by utils/conll18_ud_eval.py, which is provided by the CoNLL 2018 Shared Task.
  • result-gt.txt: Performance (LAS, UAS, etc.) of result-gt.conllu computed by utils/error_analysis.py, which is modified from the CoNLL 2018 Shared Task script. It adds scores grouped by sentence length (LASlen[sentence length lower bound][sentence length upper bound]) and by dependency length (LASdep[dependency length]).
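For readers unfamiliar with the metrics, the sketch below computes UAS and LAS from two CoNLL-U files that share the same tokenization, as is the case when the parser runs on ground-truth segmentation. It is a simplified illustration, not a replacement for utils/conll18_ud_eval.py, which also aligns differing tokenizations. The function names are hypothetical, and comparing only the universal part of the relation label (the text before ":") is an assumption meant to follow the shared task convention.

```python
def read_heads_and_deprels(lines):
    """Return (head, universal deprel) pairs for syntactic words."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence boundaries and comment lines
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes ("3.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        pairs.append((cols[6], cols[7].split(":")[0]))
    return pairs


def attachment_scores(gold_lines, system_lines):
    """Return (UAS, LAS) in percent, assuming identical tokenization."""
    gold = read_heads_and_deprels(gold_lines)
    system = read_heads_and_deprels(system_lines)
    if not gold or len(gold) != len(system):
        raise ValueError("empty input or tokenization mismatch")
    uas_hits = sum(g[0] == s[0] for g, s in zip(gold, system))
    las_hits = sum(g == s for g, s in zip(gold, system))
    return 100.0 * uas_hits / len(gold), 100.0 * las_hits / len(gold)
```

UAS counts words whose predicted head is correct, while LAS additionally requires the correct relation label, so LAS can never exceed UAS.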
