A part-of-speech/morphological tagger and a dependency parser that can be trained separately or jointly.
The code was originally written for tagging and parsing Ancient Greek, and the preprocessing is tailored accordingly. The methods containing language-specific details are `normalize_tokens` and `add_language_specific_tokens` in `utils.py`. These can be modified to support any language.
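For example, a replacement `normalize_tokens` for a language without Greek-specific preprocessing might look like the sketch below. The exact signature in `utils.py` is an assumption here; the sketch assumes a list of token strings in and out.

```python
import unicodedata

def normalize_tokens(tokens):
    """Hypothetical language-specific normalization: NFC-normalize each
    token and strip combining diacritics. The real signature and behavior
    in utils.py may differ."""
    normalized = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFD", token)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        normalized.append(unicodedata.normalize("NFC", base))
    return normalized
```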
The parser is a shallow version of Dozat & Manning's 2017 deep biaffine attention parser, as devised by Glavaš and Vulić (2021b).
The implementation builds on TowerParse (Glavaš and Vulić, 2021a). Credit is given in the source code where relevant.
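For intuition, the core of such a parser is a biaffine scorer that rates every (dependent, head) pair; in the shallow variant the scorer sits directly on the transformer's output states rather than on a deep BiLSTM encoder. Below is a minimal sketch in PyTorch, with layer and variable names that are illustrative rather than taken from this repository:

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Illustrative biaffine arc scorer in the style of Dozat & Manning
    (2017); names and details do not mirror this repository's code."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.empty(hidden_dim, hidden_dim))
        self.b = nn.Parameter(torch.zeros(hidden_dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, dep: torch.Tensor, head: torch.Tensor) -> torch.Tensor:
        # dep, head: (batch, seq_len, hidden_dim) token representations.
        # Returns (batch, seq_len, seq_len) scores where
        # scores[b, i, j] = dep_i^T W head_j + b^T head_j.
        bilinear = dep @ self.W @ head.transpose(1, 2)  # pairwise terms
        prior = (head @ self.b).unsqueeze(1)            # head-only bias
        return bilinear + prior
```

Training then reduces to a cross-entropy loss over each token's head distribution, with an analogous biaffine classifier for dependency labels.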
If you use this code for something academic, feel free to cite the master's thesis that it was written for:
Bjerkeland, D. C. 2022. Tagging and Parsing Old Texts with New Techniques. University of Oslo. URL: http://urn.nb.no/URN:NBN:no-98954.
A pre-trained model is available on the Hugging Face Hub:

- Ancient Greek (joint); trained on PROIEL (UD): `clemeth/ancient-greek-bert-finetuned-proiel-tag-parse`
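A minimal sketch of fetching the checkpoint with the standard `transformers` API is shown below. Note the assumption that `AutoModel` loads only the BERT encoder weights; the tagging/parsing heads are handled by this repository's own model-loading code.

```python
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "clemeth/ancient-greek-bert-finetuned-proiel-tag-parse"

# Downloads (and caches) the tokenizer and encoder from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
```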
To train a model, perform the following steps:
- Dependencies can be installed with `pipenv` using the `Pipfile` directly. Otherwise, inspect the file and install them how you please.
- Extract the label vocabulary from your training files using the `extract_vocabs` method in `utils.py`.
- Edit `config.py` to your needs (an example configuration is sketched after this list). The following parameters are available:
  | Parameter | Data type | Description |
  |---|---|---|
  | `batch_size` | `int` | The batch size. |
  | `bert_lr` | `float` | The learning rate for the BERT parameters. |
  | `classifier_lr` | `float` | The learning rate for the classifier parameters in a separate model. |
  | `parser_lr` | `float` | The learning rate for the parser parameters in a joint model. |
  | `tagger_lr` | `float` | The learning rate for the tagger parameters in a joint model. |
  | `cased` | `bool` | `True` to keep letter case in training data; `False` to lowercase. |
  | `device` | `str` | The processing device to run on: `'cpu'` for a regular CPU, and typically something like `'cuda:0'` for GPUs. |
  | `early_stop` | `int` | The number of epochs after which to quit training if validation loss does not decrease. |
  | `epochs` | `int` | The maximum number of epochs to run training for. |
  | `expand_iota` | `bool` | `True` to convert iota subscripts to adscripts; `False` to do nothing. |
  | `expand_rough` | `bool` | `True` to add a heta to words with rough breathing; `False` to do nothing. |
  | `ignore_punct` | `bool` | `True` to ignore punctuation and gap tokens during evaluation with `test.py`; `False` to include them. |
  | `last_layer_dropout` | `float` | The dropout probability of the last layer. |
  | `max_subword_len` | `int` | The maximum number of subword tokens per sentence. Needs to be higher than the maximum number of subword tokens that any sentence in the data is tokenized into. Sentences can be pruned using the `write_shortened_dataset` method in `utils.py`. |
  | `max_word_len` | `int` | The maximum number of word tokens per sentence. Needs to be higher than the longest sentence in the data. |
  | `mode` | `str` | `'tag'`, `'parse'`, or `'joint'`. |
  | `model_name` | `str` | The Hugging Face path or local path to the transformer model to use. The bundled tokenizer will also be loaded. |
  | `models_path` | `str` | Path to where models are saved. |
  | `name` | `str` | A name for the model to be trained/loaded. |
  | `num_warmup_steps` | `int` | The number of warmup steps for the optimizer. |
  | `pad_value` | `int` | A pad value. Needs to be negative. |
  | `print_gold` | `bool` | `True` to write the gold annotation to file when a prediction doesn't match during evaluation with `test.py`. |
  | `scheduler` | `str` | The type of scheduler to load with the `get_scheduler` method from `transformers`. |
  | `seed` | `int` | The RNG seed. |
  | `subword_prefix` | `str` | The subword prefix used by the transformer model. |
  | `test_path` | `str` | Path to the CoNLL-U file with testing data. |
  | `train_path` | `str` | Path to the CoNLL-U file with training data. |
  | `val_path` | `str` | Path to the CoNLL-U file with validation data. |
  | `vocabs_path` | `str` | Path to the JSON file with the extracted label vocabulary (see the `extract_vocabs` step above). |
- Run `train.py`.
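For illustration, a filled-in `config.py` for joint training might look like the sketch below. The parameter names come from the table above; the values are examples only, and the assumption that `config.py` consists of plain module-level assignments may not match the file's actual layout.

```python
# Illustrative config.py for a joint tagging/parsing model.
# All values are examples, not recommendations.
batch_size = 16
bert_lr = 2e-5
classifier_lr = 1e-3   # classifier parameters (separate model)
parser_lr = 1e-3       # parser parameters (joint model)
tagger_lr = 1e-3       # tagger parameters (joint model)
cased = True
device = 'cuda:0'
early_stop = 5
epochs = 100
expand_iota = True
expand_rough = True
ignore_punct = True
last_layer_dropout = 0.2
max_subword_len = 256  # must exceed the longest tokenized sentence
max_word_len = 128     # must exceed the longest sentence in words
mode = 'joint'         # 'tag', 'parse', or 'joint'
model_name = 'bert-base-multilingual-cased'
models_path = 'models/'
name = 'grc-proiel-joint'
num_warmup_steps = 100
pad_value = -100
print_gold = True
scheduler = 'linear'   # passed to transformers.get_scheduler
seed = 42
subword_prefix = '##'
test_path = 'data/grc_proiel-ud-test.conllu'
train_path = 'data/grc_proiel-ud-train.conllu'
val_path = 'data/grc_proiel-ud-dev.conllu'
vocabs_path = 'data/vocabs.json'
```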
To evaluate a model, keep the same config file and run `test.py`.
References:

- Bjerkeland, D. C. 2022. Tagging and Parsing Old Texts with New Techniques. University of Oslo. URL: http://urn.nb.no/URN:NBN:no-98954.
- Dozat, T. and C. D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing. Proceedings of ICLR 2017. URL: https://openreview.net/forum?id=Hk95PK9le.
- Glavaš, G. and I. Vulić. 2021a. Climbing the Tower of Treebanks: Improving Low-Resource Dependency Parsing via Hierarchical Source Selection. Findings of ACL-IJCNLP 2021, pp. 4878–4888. URL: https://dx.doi.org/10.18653/v1/2021.findings-acl.431.
- Glavaš, G. and I. Vulić. 2021b. Is Supervised Syntactic Parsing Beneficial for Language Understanding Tasks? An Empirical Investigation. Proceedings of EACL 2021, pp. 3090–3104. URL: https://dx.doi.org/10.18653/v1/2021.eacl-main.270.