A part-of-speech/morphological tagger and a dependency parser that can be trained separately or jointly.
The code was originally written for tagging and parsing Ancient Greek, and the preprocessing is tailored accordingly. The methods containing language-specific details are `normalize_tokens` and `add_language_specific_tokens` in `utils.py`. These can be modified to support any language.
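If you port the code to another language, these two hooks are the ones to rewrite. Below is a minimal sketch of what that might look like, assuming the tokens arrive as a list of strings and the tokenizer is a `transformers` tokenizer; the method names come from the repo, but the signatures and bodies are illustrative assumptions:

```python
# Hypothetical sketch of the language-specific hooks in utils.py; the
# names come from this repo, but these signatures and bodies are assumed.
import unicodedata

def normalize_tokens(tokens):
    """Language-specific normalization applied to each word token."""
    # For Ancient Greek this is where diacritic handling such as
    # expand_iota/expand_rough would live; substitute your own rules.
    return [unicodedata.normalize("NFC", token) for token in tokens]

def add_language_specific_tokens(tokenizer):
    """Register symbols the subword tokenizer should treat as atomic."""
    # E.g. the heta introduced when expand_rough is enabled (illustrative).
    tokenizer.add_tokens(["ͱ"])
```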
The parser is a shallow version of Dozat and Manning's (2017) deep biaffine attention parser, as devised by Glavaš and Vulić (2021b).
The implementation builds on TowerParse (Glavaš and Vulić, 2021a). Credit is given in the source code where relevant.
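For readers unfamiliar with the architecture, the sketch below shows generic biaffine arc scoring in the spirit of Dozat and Manning (2017). It is a minimal PyTorch illustration with assumed dimensions, not this repository's implementation:

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Minimal biaffine arc scorer (generic sketch, not this repo's code)."""

    def __init__(self, hidden_dim: int, arc_dim: int = 512):
        super().__init__()
        # Separate projections for a token's roles as head and as dependent.
        self.head_mlp = nn.Sequential(nn.Linear(hidden_dim, arc_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(hidden_dim, arc_dim), nn.ReLU())
        # Biaffine weight; the extra head dimension acts as a bias term.
        self.weight = nn.Parameter(torch.empty(arc_dim, arc_dim + 1))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, seq_len, hidden_dim) encoder outputs, e.g. from BERT.
        deps = self.dep_mlp(states)                   # (B, T, arc_dim)
        heads = self.head_mlp(states)                 # (B, T, arc_dim)
        ones = torch.ones_like(heads[..., :1])
        heads = torch.cat([heads, ones], dim=-1)      # (B, T, arc_dim + 1)
        # scores[b, i, j] = plausibility of token j being the head of token i.
        return torch.einsum("bid,dk,bjk->bij", deps, self.weight, heads)
```

At prediction time, each token's head is then typically the argmax over its score row, or the output of a tree decoder such as Chu-Liu/Edmonds.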
If you use this code for something academic, feel free to cite the master's thesis that it was written for:
Bjerkeland, D. C. 2022. Tagging and Parsing Old Texts with New Techniques. University of Oslo. URL: http://urn.nb.no/URN:NBN:no-98954.
The following pre-trained model is available on Hugging Face:

- Ancient Greek (joint), trained on PROIEL (UD): `clemeth/ancient-greek-bert-finetuned-proiel-tag-parse`
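A quick way to verify that the checkpoint downloads and encodes text is via the standard `transformers` auto classes. This assumes the checkpoint exposes a standard BERT encoder; the tagging and parsing heads themselves are applied by this repository's own code:

```python
# Sanity check: load the published checkpoint's tokenizer and encoder.
# Assumes a standard BERT encoder under the hood; the tagging/parsing
# heads are handled by this repository's own loading code.
from transformers import AutoModel, AutoTokenizer

name = "clemeth/ancient-greek-bert-finetuned-proiel-tag-parse"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

batch = tokenizer("μῆνιν ἄειδε θεά", return_tensors="pt")
print(encoder(**batch).last_hidden_state.shape)
```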
To train a model, perform the following steps:

1. Install the dependencies, either with `pipenv` using the `Pipfile` directly, or by inspecting the file and installing them however you please.
2. Extract the label vocabulary from your training files using the `extract_vocabs` method in `utils.py` (a hypothetical invocation is sketched after this list).
3. Edit `config.py` to your needs (an illustrative configuration is also sketched after this list). The following parameters are available:

   | Parameter | Data type | Description |
   | --- | --- | --- |
   | `batch_size` | `int` | The batch size. |
   | `bert_lr` | `float` | The learning rate for the BERT parameters. |
   | `classifier_lr` | `float` | The learning rate for the classifier parameters in a separate model. |
   | `parser_lr` | `float` | The learning rate for the parser parameters in a joint model. |
   | `tagger_lr` | `float` | The learning rate for the tagger parameters in a joint model. |
   | `cased` | `bool` | `True` to keep letter case in the training data; `False` to lowercase. |
   | `device` | `str` | The processing device to run on: `'cpu'` for a regular CPU, or typically something like `'cuda:0'` for a GPU. |
   | `early_stop` | `int` | The number of epochs after which to quit training if the validation loss does not decrease. |
   | `epochs` | `int` | The maximum number of epochs to train for. |
   | `expand_iota` | `bool` | `True` to convert iota subscripts to adscripts; `False` to do nothing. |
   | `expand_rough` | `bool` | `True` to add a heta to words with rough breathing; `False` to do nothing. |
   | `ignore_punct` | `bool` | `True` to ignore punctuation and gap tokens during evaluation with `test.py`; `False` to include them. |
   | `last_layer_dropout` | `float` | The dropout probability of the last layer. |
   | `max_subword_len` | `int` | The maximum number of subword tokens per sentence. Must be higher than the number of subword tokens in the longest tokenized sentence in the data. Sentences can be pruned using the `write_shortened_dataset` method in `utils.py` (see the sketch after this list). |
   | `max_word_len` | `int` | The maximum number of word tokens per sentence. Must be higher than the length of the longest sentence in the data. |
   | `mode` | `str` | `'tag'`, `'parse'`, or `'joint'`. |
   | `model_name` | `str` | The Hugging Face path or local path to the transformer model to use. The bundled tokenizer will also be loaded. |
   | `models_path` | `str` | Path to the directory where models are saved. |
   | `name` | `str` | A name for the model to be trained/loaded. |
   | `num_warmup_steps` | `int` | The number of warmup steps for the optimizer. |
   | `pad_value` | `int` | The padding value. Must be negative. |
   | `print_gold` | `bool` | `True` to write the gold annotation to file when a prediction does not match during evaluation with `test.py`. |
   | `scheduler` | `str` | The type of scheduler to load via the `get_scheduler` method from `transformers`. |
   | `seed` | `int` | The RNG seed. |
   | `subword_prefix` | `str` | The subword prefix used by the transformer model. |
   | `test_path` | `str` | Path to the CoNLL-U file with test data. |
   | `train_path` | `str` | Path to the CoNLL-U file with training data. |
   | `val_path` | `str` | Path to the CoNLL-U file with validation data. |
   | `vocabs_path` | `str` | Path to the JSON file with the extracted label vocabulary (see step 2). |

4. Run `train.py`.
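For step 2, the call below is a hypothetical sketch assuming `extract_vocabs` takes the training file and an output path; check the actual signature in `utils.py` before copying it:

```python
# Hypothetical invocation of the vocabulary extraction step (step 2);
# the real signature in utils.py may differ (e.g. it may accept
# several CoNLL-U files at once). File paths are illustrative.
from utils import extract_vocabs

extract_vocabs("data/train.conllu", "vocabs.json")
```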
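For step 3, here is an illustrative `config.py` for joint training, assuming the file holds plain module-level variables. Every value below is an assumption to be adapted, not a recommended setting, and the encoder checkpoint is just an example of a BERT-style model:

```python
# Illustrative config.py; all values are assumptions, not recommendations.
batch_size = 32
bert_lr = 2e-5
classifier_lr = 1e-3   # separate ('tag'/'parse') models
parser_lr = 1e-3       # joint models
tagger_lr = 1e-3       # joint models
cased = True
device = "cuda:0"
early_stop = 5
epochs = 100
expand_iota = True
expand_rough = True
ignore_punct = True
last_layer_dropout = 0.2
max_subword_len = 200  # must exceed the longest tokenized sentence
max_word_len = 100     # must exceed the longest sentence
mode = "joint"         # 'tag', 'parse', or 'joint'
model_name = "pranaydeeps/Ancient-Greek-BERT"  # example BERT-style checkpoint
models_path = "models/"
name = "grc-joint"
num_warmup_steps = 500
pad_value = -100       # must be negative
print_gold = True
scheduler = "linear"   # passed to transformers.get_scheduler
seed = 42
subword_prefix = "##"  # BERT-style subword prefix
test_path = "data/test.conllu"
train_path = "data/train.conllu"
val_path = "data/val.conllu"
vocabs_path = "vocabs.json"
```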
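If a sentence exceeds `max_subword_len`, the table above points to `write_shortened_dataset` in `utils.py`. Again, this is a hypothetical call, since the actual parameters may differ:

```python
# Hypothetical invocation; check the real signature in utils.py.
# Input and output paths are illustrative.
from utils import write_shortened_dataset

write_shortened_dataset("data/train.conllu", "data/train-short.conllu")
```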
To evaluate a model, keep the same config file and run `test.py`.
- Bjerkeland, D. C. 2022. Tagging and Parsing Old Texts with New Techniques. University of Oslo. URL: http://urn.nb.no/URN:NBN:no-98954.
- Dozat, T. and C. D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing. Proceedings of ICLR 2017. URL: https://openreview.net/forum?id=Hk95PK9le.
- Glavaš, G. and I. Vulić. 2021a. Climbing the Tower of Treebanks: Improving Low-Resource Dependency Parsing via Hierarchical Source Selection. Findings of ACL-IJCNLP 2021, pp. 4878–4888. URL: https://dx.doi.org/10.18653/v1/2021.findings-acl.431.
- Glavaš, G. and I. Vulić. 2021b. Is Supervised Syntactic Parsing Beneficial for Language Understanding Tasks? An Empirical Investigation. Proceedings of EACL 2021, pp. 3090–3104. URL: https://dx.doi.org/10.18653/v1/2021.eacl-main.270.