A graph-based dependency parser with a Transformer-based encoder, implemented using `transformers`.
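"Graph-based" here means that the parser scores every candidate head-dependent arc over the encoder's output states and then decodes a tree from the score matrix. As a minimal illustration of the idea (a sketch only; `BiaffineArcScorer` is a hypothetical name and the repository's actual model may differ), arc scoring with a biaffine layer looks roughly like this:

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Hypothetical sketch of biaffine arc scoring over Transformer encoder states."""

    def __init__(self, hidden_size: int, arc_dim: int = 512):
        super().__init__()
        self.head_mlp = nn.Linear(hidden_size, arc_dim)  # candidate-head projection
        self.dep_mlp = nn.Linear(hidden_size, arc_dim)   # candidate-dependent projection
        self.bilinear = nn.Parameter(torch.randn(arc_dim, arc_dim) / arc_dim ** 0.5)
        self.head_bias = nn.Linear(arc_dim, 1, bias=False)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, seq_len, hidden_size) from a Transformer encoder
        h = torch.relu(self.head_mlp(states))
        d = torch.relu(self.dep_mlp(states))
        # scores[b, i, j] = score of token i taking token j as its head
        scores = d @ self.bilinear @ h.transpose(1, 2)
        return scores + self.head_bias(h).transpose(1, 2)  # (batch, seq_len, seq_len)
```

Taking `scores.argmax(-1)` per token yields an unconstrained head assignment; a full graph-based parser instead decodes a maximum spanning tree over the scores (e.g., with the Chu-Liu/Edmonds algorithm) so the output is guaranteed to be a well-formed tree.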
```sh
$ git clone https://github.com/chantera/transformers-parser
$ cd transformers-parser
$ pip install -r requirements.txt
```
Training and inference are performed using `src/train.py`.
A dataset used for the training script must be a collection of entries, each of which consists of the following fields:
- `text` (string): raw text
- `form` (list of string): words
- `head` (list of integer): head indices (1-indexed; `0` denotes the root)
- `deprel` (list of string): dependency relations
Below is an example represented in JSON:
```json
{
  "text": "Tokyo is the capital of Japan.",
  "form": ["Tokyo", "is", "the", "capital", "of", "Japan", "."],
  "head": [4, 4, 4, 0, 6, 4, 4],
  "deprel": ["nsubj", "cop", "det", "root", "case", "nmod", "punct"]
}
```
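Entries in this format are typically produced from treebank files. For instance, a 10-column CoNLL-U file could be converted with a short script along these lines (a sketch, not part of this repository; note that joining forms with spaces only approximates the original `text`):

```python
import json

def conllu_to_entries(path):
    """Yield {'text', 'form', 'head', 'deprel'} dicts from a CoNLL-U file (sketch)."""
    form, head, deprel = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line ends a sentence
                if form:
                    yield {"text": " ".join(form), "form": form,
                           "head": head, "deprel": deprel}
                    form, head, deprel = [], [], []
            elif line.startswith("#"):  # comment/metadata line
                continue
            else:
                cols = line.split("\t")
                if "-" in cols[0] or "." in cols[0]:  # skip multiword/empty tokens
                    continue
                form.append(cols[1])       # FORM
                head.append(int(cols[6]))  # HEAD (1-indexed; 0 = root)
                deprel.append(cols[7])     # DEPREL
    if form:  # handle a file without a trailing blank line
        yield {"text": " ".join(form), "form": form, "head": head, "deprel": deprel}

with open("train.jsonl", "w", encoding="utf-8") as out:
    for entry in conllu_to_entries("train.conllu"):
        out.write(json.dumps(entry, ensure_ascii=False) + "\n")
```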
A dataset can be formatted in JSON Lines or prepared using a dataset loading script, as provided in `data/ptb_wsj` and `data/ud`.
Notes:
- For JSON Lines datasets, `train.jsonl`, `validation.jsonl`, and `test.jsonl` files must be placed in the dataset directory.
- For PTB, `train.conll`, `validation.conll`, and `test.conll` files must be placed in `data/ptb_wsj`.
- For UD, train/dev/test splits are automatically downloaded from Universal Dependencies.
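Once the three JSON Lines files are in place, the dataset can also be inspected directly with the 🤗 `datasets` library (a usage sketch with illustrative paths; the training script handles loading on its own):

```python
from datasets import load_dataset

# Load the three splits from a JSON Lines dataset directory.
dataset = load_dataset(
    "json",
    data_files={
        "train": "path/to/dataset/train.jsonl",
        "validation": "path/to/dataset/validation.jsonl",
        "test": "path/to/dataset/test.jsonl",
    },
)
print(dataset["train"][0])  # {'text': ..., 'form': ..., 'head': ..., 'deprel': ...}
```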
The training script utilizes `transformers.Trainer` (see the official documentation for details). In addition to the options of `transformers.TrainingArguments`, you can specify `dataset` and `model`.
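Extending `transformers.TrainingArguments` with extra options is commonly done with `transformers.HfArgumentParser`; below is a sketch of how `dataset` and `model` might be wired up (the actual argument classes in `src/train.py` may differ):

```python
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class ExtraArguments:
    """Hypothetical argument class mirroring the --dataset/--model flags."""
    dataset: str = field(metadata={"help": "Dataset directory or loading script."})
    model: str = field(metadata={"help": "Pretrained model name or path."})

parser = HfArgumentParser((ExtraArguments, TrainingArguments))
extra_args, training_args = parser.parse_args_into_dataclasses()
```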
Below is an example of training a parser on the Penn Treebank:
```sh
$ torchrun --nproc_per_node 4 src/train.py \
    --dataset ./data/ptb_wsj \
    --model roberta-large \
    --output_dir ./output \
    --num_train_epochs 10 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 32 \
    --learning_rate 5e-5 \
    --warmup_ratio 0.1 \
    --max_grad_norm 0.5 \
    --eval_strategy epoch \
    --eval_steps 1 \
    --save_strategy epoch \
    --save_steps 1 \
    --save_total_limit 5 \
    --metric_for_best_model UAS \
    --load_best_model_at_end \
    --output_logits_length 128 \
    --do_train \
    --do_eval \
    --do_predict \
    --seed 42
```
Notes:
- The default values for `transformers.TrainingArguments` are redefined in `training.conf`.
- `output_logits_length` must be specified in distributed training to align logits in length.
- UAS/LAS scores evaluated through the training script are not calculated using the CoNLL evaluation scripts, and thus some special treatments, such as for punctuation, are not taken into account. See Evaluation for calculating official UAS/LAS scores.
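For intuition, UAS is the percentage of tokens whose predicted head is correct, and LAS additionally requires the predicted relation label to match. A naive computation over the JSON fields looks like this (a sketch; unlike the CoNLL scripts, it counts every token, punctuation included):

```python
def attachment_scores(gold, pred):
    """Naive UAS/LAS over lists of {'head', 'deprel'} sentence dicts (sketch)."""
    total = uas_hits = las_hits = 0
    for g, p in zip(gold, pred):
        for gh, gd, ph, pd in zip(g["head"], g["deprel"], p["head"], p["deprel"]):
            total += 1
            if gh == ph:
                uas_hits += 1
                if gd == pd:
                    las_hits += 1
    return 100.0 * uas_hits / total, 100.0 * las_hits / total

sent = {"head": [4, 4, 4, 0, 6, 4, 4],
        "deprel": ["nsubj", "cop", "det", "root", "case", "nmod", "punct"]}
print(attachment_scores([sent], [sent]))  # (100.0, 100.0)
```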
Official UAS/LAS scores based on the CoNLL evaluation scripts can be computed using `eval/evaluate.py`. Below is an example of evaluating predictions on the Penn Treebank:
```sh
$ python eval/evaluate.py ./data/ptb_wsj/test.conll ./output/test_predictions.jsonl
```
| Model | UAS | LAS |
|---|---|---|
| roberta-large | 97.30 | 95.75 |