A graph-based dependency parser with a Transformer-based encoder, implemented using `transformers`.
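"Graph-based" here means that the parser scores every candidate head-dependent arc over the encoder's output states and then decodes a tree from the score matrix. As a minimal illustration of the idea (a sketch only; `BiaffineArcScorer` is a hypothetical name and the repository's actual model may differ), arc scoring with a biaffine layer looks roughly like this:

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Hypothetical sketch of biaffine arc scoring over Transformer encoder states."""

    def __init__(self, hidden_size: int, arc_dim: int = 512):
        super().__init__()
        self.head_mlp = nn.Linear(hidden_size, arc_dim)  # candidate-head projection
        self.dep_mlp = nn.Linear(hidden_size, arc_dim)   # candidate-dependent projection
        self.bilinear = nn.Parameter(torch.randn(arc_dim, arc_dim) / arc_dim ** 0.5)
        self.head_bias = nn.Linear(arc_dim, 1, bias=False)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, seq_len, hidden_size) from a Transformer encoder
        h = torch.relu(self.head_mlp(states))
        d = torch.relu(self.dep_mlp(states))
        # scores[b, i, j] = score of token i taking token j as its head
        scores = d @ self.bilinear @ h.transpose(1, 2)
        return scores + self.head_bias(h).transpose(1, 2)  # (batch, seq_len, seq_len)
```

Taking `scores.argmax(-1)` per token yields an unconstrained head assignment; a full graph-based parser instead decodes a maximum spanning tree over the scores (e.g., with the Chu-Liu/Edmonds algorithm) so the output is guaranteed to be a well-formed tree.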
```sh
$ git clone https://github.com/chantera/transformers-parser
$ cd transformers-parser
$ pip install -r requirements.txt
```
Training and inference are performed using `src/train.py`.
A dataset used for the training script must be a collection of entries, each of which consists of the following fields:
- `text` (string): raw text
- `form` (list of string): words
- `head` (list of integer): head indices (1-indexed; `0` denotes the root)
- `deprel` (list of string): dependency relations
Below is an example represented in JSON:
```json
{
  "text": "Tokyo is the capital of Japan.",
  "form": ["Tokyo", "is", "the", "capital", "of", "Japan", "."],
  "head": [4, 4, 4, 0, 6, 4, 4],
  "deprel": ["nsubj", "cop", "det", "root", "case", "nmod", "punct"]
}
```
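Entries in this format are typically produced from treebank files. For instance, a 10-column CoNLL-U file could be converted with a short script along these lines (a sketch, not part of this repository; note that joining forms with spaces only approximates the original `text`):

```python
import json

def conllu_to_entries(path):
    """Yield {'text', 'form', 'head', 'deprel'} dicts from a CoNLL-U file (sketch)."""
    form, head, deprel = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line ends a sentence
                if form:
                    yield {"text": " ".join(form), "form": form,
                           "head": head, "deprel": deprel}
                    form, head, deprel = [], [], []
            elif line.startswith("#"):  # comment/metadata line
                continue
            else:
                cols = line.split("\t")
                if "-" in cols[0] or "." in cols[0]:  # skip multiword/empty tokens
                    continue
                form.append(cols[1])       # FORM
                head.append(int(cols[6]))  # HEAD (1-indexed; 0 = root)
                deprel.append(cols[7])     # DEPREL
    if form:  # handle a file without a trailing blank line
        yield {"text": " ".join(form), "form": form, "head": head, "deprel": deprel}

with open("train.jsonl", "w", encoding="utf-8") as out:
    for entry in conllu_to_entries("train.conllu"):
        out.write(json.dumps(entry, ensure_ascii=False) + "\n")
```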
A dataset can be formatted in JSON Lines or prepared using a dataset loading script, as provided in `data/ptb_wsj` and `data/ud`.
Notes:
- For JSON Lines datasets, `train.jsonl`, `validation.jsonl`, and `test.jsonl` files must be placed in the dataset directory.
- For PTB, `train.conll`, `validation.conll`, and `test.conll` files must be placed in `data/ptb_wsj`.
- For UD, train/dev/test splits are automatically downloaded from Universal Dependencies.
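Once the three JSON Lines files are in place, the dataset can also be inspected directly with the 🤗 `datasets` library (a usage sketch with illustrative paths; the training script handles loading on its own):

```python
from datasets import load_dataset

# Load the three splits from a JSON Lines dataset directory.
dataset = load_dataset(
    "json",
    data_files={
        "train": "path/to/dataset/train.jsonl",
        "validation": "path/to/dataset/validation.jsonl",
        "test": "path/to/dataset/test.jsonl",
    },
)
print(dataset["train"][0])  # {'text': ..., 'form': ..., 'head': ..., 'deprel': ...}
```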
The training script utilizes `transformers.Trainer` (see the official documentation for details). In addition to the options of `transformers.TrainingArguments`, you can specify `dataset` and `model`.
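Extending `transformers.TrainingArguments` with extra options is commonly done with `transformers.HfArgumentParser`; below is a sketch of how `dataset` and `model` might be wired up (the actual argument classes in `src/train.py` may differ):

```python
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class ExtraArguments:
    """Hypothetical argument class mirroring the --dataset/--model flags."""
    dataset: str = field(metadata={"help": "Dataset directory or loading script."})
    model: str = field(metadata={"help": "Pretrained model name or path."})

parser = HfArgumentParser((ExtraArguments, TrainingArguments))
extra_args, training_args = parser.parse_args_into_dataclasses()
```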
Below is an example of training a parser on the Penn Treebank:
```sh
$ torchrun --nproc_per_node 4 src/train.py \
    --dataset ./data/ptb_wsj \
    --model roberta-large \
    --output_dir ./output \
    --num_train_epochs 10 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 32 \
    --learning_rate 5e-5 \
    --warmup_ratio 0.1 \
    --max_grad_norm 0.5 \
    --eval_strategy epoch \
    --eval_steps 1 \
    --save_strategy epoch \
    --save_steps 1 \
    --save_total_limit 5 \
    --metric_for_best_model UAS \
    --load_best_model_at_end \
    --output_logits_length 128 \
    --do_train \
    --do_eval \
    --do_predict \
    --seed 42
```
Notes:
- The default values for `transformers.TrainingArguments` are redefined in `training.conf`.
- `output_logits_length` must be specified in distributed training to align logits in length.
- UAS/LAS scores evaluated through the training script are not calculated using the CoNLL evaluation scripts, and thus some special treatments, such as for punctuation, are not taken into account. See Evaluation for calculating official UAS/LAS scores.
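For intuition, UAS is the percentage of tokens whose predicted head is correct, and LAS additionally requires the predicted relation label to match. A naive computation over the JSON fields looks like this (a sketch; unlike the CoNLL scripts, it counts every token, punctuation included):

```python
def attachment_scores(gold, pred):
    """Naive UAS/LAS over lists of {'head', 'deprel'} sentence dicts (sketch)."""
    total = uas_hits = las_hits = 0
    for g, p in zip(gold, pred):
        for gh, gd, ph, pd in zip(g["head"], g["deprel"], p["head"], p["deprel"]):
            total += 1
            if gh == ph:
                uas_hits += 1
                if gd == pd:
                    las_hits += 1
    return 100.0 * uas_hits / total, 100.0 * las_hits / total

sent = {"head": [4, 4, 4, 0, 6, 4, 4],
        "deprel": ["nsubj", "cop", "det", "root", "case", "nmod", "punct"]}
print(attachment_scores([sent], [sent]))  # (100.0, 100.0)
```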
Official UAS/LAS scores based on the CoNLL evaluation scripts can be computed using `eval/evaluate.py`. Below is an example of evaluating predictions on the Penn Treebank:
```sh
$ python eval/evaluate.py ./data/ptb_wsj/test.conll ./output/test_predictions.jsonl
```
| Model | UAS | LAS |
|---|---|---|
| roberta-large | 97.30 | 95.75 |