Text pair classification toolkit.

- Transform the dataset to the standard format. We currently support SNLI, QNLI and Quora QP; please write your own transformation script for other datasets.

  ```
  python lion/data/dataset_utils/quoraqp.py convert-dataset --indir INDIR --outdir OUTDIR
  ```
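The exact standard format is defined by the provided scripts; as a hedged sketch of what a conversion script for your own dataset might look like (the TSV input and the sentA/sentB/label field names here are assumptions, not the toolkit's confirmed schema; check lion/data/dataset_utils/quoraqp.py for the real one):

```python
import csv
import json

def convert_tsv_to_jsonl(in_path, out_path):
    """Hypothetical converter: TSV with sentence1/sentence2/label columns
    to one JSON object per line. Adapt field names to the real schema."""
    with open(in_path, newline='', encoding='utf-8') as fin, \
         open(out_path, 'w', encoding='utf-8') as fout:
        reader = csv.DictReader(fin, delimiter='\t')
        for row in reader:
            record = {
                'sentA': row['sentence1'],
                'sentB': row['sentence2'],
                'label': row['label'],
            }
            fout.write(json.dumps(record) + '\n')
```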
- Preprocess the dataset.

  ```
  python lion/data/processor.py process-dataset --in_dir IN_DIR --out_dir OUT_DIR --splits ['train'|'dev'|'test'] --tokenizer_name [spacy/bert/xlnet] --vocab_file FILE_PATH --max_length SEQUENCE_LENGTH
  ```
- Create a directory for saving the model and put the config file in it.
- Edit the config file, modifying the train file and dev file paths.
- Run lion/training/trainer.py. For example:

  ```
  python lion/training/trainer.py --train --output_dir experiments/QQP/esim/
  ```
- Create a directory for saving the model and put the config file in it.
- Edit the config file, modifying the train file and dev file paths.
- Edit tuned_params.yaml with the values you want to search over. For example:

  ```
  hidden_size:
  - 100
  - 200
  - 300
  dropout:
  - 0.1
  - 0.2
  ```

- Run:

  ```
  python lion/training/search_parameter.py --parent_dir experiments/QQP/esim/hidden_dim/
  ```
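The search script presumably expands the value lists in tuned_params.yaml into a Cartesian grid and trains one model per combination. A minimal sketch of that expansion (the function name is illustrative, not the toolkit's API):

```python
from itertools import product

def expand_grid(tuned_params):
    """Expand a dict of hyperparameter lists (as in tuned_params.yaml)
    into one config dict per point on the Cartesian grid."""
    names = sorted(tuned_params)
    for values in product(*(tuned_params[n] for n in names)):
        yield dict(zip(names, values))

grid = list(expand_grid({'hidden_size': [100, 200, 300],
                         'dropout': [0.1, 0.2]}))
# 3 x 2 = 6 combinations, e.g. {'dropout': 0.1, 'hidden_size': 100}
```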
To evaluate on a dev set or predict on a test set:

```
python lion/training/trainer.py --evaluate --output_dir experiments/QQP/esim/ --dev_file your_dev_path
python lion/training/trainer.py --predict --output_dir experiments/QQP/esim/ --test_file your_test_file
```
Model | Quora QP | SNLI | QNLI |
---|---|---|---|
BiMPM | 86.9 | 86.0 | 80.5 |
ESIM | 88.4 | 87.4 | 81.4 |
BERT | 91.3 | 91.1 | 91.7 |
XLNet | 91.5 | 91.6 | 91.9 |
Note: all results in the table above are measured on the dev sets. The hyperparameters we used for these models are in the experiments/DATASET/MODEL directory.
To use ELMo embeddings, write this in your config file: use_elmo: concat or use_elmo: only, and remember to set word_dim correctly. For example, if you use the ELMo embedding only, set word_dim: 1024; set word_dim: 1324 if you use ELMo and GloVe together.
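The 1324 figure is just the concatenation of the two vectors per token; a quick check of the arithmetic (assuming the common 300-dimensional GloVe vectors):

```python
ELMO_DIM = 1024   # dimensionality of ELMo embeddings
GLOVE_DIM = 300   # dimensionality of 300-d GloVe vectors (assumed)

# With use_elmo: concat, the two vectors are concatenated per token,
# so word_dim must be the sum of the two dimensionalities.
word_dim = ELMO_DIM + GLOVE_DIM
```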