This work is built on top of thai2transformers.
The environment is described in requirement.txt
We use three extractive question answering datasets for the Thai language:
- xquad-th (the split version on https://huggingface.co/datasets/zhufy/xquad_split)
- thaiqa (https://huggingface.co/datasets/thaiqa_squad)
- iapp_wiki_qa_squad (https://huggingface.co/datasets/iapp_wiki_qa_squad)
The training data from the three datasets is combined for training. The script that combines the datasets is
./utils/combine_datasets.py
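As a rough sketch of the idea (the actual logic lives in ./utils/combine_datasets.py and may differ), the training splits could be concatenated with the HuggingFace datasets library, assuming all three sources expose a train split and are first normalised to the same SQuAD-style fields:

```python
from datasets import load_dataset, concatenate_datasets

# Load the training split of each Thai QA dataset (split names are assumptions).
xquad_th = load_dataset("zhufy/xquad_split", split="train")
thaiqa = load_dataset("thaiqa_squad", split="train")
iapp = load_dataset("iapp_wiki_qa_squad", split="train")

# concatenate_datasets requires identical features, so keep only the
# SQuAD-style columns shared by all three sources. In practice the answer
# formats differ slightly between the datasets and may need extra
# normalisation before concatenation.
shared = ["question", "context", "answers"]
parts = [
    ds.remove_columns([c for c in ds.column_names if c not in shared])
    for ds in (xquad_th, thaiqa, iapp)
]

combined_train = concatenate_datasets(parts)
print(combined_train)
```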
We include both monolingual and multilingual language models that can work for the Thai language (a loading sketch follows the list):
- WangchanBERTa: we use the wangchanberta-base-att-spm-uncased model released by the VISTEC-depa AI Research Institute of Thailand on HuggingFace.
- Multilingual BERT: we use the bert-base-multilingual-cased released on HuggingFace.
- XLM-RoBERTa: we use the xlm-roberta-base released on HuggingFace.
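For reference, the three backbones can be loaded with the transformers Auto classes. The Hub IDs below are assumptions inferred from the released model names; the training scripts may resolve them differently.

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Assumed HuggingFace Hub IDs for the three backbones.
checkpoints = [
    "airesearch/wangchanberta-base-att-spm-uncased",  # WangchanBERTa (Thai monolingual)
    "bert-base-multilingual-cased",                   # Multilingual BERT
    "xlm-roberta-base",                               # XLM-RoBERTa
]

for name in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(name)
    # Adds a fresh span-prediction head on top of the pretrained encoder;
    # the head is trained during fine-tuning.
    model = AutoModelForQuestionAnswering.from_pretrained(name)
    print(name, "->", model.config.model_type)
```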
Run the script bash run_train.sh to fine-tune the model.
The parameters that can be set are as follows:
python train_combined.py \
    --model_name wangchanberta-base-att-spm-uncased \
    --dataset_name iapp_thaiqa_xquad \
    --output_dir ~/saved/thai/combined/wangchanberta-base-att-spm-uncased \
    --log_dir ./logs/combined-wangchanberta \
    --lowercase \
    --pad_on_right \
    --fp16 \
    --num_train_epochs 7 \
    --batch_size 50 \
    --gpu 7 \
    --seed 42
- train_base.py: Train baseline models with a single dataset
- train_combined.py: Train baseline models with the combined dataset
Run the script bash run_eval.sh to evaluate the model.
The parameters that can be set are as follows:
python eval_combined.py \
    --model_name bert-base-multilingual-cased \
    --eval_base \
    --dataset_name iapp_wiki_qa_squad \
    --output_dir ~/saved/thai/combined/bert-base-multilingual-cased \
    --log_dir ./test/ \
    --lowercase \
    --pad_on_right \
    --fp16 \
    --num_train_epochs 7 \
    --batch_size 50 \
    --seed 42 \
    --gpu 5
- eval.py: Evaluate the base models
- eval_combined.py: Evaluate the models trained with the combined dataset
- eval_huggingface: Evaluate a model released on HuggingFace (see the sketch below)
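As a minimal illustration of the last case, a fine-tuned extractive QA checkpoint from the Hub can be queried with the transformers question-answering pipeline; the model ID and example texts below are placeholders, not the exact behaviour of eval_huggingface:

```python
from transformers import pipeline

# Placeholder model ID: substitute any Thai extractive QA checkpoint
# (i.e. a model fine-tuned with a question-answering head) from the Hub.
qa = pipeline("question-answering", model="your-org/your-thai-qa-model")

result = qa(
    question="เมืองหลวงของประเทศไทยคือเมืองใด",        # "What is the capital of Thailand?"
    context="กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย",  # "Bangkok is the capital of Thailand."
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```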