Saya-ang, K., Hamor, M. G., Gozum, D. J., & Mabansag, R. K. Bidirectional Encoder Representation from Transformer Tagalog Part of Speech Tagger [Computer software]. https://github.com/syke9p3/bert-tagalog-pos-tagger
This repository contains the training and testing Python files for fine-tuning gklmip/bert-tagalog-base-uncased model for Tagalog part of speech tagging
- Developed by: Saya-ang, Kenth G. (@syke9p3) | Gozum, Denise Julianne S. (@Xenoxianne) | Hamor, Mary Grizelle D. (@mnemoria) | Mabansag, Ria Karen B. (@riavx)
- Model type: BERT Tagalog Base Uncased
- Programming Language: Python
- Languages (NLP): Tagalog, Filipino
- Dataset: Sagum et. al.'s annotated Tagalog Corpora based on MGNN Tagset convention. This model was trained in 800 sentences and evaluated with 200 sentences.
- Finetuned from model: Jiang et. al.'s pre-trained bert-tagalog-base-uncased model
Try the model: HuggingFace Spaces
Model source code: HuggingFace
- PyTorch
- Regular Expressions
- Transformers
- SKLearn Metrics
- Datasets
- tqdm
A corpus was used containing tagged sentences in Tagalog language. The dataset comprises sentences with each word annotated with its corresponding POS tag in the format of <TAG word>
. To prepare the corpus for training, the following preprocessing steps were performed:
- Removal of Line Identifier: the line identifier, such as
SNT.108970.2066
, was removed from each tagged sentence. - Symbol Conversion: for the BERT model, certain special symbols like hyphens, quotes, commas, etc., were converted into special tokens (
PMP
,PMS
,PMC
) to preserve their meaning during tokenization. - Alignment of Tokenization: the BERT tokenized words and their corresponding POS tags were aligned to ensure that the tokenization and tagging are consistent.
The BERT Tagalog POS Tagger were trained using PyTorch library with the following hyperparameters set:
Hyperparamter | Value |
---|---|
Batch Size | 8 |
Training Epoch | 5 |
Learning-rate | 2e-5 |
Optimizer | Adam |
For the test sentences, almost the same preprocessing and tokenization steps as in training were performed, but without the need to extract POS tags from the sentence. The trained model was loaded to generate the tags for the input sentence along with Gradio to provide an interface for displaying the POS tag results.