Model Cards
These model cards contain technical details of the models developed and used in PyThaiNLP.
Model Details
- Developer: Wannaphong Phatthiyaphaibun
- Model date: 2020-10-03
- Model version: 0.2
- Used in PyThaiNLP version: 2.2.4+
- Filename: ~/pythainlp-data/cls-v0.2.crfsuite
- GitHub: https://github.com/PyThaiNLP/pythainlp/pull/479
- Model type: CRF (conditional random field)
- License: CC0
Intended Use
- Segmenting Thai text into clauses (units smaller than a sentence but larger than a word).
- Not suitable for other languages or for non-news domains.
Factors
- Based on known problems in Thai natural language processing.
Metrics
- Evaluation metrics include precision, recall and f1-score.
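As a reminder of how these three metrics relate, the sketch below computes per-label precision, recall and F1 from a gold and a predicted tag sequence. It is a minimal illustration only: the function name `precision_recall_f1` and the short tag sequences are hypothetical and are not taken from the PyThaiNLP evaluation code.

```python
from typing import List, Tuple


def precision_recall_f1(gold: List[str], pred: List[str], label: str) -> Tuple[float, float, float]:
    """Per-label precision, recall and F1 over two aligned tag sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # true positives
    fp = sum(1 for g, p in pairs if g != label and p == label)  # false positives
    fn = sum(1 for g, p in pairs if g == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical tags, for illustration only (not real model output).
gold = ["B_CLS", "I_CLS", "E_CLS", "B_CLS", "E_CLS"]
pred = ["B_CLS", "I_CLS", "I_CLS", "B_CLS", "E_CLS"]
p, r, f1 = precision_recall_f1(gold, pred, "E_CLS")
print(p, r, round(f1, 2))
```

Here one of the two gold `E_CLS` tags is missed, so recall drops to 0.5 while precision stays at 1.0.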
Training Data
LST20 Corpus Train set (news domain)
Evaluation Data
LST20 Corpus Test set (news domain)
Quantitative Analyses
label         precision  recall  f1-score  support
B_CLS              0.90    0.94      0.92    16111
E_CLS              0.90    0.94      0.92    15947
I_CLS              0.99    0.97      0.98   169565
micro avg          0.97    0.97      0.97   201623
macro avg          0.93    0.95      0.94   201623
weighted avg       0.97    0.97      0.97   201623
samples avg        0.94    0.94      0.94   201623
Ethical Considerations
None identified so far.
Caveats and Recommendations
- Word segmentation must be performed before using this model.
- Thai text only.
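Because the model labels each word token as `B_CLS`, `I_CLS` or `E_CLS`, its output has to be grouped back into clauses. The sketch below shows one plausible way to do that grouping. The helper `tags_to_clauses`, the example tokens and the hand-written tags are all assumptions made for illustration; this is not the PyThaiNLP implementation.

```python
from typing import List


def tags_to_clauses(tokens: List[str], tags: List[str]) -> List[List[str]]:
    """Group word tokens into clauses using B_CLS / I_CLS / E_CLS tags."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    clauses: List[List[str]] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B_CLS" and current:
            # A new clause begins before the previous one was closed.
            clauses.append(current)
            current = []
        current.append(token)
        if tag == "E_CLS":
            # End of clause: flush the tokens collected so far.
            clauses.append(current)
            current = []
    if current:
        # Keep trailing tokens even if no E_CLS tag closed them.
        clauses.append(current)
    return clauses


# Already word-segmented input with hand-written tags (not model output).
tokens = ["ฝน", "ตก", "หนัก", "ถนน", "จึง", "ลื่น"]
tags = ["B_CLS", "I_CLS", "E_CLS", "B_CLS", "I_CLS", "E_CLS"]
clauses = tags_to_clauses(tokens, tags)
print(clauses)
```

The input is a list of word tokens rather than a raw string, which is why word segmentation has to happen first.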
Model Details
- Developer: Chonlapat Patanajirasit
- Model date: 2020-05-09
- Model version: 1.0
- Used in PyThaiNLP version: 2.2+
- Filename: pythainlp/corpus/sentenceseg_crfcut.model
- GitHub: https://github.com/vistec-AI/crfcut
- Model type: CRF (conditional random field)
- License: CC0
Intended Use
- Segmenting Thai text into sentences.
Factors
- Based on known problems in Thai natural language processing.
Metrics
- Evaluation metrics include precision, recall and f1-score.
Training Data
Not documented.
Evaluation Data
Not documented.
Quantitative Analyses
Not documented.
Ethical Considerations
None identified so far.
Caveats and Recommendations
- Thai text only
Model Details
- Developer: Wannaphong Phatthiyaphaibun
- Model date: 2020-05-21
- Model version: 1.4
- Used in PyThaiNLP version: 2.2+
- Filename: ~/pythainlp-data/thai-ner-1-4.crfsuite
- Model type: CRF (conditional random field)
- License: CC0
- GitHub for Thai NER 1.4 (Data and train notebook): https://github.com/wannaphong/thai-ner/tree/master/model/1.4
Intended Use
- Named-entity tagging for Thai.
- Not suitable for other languages or for non-news domains.
Factors
- Based on known problems in Thai natural language processing.
Metrics
- Evaluation metrics include precision, recall and f1-score.
Training Data
ThaiNER 1.3 Corpus Train set
Evaluation Data
ThaiNER 1.3 Corpus Test set
Quantitative Analyses
label           precision  recall  f1-score  support
B-DATE               0.92    0.86      0.89      375
I-DATE               0.94    0.94      0.94      747
B-EMAIL              1.00    1.00      1.00        5
I-EMAIL              1.00    1.00      1.00       28
B-LAW                0.71    0.56      0.62       43
I-LAW                0.74    0.70      0.72      154
B-LEN                0.96    0.93      0.95       29
I-LEN                0.98    0.94      0.96       69
B-LOCATION           0.88    0.77      0.82      864
I-LOCATION           0.86    0.73      0.79      852
B-MONEY              0.98    0.85      0.91      105
I-MONEY              0.96    0.95      0.95      239
B-ORGANIZATION       0.90    0.78      0.84     1166
I-ORGANIZATION       0.84    0.77      0.81     1338
B-PERCENT            1.00    0.97      0.99       34
I-PERCENT            1.00    0.96      0.98       51
B-PERSON             0.96    0.82      0.88      676
I-PERSON             0.94    0.92      0.93     2424
B-PHONE              1.00    0.72      0.84       29
I-PHONE              0.96    0.92      0.94       78
B-TIME               0.87    0.73      0.79      172
I-TIME               0.94    0.83      0.88      336
B-URL                0.89    1.00      0.94       24
I-URL                0.96    1.00      0.98      371
B-ZIP                1.00    1.00      1.00        4
micro avg            0.91    0.84      0.87    10213
macro avg            0.93    0.87      0.89    10213
weighted avg         0.91    0.84      0.87    10213
samples avg          0.17    0.17      0.17    10213
Ethical Considerations
None identified so far.
Caveats and Recommendations
- Thai text only
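The labels in the table above follow a `B-`/`I-` prefix scheme, where `B-` marks the first token of an entity and `I-` marks its continuation. The sketch below shows one way such per-token tags could be merged into entity spans. It assumes that tokens outside any entity carry the tag `O`, which the table does not list. The function `bio_to_entities` and the example sentence are hypothetical, not part of PyThaiNLP.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity_type, text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        nonlocal current_type, current_tokens
        if current_tokens:
            # Thai is written without spaces, so tokens are joined directly.
            entities.append((current_type, "".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" (assumed outside tag) or a stray I- tag: close any open entity.
            flush()
    flush()
    return entities


# Hand-written tags for illustration (not real model output).
tokens = ["นาย", "สมชาย", "ไป", "เชียงใหม่"]
tags = ["B-PERSON", "I-PERSON", "O", "B-LOCATION"]
entities = bio_to_entities(tokens, tags)
print(entities)
```

A stray `I-` tag with no matching `B-` before it is dropped in this sketch; other decoders may choose to start a new entity instead.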
Model Details
- Developer: Wannaphong Phatthiyaphaibun
- Model date: 2018-05-15
- Model version: 1.0
- Used in PyThaiNLP version: 1.7+
- Filename: pythainlp/corpus/pos_orchid_perceptron.json
- Model type: Perceptron
- License: CC0
- Training notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_orchid_postag_pythainlp.ipynb
Intended Use
- Part-of-speech tagging for Thai.
- Not suitable for other languages or for domains not covered by the Orchid corpus.
Factors
- Based on known problems in Thai natural language processing.
Metrics
- Evaluation metrics include precision, recall and f1-score.
Training Data
Orchid Corpus
Evaluation Data
Orchid Corpus
Quantitative Analyses
No data
Ethical Considerations
None identified so far.
Caveats and Recommendations
- Accepts Thai word tokens only.
Model Details
- Developer: Wannaphong Phatthiyaphaibun
- Model date: 2020-08-11
- Model version: 0.2.3
- Used in PyThaiNLP version: 2.2.5+
- Filename: pythainlp/corpus/pos_lst20_perceptron-v0.2.3.json
- Model type: Perceptron
- License: CC0
- Training notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_lst20_pythainlp.ipynb
Intended Use
- Part-of-speech tagging for Thai.
- Not suitable for other languages or for domains not covered by the LST20 corpus.
Factors
- Based on known problems in Thai natural language processing.
Metrics
- Evaluation metrics include precision, recall and f1-score.
Training Data
LST20 Corpus Train set
Evaluation Data
LST20 Corpus Test set
Quantitative Analyses
label         precision  recall  f1-score  support
AJ                 0.90    0.87      0.88     4403
AV                 0.88    0.79      0.83     6722
AX                 0.95    0.94      0.95     7556
CC                 0.94    0.97      0.95    17613
CL                 0.87    0.85      0.86     3739
FX                 0.99    0.99      0.99     6918
IJ                 1.00    0.25      0.40        4
NG                 1.00    1.00      1.00     1694
NN                 0.97    0.98      0.98    58568
NU                 0.98    0.98      0.98     6256
PA                 0.88    0.89      0.88      194
PR                 0.88    0.85      0.86     2139
PS                 0.94    0.93      0.94    10886
PU                 1.00    1.00      1.00    37973
VV                 0.95    0.97      0.96    42586
XX                 0.00    0.00      0.00       27
accuracy                             0.96   207278
macro avg          0.88    0.83      0.84   207278
weighted avg       0.96    0.96      0.96   207278
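The gap between the macro-average F1 (0.84) and the weighted-average F1 (0.96) comes from rare tags such as IJ and XX, which count as much as NN in the macro average but carry almost no weight in the weighted one. The sketch below recomputes both averages from the per-tag F1 scores and supports in the table above. The inputs are already rounded to two decimals, so this is an approximate check rather than the original evaluation script.

```python
from typing import Dict, Tuple

# (f1-score, support) per POS tag, copied from the LST20 table above.
REPORT: Dict[str, Tuple[float, int]] = {
    "AJ": (0.88, 4403), "AV": (0.83, 6722), "AX": (0.95, 7556),
    "CC": (0.95, 17613), "CL": (0.86, 3739), "FX": (0.99, 6918),
    "IJ": (0.40, 4), "NG": (1.00, 1694), "NN": (0.98, 58568),
    "NU": (0.98, 6256), "PA": (0.88, 194), "PR": (0.86, 2139),
    "PS": (0.94, 10886), "PU": (1.00, 37973), "VV": (0.96, 42586),
    "XX": (0.00, 27),
}


def macro_average(report: Dict[str, Tuple[float, int]]) -> float:
    """Unweighted mean: every tag counts equally, however rare."""
    return sum(f1 for f1, _ in report.values()) / len(report)


def weighted_average(report: Dict[str, Tuple[float, int]]) -> float:
    """Mean weighted by support: frequent tags dominate."""
    total = sum(support for _, support in report.values())
    return sum(f1 * support for f1, support in report.values()) / total


total_support = sum(support for _, support in REPORT.values())
macro_f1 = round(macro_average(REPORT), 2)
weighted_f1 = round(weighted_average(REPORT), 2)
print(total_support, macro_f1, weighted_f1)  # 207278 0.84 0.96
```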
Ethical Considerations
None identified so far.
Caveats and Recommendations
- Accepts Thai word tokens only.
Model Details
- Developer: Wannaphong Phatthiyaphaibun
- Model date: 2020-12-29
- Model version: 0.1
- Used in PyThaiNLP version: 2.3+
- GitHub: https://github.com/PyThaiNLP/pythainlp/pull/511
- License: CC0
- Training notebook: https://github.com/wannaphong/Thai_W2P/blob/main/train.ipynb
Intended Use
- Converting Thai words to Thai phonemes.
- Not suitable for other languages.
Factors
- Based on problems in converting Thai words to Thai phonemes.
Metrics
- Evaluation metric: phoneme error rate (number of errors / number of phonemes).
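The card defines phoneme error rate only as errors divided by phonemes. A common way to count errors is the Levenshtein (edit) distance between the predicted and reference phoneme sequences, and the sketch below assumes that definition. The training notebook may count errors differently. The function names and the phoneme symbols are hypothetical and used only for illustration.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total edit errors divided by total number of reference phonemes."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("reference contains no phonemes")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / total


# Made-up phoneme sequences: one substitution across five reference phonemes.
refs = [["k", "a", "t"], ["m", "aa"]]
hyps = [["k", "a", "p"], ["m", "aa"]]
per = phoneme_error_rate(refs, hyps)
print(per)  # 0.2
```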
Training Data
Thai W2P
Evaluation Data
Thai W2P
Quantitative Analyses
epoch: 100
step: 100, loss: 0.03179970383644104
step: 200, loss: 0.04126007482409477
step: 300, loss: 0.01877519115805626
step: 400, loss: 0.03311225399374962
per: 0.0432
per: 0.0419
Ethical Considerations
Thai phonemes are based on web sources (Wiktionary, the Royal Institute, etc.). They may not match the dialect you use in everyday life.
Caveats and Recommendations
- Accepts one Thai word at a time.