Add training script for language models #344

bact · 2019-12-20T18:09:41Z

Almost all models we use now (see list in #298) were trained privately by different contributors, with code in notebooks or scripts that may be private, or open source but difficult to follow.

To make PyThaiNLP more transparent and more customizable by users, we should try to put training scripts or instructions (can be pointers) somewhere in the repo.

Known scripts/notebooks and data

Model	Filename	Training Script	Training Data
CRF-Cut	sentenceseg-ted.model	https://colab.research.google.com/drive/12nszk-N5LwpHzitlYvhNWVUDSBj30Z1Y	https://github.com/vistec-AI/ted_crawler
Enhanced Thai Character Cluster (ETCC)	etcc.txt	https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ	https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ
Language model (Thai Wikipedia)	thwiki_lm.pth	?	?
Thai Grapheme-to-Phoneme (Thai G2P)	thaig2p-0.1.tar	https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb	https://github.com/wannaphong/thai-g2p/blob/master/wiktionary-11-2-2020.tsv
Thai word vector	thai2vec.bin	https://github.com/cstorm125/thai2fit	?
Sentence segmentation (TED)	sentenceseg-ted.model	https://github.com/vistec-AI/ted_crawler	TED Thai subtitles
Named-Entity Recognition	data.model	https://github.com/wannaphongcom/thai-ner	?
Thai Wikipedia (for?)	thwiki_itos.pkl	?	?
POS Tagger	ud_thai-pud_pt_tagger.dill	https://github.com/PyThaiNLP/pythainlp_notebook/tree/master/postag	?
Thai Romanization	thai2rom-pytorch-attn-v0.1.tar	https://github.com/artificiala/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb	https://github.com/wannaphong/thai-romanization
Thai Romanization v2	thai2rom-v2.hdf5	?	?
Thai Romanization	thai2rom-pytorch.tar	https://github.com/artificiala/thai-romanization	https://github.com/wannaphongcom/thai-romanization/
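
One way to track which entries above still lack a public training script or data pointer is to read the table as tab-separated text and flag the "?" cells. This is only a minimal sketch: the `find_gaps` helper and the abbreviated three-row sample are illustrative assumptions, not part of PyThaiNLP.

```python
import csv
import io

# Abbreviated copy of the table above (tab-separated); "?" marks an unknown cell.
TABLE = (
    "Model\tFilename\tTraining Script\tTraining Data\n"
    "Language model (Thai Wikipedia)\tthwiki_lm.pth\t?\t?\n"
    "Thai word vector\tthai2vec.bin\thttps://github.com/cstorm125/thai2fit\t?\n"
    "Sentence segmentation (TED)\tsentenceseg-ted.model\t"
    "https://github.com/vistec-AI/ted_crawler\tTED Thai subtitles\n"
)


def find_gaps(table_text):
    """Hypothetical helper: map each filename to the columns still marked '?'."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    gaps = {}
    for row in reader:
        missing = [
            col
            for col in ("Training Script", "Training Data")
            if row[col].strip() == "?"
        ]
        if missing:
            gaps[row["Filename"]] = missing
    return gaps


gaps = find_gaps(TABLE)
for filename, missing in gaps.items():
    print(f"{filename}: missing {', '.join(missing)}")
```

Rows with both columns filled in are omitted from the result, so the output lists only the models that still need a pointer.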

p16i · 2020-02-28T16:58:09Z

IMHO, it would be good to have a guideline for this kind of resource.

For example, projects and models that will be brought under PyThaiNLP's umbrella should have:

- documents (e.g. training procedure and environment, datasets)
- code for retraining
- evaluation results
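
The checklist above could be captured as a small record per model, so that missing pieces are easy to spot. The following is a rough sketch only: `ModelRecord` and its field names are hypothetical and do not exist in PyThaiNLP.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelRecord:
    """Hypothetical record of what a contributed model ships with."""

    name: str
    documentation: Optional[str] = None  # training procedure, environment, datasets
    training_code: Optional[str] = None  # pointer to a script or notebook for retraining
    evaluation: Optional[str] = None  # pointer to evaluation results

    def missing(self) -> List[str]:
        """Return the names of the checklist items that are not filled in."""
        required = ("documentation", "training_code", "evaluation")
        return [item for item in required if not getattr(self, item)]


record = ModelRecord(
    name="Thai Grapheme-to-Phoneme (Thai G2P)",
    training_code="https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb",
)
print(record.missing())
```

A model would meet the proposed guideline when `missing()` returns an empty list.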

wannaphong · 2020-04-12T18:27:43Z

Add ETCC and Thai Romanization

wannaphong · 2020-04-28T18:43:30Z

Add CRF-Cut

p16i · 2020-06-26T15:58:27Z

This issue is quite challenging because those systems have or use very different frameworks. Instead, we might organise them based on the nature of the tasks and data.

Roughly speaking, we might decompose the problem into two parts: datasets and models. My idea is that we can use TorchText for the former, and PyTorch Lightning for model development and training.

Having said that, we probably need to start with TorchText and see how it goes.

Related to #440

What do you think?

wannaphong · 2020-07-18T15:17:38Z

Add Thai Grapheme-to-Phoneme (Thai G2P).

wannaphong · 2020-12-19T09:51:23Z

@bact move to wiki?

bact · 2020-12-19T09:53:35Z

Good idea. I agree to move it to the wiki.

wannaphong · 2020-12-19T10:03:09Z

I moved this page to https://github.com/PyThaiNLP/pythainlp/wiki/Language-Models

bact added the enhancement, corpus, and documentation labels Dec 20, 2019

bact added this to the 2.2 milestone Dec 20, 2019

bact mentioned this issue Dec 20, 2019

Add CRFCut sentence segmentation #337

Merged

bact mentioned this issue Dec 20, 2019

Considerations for language model inclusion in default package or download them later #298

Open

wannaphong closed this as completed Dec 19, 2020
