Add training script for language models #344

Closed
bact opened this issue Dec 20, 2019 · 8 comments
Labels
corpus corpus/dataset-related issues documentation improve documentation and test cases enhancement enhance functionalities
Milestone

Comments

@bact
Member

bact commented Dec 20, 2019

Almost all the models we use now (see the list in #298) were trained privately by different contributors, with code in notebooks or scripts that may be private, or that may be open source but difficult to follow.

To make PyThaiNLP more transparent and more customizable by users, we should try to put training scripts or instructions (these can be pointers) somewhere in the repo.

Known scripts/notebooks and data

| Model | Filename | Training Script | Training Data |
|---|---|---|---|
| CRF-Cut | sentenceseg-ted.model | https://colab.research.google.com/drive/12nszk-N5LwpHzitlYvhNWVUDSBj30Z1Y | https://github.com/vistec-AI/ted_crawler |
| Enhanced Thai Character Cluster (ETCC) | etcc.txt | https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ | https://colab.research.google.com/drive/1UTQgxxMRxOr9Jp1B1jcq1frBNvorhtBQ |
| Language model (Thai Wikipedia) | thwiki_lm.pth | ? | ? |
| Thai Grapheme-to-Phoneme (Thai G2P) | thaig2p-0.1.tar | https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb | https://github.com/wannaphong/thai-g2p/blob/master/wiktionary-11-2-2020.tsv |
| Thai word vector | thai2vec.bin | https://github.com/cstorm125/thai2fit | ? |
| Sentence segmentation (TED) | sentenceseg-ted.model | https://github.com/vistec-AI/ted_crawler | TED Thai subtitles |
| Named-Entity Recognition | data.model | https://github.com/wannaphongcom/thai-ner | ? |
| Thai Wikipedia (for?) | thwiki_itos.pkl | ? | ? |
| POS Tagger | ud_thai-pud_pt_tagger.dill | https://github.com/PyThaiNLP/pythainlp_notebook/tree/master/postag | ? |
| Thai Romanization | thai2rom-pytorch-attn-v0.1.tar | https://github.com/artificiala/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb | https://github.com/wannaphong/thai-romanization |
| Thai Romanization v2 | thai2rom-v2.hdf5 | ? | ? |
| Thai Romanization | thai2rom-pytorch.tar | https://github.com/artificiala/thai-romanization | https://github.com/wannaphongcom/thai-romanization/ |

@bact bact added enhancement enhance functionalities corpus corpus/dataset-related issues documentation improve documentation and test cases labels Dec 20, 2019
@bact bact added this to the 2.2 milestone Dec 20, 2019
@p16i
Contributor

p16i commented Feb 28, 2020

IMHO, it would be good to have a guideline for this kind of resource.

For example, projects and models that will be incorporated under PyThaiNLP's umbrella should have:

  • documentation (e.g. training procedure and environment, datasets)
  • code for retraining
  • evaluation results
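
The checklist above could be tracked per model with a small record type. The following is only a sketch under assumptions: `ModelRecord` and `missing_items` are hypothetical names, not part of PyThaiNLP, and the example entry reuses the thai2vec row from the table in the first comment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelRecord:
    """Hypothetical record describing one model shipped with the library."""
    name: str
    filename: str
    documentation: Optional[str] = None    # training procedure, environment, datasets
    retraining_code: Optional[str] = None  # location of a script or notebook
    evaluation: Optional[str] = None       # where evaluation results are reported


def missing_items(record: ModelRecord) -> List[str]:
    """Return the guideline items that a record does not yet provide."""
    required = ("documentation", "retraining_code", "evaluation")
    return [item for item in required if not getattr(record, item)]


# Example entry: only the retraining code is known for this model.
record = ModelRecord(
    name="Thai word vector",
    filename="thai2vec.bin",
    retraining_code="https://github.com/cstorm125/thai2fit",
)
print(missing_items(record))
```

A check like this could run over every entry to list which models still lack documentation, retraining code, or evaluation results.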

@wannaphong
Member

Added ETCC and Thai Romanization.

@wannaphong
Member

Added CRF-Cut.

@p16i
Contributor

p16i commented Jun 26, 2020

This issue is quite challenging because those systems have or use very different frameworks. Instead, we might organise them based on the nature of the tasks and data.

Roughly speaking, we might decompose the problem into two parts, datasets and models. My idea is that we can implement TorchText for the former, and use PyTorch Lightning for model development and training.

Having said that, we probably need to start with TorchText and see how it goes.
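
To illustrate the dataset/model split only, here is a framework-free sketch. It deliberately does not use TorchText or PyTorch Lightning; `DATASETS`, `register_dataset`, `MajorityLabelModel`, and `train` are hypothetical names, and the toy data is a placeholder, not a real corpus.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)

# Dataset side: named loaders that all return the same example format.
DATASETS: Dict[str, Callable[[], List[Example]]] = {}


def register_dataset(name: str):
    """Register a loader function under a dataset name."""
    def decorator(loader: Callable[[], List[Example]]):
        DATASETS[name] = loader
        return loader
    return decorator


@register_dataset("toy")
def load_toy() -> List[Example]:
    # Placeholder data standing in for a real corpus.
    return [("a", "X"), ("b", "X"), ("c", "Y")]


# Model side: anything with fit/predict, independent of the data source.
class MajorityLabelModel:
    """Trivial baseline that always predicts the most frequent label."""

    def fit(self, examples: List[Example]) -> None:
        counts = Counter(label for _, label in examples)
        self.label = counts.most_common(1)[0][0]

    def predict(self, text: str) -> str:
        return self.label


def train(dataset_name: str, model):
    """Single training entry point: look up the data, then fit the model."""
    model.fit(DATASETS[dataset_name]())
    return model


model = train("toy", MajorityLabelModel())
print(model.predict("d"))
```

With this shape, a new task only needs a new loader or a new model class, and the training entry point stays the same.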

Related to #440

What do you think?

@wannaphong
Member

Added Thai Grapheme-to-Phoneme (Thai G2P).

@wannaphong
Member

@bact Should we move this to the wiki?

@bact
Member Author

bact commented Dec 19, 2020

Good idea. I agree to move it to the wiki.

@wannaphong
Member

I moved this page to https://github.com/PyThaiNLP/pythainlp/wiki/Language-Models
