
Considerations for language model inclusion in default package or download them later #298

Open
bact opened this issue Oct 10, 2019 · 7 comments · Fixed by #414

@bact
Member

bact commented Oct 10, 2019

This issue serves as a note on the sizes of the different language models PyThaiNLP currently uses.

  • Some of them are included in the package and are immediately available after installation.
  • Some of them are not included in the package and are downloaded automatically on first use, at runtime.

To include or not include: Pros and cons

  • Having language models included in the standard package
    • Pros:
      • Less dependent on the network connection, more predictable behaviour
      • May be easier to manage and cache in continuous integration/testing environments
    • Cons:
      • Larger package size
      • May waste users' disk space with files they never use
  • Downloading language models at the point of first use
    • Pros:
      • Smaller package size
      • Users only download what they really use
    • Cons:
      • More dependent on the network connection, less predictable behaviour
      • Can slow down tests (multiple separate file downloads in sequence are slower than one big file download)

Use pip to download language models

Optionally, we can also consider creating separate model packages, uploading them to PyPI, and using pip to facilitate downloads.

Users can do something like pip install pythainlp-models-pos, pip install pythainlp-models[ner], or pip install pythainlp-models[all] during their environment setup, and then never have to worry about models being downloaded at runtime.

This way, we can use PyPI as our data host and also benefit from any proxies and caches that CI platforms/ISPs may have for PyPI. This can also be more secure than our self-managed system.

The standard PyPI package size limit is 60 MB, but a larger limit can be requested.
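As a rough illustration of the extras idea above, a minimal setup.py sketch could look like the following. The package and extra names are hypothetical, not actual PyThaiNLP packages:

```python
# Hypothetical meta-package that maps pip "extras" to per-model data packages,
# so `pip install pythainlp-models[ner]` pulls only the requested model.
from setuptools import setup

setup(
    name="pythainlp-models",  # hypothetical meta-package name
    version="0.1.0",
    install_requires=[],  # the meta-package itself ships no model data
    extras_require={
        "pos": ["pythainlp-models-pos"],  # hypothetical per-model packages on PyPI
        "ner": ["pythainlp-models-ner"],
        "all": ["pythainlp-models-pos", "pythainlp-models-ner"],
    },
)
```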

Size and Hosting

| Model | Filename | Size | Included in package? | Hosting |
|---|---|---|---|---|
| Language model (Thai Wikipedia) | thwiki_lm.pth | 1.0 GB | No | ? |
| Thai word vector | thai2vec.bin | 62.5 MB | No | ? |
| Thai Romanization | thai2rom-pytorch-attn-v0.1.tar | 12.2 MB | No | ? |
| Sentence segmentation (TED) | sentenceseg-ted.model | 5.2 MB | Yes | - |
| Thai Romanization v2 | thai2rom-v2.hdf5 | 5.1 MB | No | ? |
| Named-Entity Recognition | data.model | 1.8 MB | No | ? |
| Thai Wikipedia (for?) | thwiki_itos.pkl | 1.5 MB | No | ? |
| Thai Romanization | thai2rom-pytorch.tar | 276 KB | No | ? |

(Clearly, we also need a standard naming convention for model filenames.)

Training data and training scripts

See #344

Model card

Related to this, in terms of model description, see #471

Model auto-download

See discussion about pythainlp.corpus.get_corpus_path() at #385
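
For reference, the calling pattern under discussion there looks roughly like this (a minimal sketch, assuming get_corpus_path() returns None when the model has not been downloaded yet; the model name is only an example):

```python
from pythainlp.corpus import download, get_corpus_path

MODEL_NAME = "thai2fit_wv"  # example name; substitute any registered corpus/model name

path = get_corpus_path(MODEL_NAME)
if path is None:
    # Download explicitly instead of relying on an automatic download at first use.
    download(MODEL_NAME)
    path = get_corpus_path(MODEL_NAME)

print("Model file located at:", path)
```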

@bact bact added the corpus corpus/dataset-related issues label Oct 10, 2019
@p16i
Contributor

p16i commented Oct 11, 2019

Where are these models hosted?

@wannaphong
Member

@heytitle The models are self-hosted; see https://github.com/PyThaiNLP/pythainlp-corpus/blob/master/db.json

@bact
Member Author

bact commented Nov 14, 2019

Model files are currently hosted either on Dropbox or on GitHub.

@bact bact added this to the Future milestone Dec 6, 2019
@bact bact modified the milestones: Future, 2.2 Dec 14, 2019
@bact bact changed the title Language model size Considerations for language model inclusion in default package Dec 20, 2019
@bact bact changed the title Considerations for language model inclusion in default package Considerations for language model inclusion in default package or download them later Dec 20, 2019
@bact bact closed this as completed in #414 May 27, 2020
@bact bact reopened this May 27, 2020
@wannaphong
Member

Today, the Thai Named-Entity Recognition model and the Thai Romanization model are hosted in pythainlp-corpus on GitHub.

@alexcombessie

alexcombessie commented Dec 3, 2020

Hi,

This would be quite important for me as I work on secure servers with no Internet access and no permission to write to arbitrary paths.

I have also noted a weird behavior. Even if I do not need to download any model or corpora, the library currently requires writing an empty db.json to PYTHAINLP_DATA_DIR.

In the short term, could this be removed?

Addition: my issue is also related to #475

Thanks,

Alex

@wannaphong
Member

> Hi,
>
> This would be quite important for me as I work on secure servers with no Internet access and no permission to write to arbitrary paths.
>
> I have also noted a weird behavior. Even if I do not need to download any model or corpora, the library currently requires writing an empty db.json to PYTHAINLP_DATA_DIR.
>
> In the short term, could this be removed?
>
> Addition: my issue is also related to #475
>
> Thanks,
>
> Alex

I added a customizable path. You can customize the path by downloading the source code, changing the path, and then installing the package from source: 82b8df9
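
A minimal sketch of a workaround that avoids editing the source, assuming the PYTHAINLP_DATA_DIR environment variable mentioned above is honoured when PyThaiNLP resolves its data directory:

```python
import os

# Point PyThaiNLP's data directory at a writable location *before* importing it.
# Assumption: the library reads PYTHAINLP_DATA_DIR when deciding where to put
# db.json and downloaded model files.
os.environ["PYTHAINLP_DATA_DIR"] = "/opt/pythainlp-data"  # hypothetical writable path

from pythainlp.corpus import get_corpus_path  # import after the variable is set

print(get_corpus_path("thai2fit_wv"))  # example model name, for illustration only
```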

@wannaphong
Member

Today, many of our models are hosted on the Hugging Face Hub and downloaded using the same method. I think this is still the best way to download the models.
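
For models hosted on the Hub, a download sketch using the official huggingface_hub client could look like this (the repo id and filename are hypothetical placeholders, not actual PyThaiNLP locations):

```python
from huggingface_hub import hf_hub_download

# Files are cached locally by huggingface_hub after the first download.
local_path = hf_hub_download(
    repo_id="pythainlp/example-model",  # hypothetical repository id
    filename="model.bin",               # hypothetical file inside that repository
)
print("Model downloaded to:", local_path)
```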
