
Considerations for language model inclusion in default package or download them later #298

Open
bact opened this issue Oct 10, 2019 · 7 comments · Fixed by #414

@bact
Member

bact commented Oct 10, 2019

This issue serves as a note on the sizes of the different language models PyThaiNLP currently uses.

  • Some of them are included in the package and are immediately available after installation.
  • Some of them are not included in the package and are downloaded automatically on first use, at runtime.

To include or not include: Pros and cons

  • Having language models included in the standard package
    • Pros:
      • Less dependent on the network connection, more predictable behaviour
      • May be easier to manage and cache in continuous integration/testing environments
    • Cons:
      • Larger package size
      • May waste users' disk space with files they never use
  • Downloading language models at the point of first use
    • Pros:
      • Smaller package size
      • Users only download what they really use
    • Cons:
      • More dependent on the network connection, less predictable behaviour
      • Can slow down tests (multiple separate file downloads in sequence are slower than one big file download)

Use pip to download language models

Optionally, we can also consider creating separate model packages, uploading them to PyPI, and using pip to facilitate downloads.

Users can do something like pip install pythainlp-models-pos, pip install pythainlp-models[ner], or pip install pythainlp-models[all] during their environment setup, and then never have to worry about models being downloaded at runtime.

This way, we can use PyPI as our data host and also benefit from any proxies and caches that CI platforms/ISPs may have for PyPI. This can also be more secure than our self-managed system.

The standard PyPI package size limit is 60 MB, but a larger limit can be requested.
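As a rough illustration of the extras idea above, a minimal setup.py sketch could look like the following. The package and extra names are hypothetical, not actual PyThaiNLP packages:

```python
# Hypothetical meta-package that maps pip "extras" to per-model data packages,
# so `pip install pythainlp-models[ner]` pulls only the requested model.
from setuptools import setup

setup(
    name="pythainlp-models",  # hypothetical meta-package name
    version="0.1.0",
    install_requires=[],  # the meta-package itself ships no model data
    extras_require={
        "pos": ["pythainlp-models-pos"],  # hypothetical per-model packages on PyPI
        "ner": ["pythainlp-models-ner"],
        "all": ["pythainlp-models-pos", "pythainlp-models-ner"],
    },
)
```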

Size and Hosting

| Model | Filename | Size | Included in package? | Hosting |
|---|---|---|---|---|
| Language model (Thai Wikipedia) | thwiki_lm.pth | 1.0 GB | No | ? |
| Thai word vector | thai2vec.bin | 62.5 MB | No | ? |
| Thai Romanization | thai2rom-pytorch-attn-v0.1.tar | 12.2 MB | No | ? |
| Sentence segmentation (TED) | sentenceseg-ted.model | 5.2 MB | Yes | - |
| Thai Romanization v2 | thai2rom-v2.hdf5 | 5.1 MB | No | ? |
| Named-Entity Recognition | data.model | 1.8 MB | No | ? |
| Thai Wikipedia (for?) | thwiki_itos.pkl | 1.5 MB | No | ? |
| Thai Romanization | thai2rom-pytorch.tar | 276 KB | No | ? |

(Clearly, we also need a standard naming convention for model filenames.)

Training data and training scripts

See #344

Model card

Related to this, in terms of model description, see #471

Model auto-download

See discussion about pythainlp.corpus.get_corpus_path() at #385
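
For reference, the calling pattern under discussion there looks roughly like this (a minimal sketch, assuming get_corpus_path() returns None when the model has not been downloaded yet; the model name is only an example):

```python
from pythainlp.corpus import download, get_corpus_path

MODEL_NAME = "thai2fit_wv"  # example name; substitute any registered corpus/model name

path = get_corpus_path(MODEL_NAME)
if path is None:
    # Download explicitly instead of relying on an automatic download at first use.
    download(MODEL_NAME)
    path = get_corpus_path(MODEL_NAME)

print("Model file located at:", path)
```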

@bact bact added the corpus corpus/dataset-related issues label Oct 10, 2019
@p16i
Contributor

p16i commented Oct 11, 2019

Where are these models hosted?

@wannaphong
Member

@heytitle The models are self-hosted; see https://github.com/PyThaiNLP/pythainlp-corpus/blob/master/db.json

@bact
Member Author

bact commented Nov 14, 2019

Model files are currently hosted either on Dropbox or on GitHub.

@bact bact added this to the Future milestone Dec 6, 2019
@bact bact modified the milestones: Future, 2.2 Dec 14, 2019
@bact bact changed the title Language model size Considerations for language model inclusion in default package Dec 20, 2019
@bact bact changed the title Considerations for language model inclusion in default package Considerations for language model inclusion in default package or download them later Dec 20, 2019
@bact bact closed this as completed in #414 May 27, 2020
@bact bact reopened this May 27, 2020
@wannaphong
Member

Today, the Thai Named-Entity Recognition model and the Thai Romanization model are hosted in pythainlp-corpus on GitHub.

@alexcombessie

alexcombessie commented Dec 3, 2020

Hi,

This would be quite important for me as I work on secure servers with no Internet access and no permission to write to arbitrary paths.

I have also noted a weird behavior. Even if I do not need to download any model or corpora, the library currently requires writing an empty db.json to PYTHAINLP_DATA_DIR.

In the short term, could this be removed?

Addition: my issue is also related to #475

Thanks,

Alex

@wannaphong
Member

> Hi,
>
> This would be quite important for me as I work on secure servers with no Internet access and no permission to write to arbitrary paths.
>
> I have also noted a weird behavior. Even if I do not need to download any model or corpora, the library currently requires writing an empty db.json to PYTHAINLP_DATA_DIR.
>
> In the short term, could this be removed?
>
> Addition: my issue is also related to #475
>
> Thanks,
>
> Alex

I added a customizable path. You can customize the path by downloading the source code, changing the path, and then installing the package from source: 82b8df9
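
A minimal sketch of a workaround that avoids editing the source, assuming the PYTHAINLP_DATA_DIR environment variable mentioned above is honoured when PyThaiNLP resolves its data directory:

```python
import os

# Point PyThaiNLP's data directory at a writable location *before* importing it.
# Assumption: the library reads PYTHAINLP_DATA_DIR when deciding where to put
# db.json and downloaded model files.
os.environ["PYTHAINLP_DATA_DIR"] = "/opt/pythainlp-data"  # hypothetical writable path

from pythainlp.corpus import get_corpus_path  # import after the variable is set

print(get_corpus_path("thai2fit_wv"))  # example model name, for illustration only
```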

@wannaphong
Member

Today, many of our models are hosted on the Hugging Face Hub and downloaded using the same method. I think this is still the best way to download the models.
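
For models hosted on the Hub, a download sketch using the official huggingface_hub client could look like this (the repo id and filename are hypothetical placeholders, not actual PyThaiNLP locations):

```python
from huggingface_hub import hf_hub_download

# Files are cached locally by huggingface_hub after the first download.
local_path = hf_hub_download(
    repo_id="pythainlp/example-model",  # hypothetical repository id
    filename="model.bin",               # hypothetical file inside that repository
)
print("Model downloaded to:", local_path)
```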
