Description
Detailed description
I would suggest adding a large dictionary (300K+ words) for better tokenization performance.
Context
I am currently doing text mining on Pantip. Pantip has a lot of new(?) words and proper nouns that pythainlp.corpus.common.thai_words()
does not cover.
But when I added new words from
- the Volubilis Dictionary (for new words)
- titles of Thai Wikipedia articles (for proper nouns)

the tokenization performance improved by around 10% (roughly as sketched below). The merged dictionary became 300K words in total.
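For reference, this is a minimal sketch of how I merge the extra word lists today. The two file paths are placeholders for my local copies of the Volubilis-derived word list and the Wikipedia title list; `thai_words()`, `dict_trie()`, and `word_tokenize()` are existing PyThaiNLP APIs.

```python
from pythainlp.corpus.common import thai_words
from pythainlp.tokenize import word_tokenize
from pythainlp.util import dict_trie

# Start from the word list shipped with PyThaiNLP
custom_words = set(thai_words())

# Placeholder file names for my locally prepared word lists
for path in ("volubilis_words.txt", "thwiki_titles.txt"):
    with open(path, encoding="utf-8") as f:
        custom_words.update(line.strip() for line in f if line.strip())

# Build a trie (~300K words in total) and tokenize with it
trie = dict_trie(dict_source=custom_words)
tokens = word_tokenize("ข้อความจากพันทิป", custom_dict=trie, engine="newmm")
```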
I guess it could be useful if this large dictionary were easily available to other users too (just an import from the pythainlp modules).
Possible implementation
Simply build a word-list file from the sources above and serve it as pythainlp.corpus.common.thai_words_large()
or something similar (since dynamically downloading from those sources could be a burden for the providers).
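A possible shape for the new function, assuming the merged list is bundled as a plain-text corpus file. The function name `thai_words_large` and the file name `words_th_large.txt` are hypothetical; `get_corpus()` is, as far as I can tell, the same helper that `thai_words()` itself uses, so this just mirrors the existing pattern.

```python
from pythainlp.corpus import get_corpus

_THAI_WORDS_LARGE = frozenset()


def thai_words_large() -> frozenset:
    """Return the large (~300K-word) merged word list, loaded lazily."""
    global _THAI_WORDS_LARGE
    if not _THAI_WORDS_LARGE:
        # "words_th_large.txt" would be a new data file bundled with the corpus
        _THAI_WORDS_LARGE = get_corpus("words_th_large.txt")
    return _THAI_WORDS_LARGE
```

Users could then pass the result to `dict_trie()` and `word_tokenize(custom_dict=...)` exactly as with `thai_words()` today.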