
[Suggestion] Add a large dictionary data #858

Closed
@konbraphat51

Description


Detailed description

I suggest adding a large (300K+ words) dictionary to improve tokenization performance.

Context

I am currently doing text mining on Pantip. Pantip has a lot of new(?) words and proper nouns that pythainlp.corpus.common.thai_words() couldn't catch.
But when I added new words from

the tokenization performance improved by around 10%, and the dictionary grew to 300K words in total.

I think this large dictionary could be useful for other users too if it were easily available (i.e., importable directly from a pythainlp module).
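For reference, the merging I did looks roughly like this with PyThaiNLP's current API; this is just a sketch, and `extra_words` is a hypothetical stand-in for the external word lists:

```python
from pythainlp.corpus import thai_words
from pythainlp.tokenize import word_tokenize
from pythainlp.util import Trie

# Merge the default dictionary with additional words.
# `extra_words` is a hypothetical placeholder for the external word lists.
extra_words = ["พันทิป"]

custom_words = set(thai_words()) | set(extra_words)
custom_dict = Trie(custom_words)

# Tokenize using the enlarged dictionary.
tokens = word_tokenize("ข้อความจากพันทิป", custom_dict=custom_dict)
print(tokens)
```

Maintaining and re-merging such a set by hand is what every user currently has to do on their own.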

Possible implementation

Simply build a dictionary from the sources above and ship it as pythainlp.corpus.common.thai_words_large() or something similar (downloading dynamically from those sources on every use could be a burden for the providers).
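A minimal sketch of what that could look like, mirroring how existing word lists are loaded from bundled corpus files; the filename `words_th_large.txt` and the function name are assumptions, not existing PyThaiNLP code:

```python
from pythainlp.corpus import get_corpus

_THAI_WORDS_LARGE = frozenset()


def thai_words_large():
    """Return the large (300K+) Thai word list, loaded lazily.

    `words_th_large.txt` is a hypothetical corpus file built from the
    sources above and shipped with the package.
    """
    global _THAI_WORDS_LARGE
    if not _THAI_WORDS_LARGE:
        _THAI_WORDS_LARGE = get_corpus("words_th_large.txt")
    return _THAI_WORDS_LARGE
```

Shipping the file with the package keeps the load path identical to the existing thai_words() and avoids any network dependency at runtime.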

Metadata


Labels

corpus (corpus/dataset-related issues)
