[Suggestion] Add a large dictionary dataset #858
Comments
I agree, but the titles of Thai Wikipedia articles may not be good enough for a word list, because Thai Wikipedia uses long titles that can be segmented further, and many titles are not named entities. Example: https://th.wikipedia.org/wiki/รายชื่อปฏิบัติการทางทหารระหว่างสงครามโลกครั้งที่สอง/
Hmm, true…
Yes, PyThaiNLP has named entity recognition, but it may not be able to tell when a title is not a named entity. I think you can start by using deepcut to count the words in each title; if a title segments into 3 (or more) words, remove it.
I guess some categories (like "รายชื่อxxx", i.e. "List of xxx") should be easy to filter out with a simple pattern match.
I will check what patterns exist among the Wikipedia titles.
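A minimal sketch of the two filtering ideas above, combining a prefix match for "รายชื่อ" ("list of") titles with a deepcut word count; the `titles` input and the 3-word threshold are assumptions for illustration, not the actual cleaning code:

```python
# Sketch only: filter Thai Wikipedia titles by prefix and by deepcut word count.
import deepcut  # pip install deepcut


def filter_titles(titles, max_words=3):
    kept = []
    for title in titles:
        # "รายชื่อ..." ("List of ...") pages are list articles, not dictionary words
        if title.startswith("รายชื่อ"):
            continue
        # Drop titles that segment into many words; they are phrases, not entries
        if len(deepcut.tokenize(title)) > max_words:
            continue
        kept.append(title)
    return kept
```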
I made a plan to clean the Wikipedia titles data: https://github.com/konbraphat51/Thai_Dictionary_Cleaner/blob/wiki/Wikipedia.ipynb
Looks great. Those titles with parentheses, "XXX (YYY)", may be split into two entries, "XXX" and "YYY" (XXX and YYY may duplicate other existing entries; in that case, just keep one).
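As a rough sketch of that splitting step (the regex and function name are assumptions), each "XXX (YYY)" title could be broken into two candidate entries and collected in a set so duplicates collapse automatically:

```python
# Sketch only: split "XXX (YYY)" titles into two candidate entries and deduplicate.
import re

_PAREN = re.compile(r"^(?P<main>.+?)\s*\((?P<qualifier>[^)]+)\)$")


def split_parenthesized(titles):
    entries = set()
    for title in titles:
        match = _PAREN.match(title)
        if match:
            entries.add(match.group("main").strip())
            entries.add(match.group("qualifier").strip())
        else:
            entries.add(title)
    return entries  # a set, so duplicates with existing entries collapse to one
```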
I tried to clean the Wikipedia data. How does this seem? It is not 100% perfect, but could be worth using.
It looks good to me. @bact What do you think?
If it seems good, I will make a corpus implementation of these. How should I handle the license of the data sources?
Great work. I currently see a few leftover entries, such as Thai numerals, still included in the list.
I think using the Wikipedia titles can count as fair use under US law (we only use the titles, which are maybe < 1% of all the data). I think you can change the license to CC0 or just CC BY. I'm not sure about Volubilis.
For the Wikipedia titles, I don't think we will have any problem following the original license. Text in Wikipedia is CC BY-SA, which is an open license, so we can use and distribute it as is. Volubilis can be tricky; I'm not sure.
Oh, I didn't see that. I have excluded Thai numerals now.
What is the difference between Wikipedia and Volubilis? I guess they use the same license: the Creative Commons Attribution-ShareAlike 4.0 International License.
If Volubilis uses CC BY-SA, we can just use CC BY-SA as well.
Volubilis does not seem to state a license, but it says it is free: https://sourceforge.net/projects/belisan/
I think the code is nice and the output is usable. After fixing the case of one-character words, we can merge this and provide it as an option for users. Information should also be provided to the user about the characteristics of the word list. For example, for the word list derived from Wikipedia article titles:
Note: for the title รายชิ่อสนามกีฬาเรียงตามความจุ (a misspelled "list of stadiums by capacity" title), I think we should just manually remove this one from the final output. I have already asked on Thai Wikipedia to remove the article; it should not have been there in the first place.
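A small sketch of that last cleanup pass (the blocklist and names are assumptions): drop one-character entries and the manually excluded title before merging.

```python
# Sketch only: final cleanup of the word list before merging.
MANUAL_EXCLUDE = {"รายชิ่อสนามกีฬาเรียงตามความจุ"}  # misspelled article title noted above


def final_cleanup(words):
    # Keep entries longer than one character and not on the manual blocklist
    return {w for w in words if len(w) > 1 and w not in MANUAL_EXCLUDE}
```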
OK, I did it. I will now make a branch to add it as an optional corpus.
Okay, I made 2 PRs.
I found it. The license is shown in the right pane of the blog at https://belisan-volubilis.blogspot.com/
Detailed description
I would suggest adding a large (300K+ words) dictionary for better tokenization performance.
Context
I am currently doing text mining of Pantip. Pantip has a lot of new(?) words and proper nouns that pythainlp.corpus.common.thai_words() couldn't catch. But when I added new words from Thai Wikipedia article titles and the Volubilis dictionary, the performance improved by around 10%. The dictionary became 300K words in total.
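For reference, a minimal sketch of this kind of workaround using PyThaiNLP's custom-dictionary support; the file name extra_words.txt and the sample text are assumptions, not part of the actual setup described above:

```python
# Sketch only: extend the default word list with extra entries and tokenize with it.
from pythainlp.corpus.common import thai_words
from pythainlp.tokenize import word_tokenize
from pythainlp.util import dict_trie

# Extra entries (e.g. Wikipedia titles, Volubilis words) saved one word per line
with open("extra_words.txt", encoding="utf-8") as f:
    extra_words = {line.strip() for line in f if line.strip()}

custom_trie = dict_trie(dict_source=set(thai_words()) | extra_words)
tokens = word_tokenize("ข้อความตัวอย่างจากพันทิป", custom_dict=custom_trie)
```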
I guess this large dictionary could be useful for other users too if it were easily available (just an import from the pythainlp modules).
Possible implementation
Simply build the dictionary data from the sources above and serve it as pythainlp.corpus.common.thai_words_large() or something similar, since dynamically downloading from the sources above could be a burden for the providers.
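A minimal sketch of how such a loader could look inside pythainlp.corpus.common, assuming the merged word list is bundled as a corpus file; the file name words_th_large.txt and the function body are assumptions that mirror the pattern of the existing bundled word lists, not an actual implementation:

```python
# Sketch only: a possible loader for the proposed large word list.
from pythainlp.corpus import get_corpus

_THAI_WORDS_LARGE = frozenset()


def thai_words_large():
    """Return the proposed large (~300K entries) Thai word list."""
    global _THAI_WORDS_LARGE
    if not _THAI_WORDS_LARGE:
        # Load once from the bundled corpus file and cache it
        _THAI_WORDS_LARGE = get_corpus("words_th_large.txt")
    return _THAI_WORDS_LARGE
```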