
[Suggestion] Add a large dictionary data #858

Closed
konbraphat51 opened this issue Nov 7, 2023 · 21 comments · Fixed by #869 or #870
Labels
corpus corpus/dataset-related issues

Comments

@konbraphat51
Contributor

Detailed description

I would suggest adding a large (300K+ word) dictionary for better tokenization performance.

Context

I am currently doing text mining of Pantip. Pantip has a lot of new(?) words and proper nouns that pythainlp.corpus.common.thai_words() couldn't catch.
But when I added new words from

and the performance improved by around 10%. The dictionary became 300K words in total.

I think this large dictionary could be useful for other users too if it were easily available (just an import from pythainlp modules).

Possible implementation

Simply build a dictionary from the sources above and serve it as pythainlp.corpus.common.thai_words_large() or something similar (since dynamically downloading from the sources above could be a burden for the providers).

@wannaphong wannaphong added the corpus corpus/dataset-related issues label Nov 8, 2023
@wannaphong
Member

I agree, but titles of Thai Wikipedia articles may not be good enough for a word list, because Thai Wikipedia uses long titles that could be segmented further, and many titles are not named entities. Example: https://th.wikipedia.org/wiki/รายชื่อปฏิบัติการทางทหารระหว่างสงครามโลกครั้งที่สอง/

@konbraphat51
Contributor Author

Hmm, true…
Are there any handy pythainlp functions for detecting non-named entities?

@wannaphong
Member

Hmm, true… Are there any handy pythainlp functions for detecting non-named entities?

Yes, PyThaiNLP has named entity recognition, but it may not be able to detect that a word is not a named entity. I think you could start by using deepcut to count words: if an entry segments into more than 3 (or so) words, remove it.

@bact
Member

bact commented Nov 8, 2023

I guess some categories (like "รายชื่อxxx") should be easy to filter out with a simple pattern match.
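Such a prefix filter could be sketched in pure Python; the prefix list here is only an illustrative assumption ("รายชื่อ" means "list of"), and the function name is made up:

```python
import re

# Illustrative prefixes for Wikipedia "list of ..." style titles;
# the actual set of patterns to exclude is an assumption.
SKIP_PREFIXES = re.compile(r"^(รายชื่อ|รายนาม)")

def keep_title(title: str) -> bool:
    """Return True if a title should stay in the word list."""
    return not SKIP_PREFIXES.match(title)

titles = ["รายชื่อปฏิบัติการทางทหารระหว่างสงครามโลกครั้งที่สอง", "ประเทศไทย"]
kept = [t for t in titles if keep_title(t)]  # keeps only "ประเทศไทย"
```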

@konbraphat51
Contributor Author

I will check what patterns exist among the Wikipedia titles.
Please wait a while.

@bact
Member

bact commented Nov 10, 2023

Looks great.

Titles with parentheses, "XXX (YYY)", could be split into two entries: "XXX" and "YYY" (where XXX or YYY may duplicate other existing entries; in that case, just keep one).
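A minimal sketch of that split-and-deduplicate step (the function name is made up for illustration):

```python
import re

def split_parenthesized(title: str) -> set[str]:
    """Split an "XXX (YYY)" title into {"XXX", "YYY"}; other titles pass through."""
    m = re.fullmatch(r"(.+?)\s*\((.+)\)", title)
    if m:
        return {m.group(1).strip(), m.group(2).strip()}
    return {title}

entries: set[str] = set()  # a set keeps only one copy of duplicated entries
for title in ["กบ (สัตว์)", "กบ", "สัตว์เลี้ยง"]:
    entries |= split_parenthesized(title)
# entries now holds "กบ" once, plus "สัตว์" and "สัตว์เลี้ยง"
```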

@konbraphat51
Contributor Author

konbraphat51 commented Nov 22, 2023

I tried cleaning the Wikipedia data. How does this look?
https://github.com/konbraphat51/Thai_Dictionary_Cleaner/blob/wiki/Wikipedia/wikipedia_nlp.txt

This is not 100% perfect, but it could be worth using.

https://github.com/konbraphat51/Thai_Dictionary_Cleaner/blob/wiki/Wikipedia/Wikipedia.ipynb

@wannaphong
Member

I tried cleaning the Wikipedia data. How does this look? https://github.com/konbraphat51/Thai_Dictionary_Cleaner/blob/wiki/wikipedia_nlp.txt

This is not 100% perfect, but it could be worth using.

notebook: https://github.com/konbraphat51/Thai_Dictionary_Cleaner/blob/wiki/Wikipedia.ipynb

It looks good to me. @bact What do you think?

@konbraphat51
Contributor Author

If it seems good, I will make a corpus implementation of these.

How should I handle the licenses of the data sources?
wikipedia: https://github.com/konbraphat51/Thai_Dictionary_Cleaner/blob/wiki/Wikipedia/LISENCE.md
Volubilis: https://github.com/konbraphat51/Thai_Dictionary_Cleaner/blob/main/Volubilis/LISENCE.md_

@bact
Member

bact commented Nov 27, 2023

Great work.
One thing: how are we dealing with digits/numbers at the moment?

I currently see the likes of

  • ๑๖กรกฎาคม
  • ๓๑มีนาคม
  • ๑๘

in the output word list (wikipedia_nlp.txt).

@wannaphong
Member

If it seems good, I will make a corpus implementation of these.

How should I handle the licenses of the data sources? wikipedia: https://github.com/konbraphat51/Thai_Dictionary_Cleaner/blob/wiki/Wikipedia/LISENCE.md Volubilis: https://github.com/konbraphat51/Thai_Dictionary_Cleaner/blob/main/Volubilis/LISENCE.md_

For Wikipedia, I think it can be fair use under US law (we just use the titles, which are maybe < 1% of all the data). I think you could change it to CC0 or just CC BY.

I'm not sure about Volubilis.

@bact
Member

bact commented Nov 27, 2023

For the Wikipedia titles, I don't think we will have any problem following the original license. Text in Wikipedia is CC BY-SA, which is an open license; we can use and distribute it as it is.

Volubilis can be tricky. I'm not sure.

@konbraphat51
Contributor Author

konbraphat51 commented Nov 29, 2023

Great work.
One thing: how are we dealing with digits/numbers at the moment?
I currently see the likes of
๑๖กรกฎาคม
๓๑มีนาคม
๑๘
in the output word list (wikipedia_nlp.txt).

Oh, I didn't notice that. I have now excluded Thai numerals.
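The Thai-numeral exclusion can be sketched as a simple character-class filter (Thai digits occupy the Unicode range U+0E50–U+0E59; the helper name is illustrative):

```python
import re

THAI_DIGITS = re.compile(r"[๐-๙]")  # Thai digit characters ๐..๙ (U+0E50..U+0E59)

def has_thai_digit(word: str) -> bool:
    """Return True if the word contains at least one Thai digit."""
    return bool(THAI_DIGITS.search(word))

words = ["๑๖กรกฎาคม", "๓๑มีนาคม", "๑๘", "กรกฎาคม"]
cleaned = [w for w in words if not has_thai_digit(w)]  # keeps only "กรกฎาคม"
```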

@konbraphat51
Contributor Author

What is the difference between Wikipedia and Volubilis? I guess they have the same license: Creative Commons Attribution-ShareAlike 4.0 International.

@bact
Member

bact commented Nov 29, 2023

If Volubilis uses CC BY-SA, we can just use CC BY-SA as well.

@wannaphong
Member

Volubilis doesn't look like it has a license, but it says it is free. https://sourceforge.net/projects/belisan/

@bact
Member

bact commented Nov 29, 2023

I think the code is nice and the output is usable.

After fixing the case of having a one-character word, like:

we can merge this and provide it as an option for users.

Information should also be provided to users about the characteristics of the word list
(this will appear somewhere in the API doc).

For example, for the word list derived from Wikipedia article titles:

  • By the nature of Wikipedia titles, they are mostly nouns and noun phrases.
    • So, a word from this list can be very long, because some entries are noun phrases.
    • Examples: วิวัฒนาการกระดูกเล็กสำหรับได้ยินของสัตว์เลี้ยงลูกด้วยนม, ประเทศไทยสมัยก่อนประวัติศาสตร์, ภูเขาชิงเฉิงและระบบชลประทานตูเจียงยั่น, การปิดเมืองอู่ฮั่นเนื่องด้วยการระบาดทั่วของโรคโควิด
    • As for word length, users can easily deal with that themselves.
    • As for noun phrases, it depends on the application; users have to decide for themselves as well. We just provide the info.
  • Also by the nature of Wikipedia titles, they contain some misspellings.
    • These misspellings (or alternative spellings) serve as redirect pages in Wikipedia, so people who search with a typo will be redirected to the right page.
    • Examples: ประเทศเยอรมันนี, การโคลนิ่ง, รายชิ่อสนามกีฬาเรียงตามความจุ*
    • So the word list may not be suitable for spelling correction, but may be suitable for tokenization in some applications.

--

*Note: for รายชิ่อสนามกีฬาเรียงตามความจุ, I think we should just manually remove this one from the final output. I have already asked Thai Wikipedia to remove the article; it shouldn't have been there in the first place.

@konbraphat51
Contributor Author

After fixing the case of having a one-character word, like:

OK, I did it.

I will now make a branch for an optional corpus implementation.
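The one-character cleanup mentioned above can be as simple as a length filter (example words are illustrative):

```python
# Drop single-character entries, which are rarely useful dictionary words
words = ["ก", "ประเทศไทย", "กรุงเทพ", "ข"]
cleaned = [w for w in words if len(w) > 1]  # keeps only the multi-character words
```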

konbraphat51 added a commit to konbraphat51/pythainlp that referenced this issue Nov 30, 2023
@konbraphat51
Contributor Author

Okay, I made 2 PRs.
Let's go one by one.

@bact bact closed this as completed in #870 Dec 1, 2023
@bact bact reopened this Dec 1, 2023
@bact
Member

bact commented Dec 1, 2023

Volubilis doesn't look like it has a license, but it says it is free. https://sourceforge.net/projects/belisan/

@konbraphat51 can you check the license of Volubilis again?
Currently, in the recent commit, the license file says it uses CC BY-SA.
Do you have a link for that?

I found it. The license is shown in the right pane of the blog at https://belisan-volubilis.blogspot.com/

@bact bact closed this as completed in #869 Dec 1, 2023