
Question: newmm tokenizer, why not just thai characters? #855

Closed
konbraphat51 opened this issue Oct 25, 2023 · 3 comments · Fixed by #856
Labels
bug bugs in the library
Milestone

Comments

@konbraphat51
Contributor

Hi, I have a question about the implementation of the newmm tokenizer.

https://github.com/PyThaiNLP/pythainlp/blob/e3a01772f1dbe578e81119214d85226c0cbde466/pythainlp/tokenize/newmm.py#L38C1-L46C2

Here, why not simply permit only Thai characters?

I am having trouble with punctuation marks sometimes being included in tokens.
Ex. "ถ้าไม่รังเกียจสีหน้า(รถ)" -> ถ้า / ไม่รังเกียจ / สีหน้า / (รถ) // "รถ" is in the dictionary used

Also, if this is "dictionary-based maximal matching word segmentation", why didn't it take just "รถ"?

@bact bact added the bug bugs in the library label Oct 25, 2023
@bact bact added this to the Future milestone Oct 25, 2023
@wannaphong
Member

wannaphong commented Oct 26, 2023

I think it's because they are different algorithms.

  • newmm is dictionary-based maximal matching word segmentation constrained by Thai Character Cluster (TCC) boundaries.
  • mm is dictionary-based maximal matching word segmentation only.

If you want dictionary-based maximal matching word segmentation only, you can change the engine from newmm to mm: word_tokenize(text, engine="mm")

About newmm: https://github.com/PyThaiNLP/pythainlp/wiki/newmm-tokenization
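The "mm" behavior described above can be sketched as plain greedy longest-match segmentation: at each position, take the longest dictionary word, and emit any character that starts no known word as its own token. This is a minimal illustration, not PyThaiNLP's actual implementation; the tiny dictionary below is a made-up stand-in.

```python
# Toy dictionary for illustration only (PyThaiNLP uses a full Thai word list).
DICTIONARY = {"ถ้า", "ไม่", "รังเกียจ", "ไม่รังเกียจ", "สี", "หน้า", "สีหน้า", "รถ"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def maximal_match(text: str) -> list[str]:
    """Greedy dictionary-based maximal matching over raw characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i : i + length]
            if candidate in DICTIONARY:
                tokens.append(candidate)
                i += length
                break
        else:
            # No dictionary word starts here: emit the single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(maximal_match("ถ้าไม่รังเกียจสีหน้า(รถ)"))
# -> ['ถ้า', 'ไม่รังเกียจ', 'สีหน้า', '(', 'รถ', ')']
```

Under this greedy scheme the parentheses come out as separate tokens, matching the expected output discussed below; newmm's extra TCC-boundary constraint is what changes the result for the example in this issue.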

@wannaphong
Member

I'm not sure whether it is a bug or not, but this issue should be fixed.

@bact
Member

bact commented Oct 27, 2023

Currently it gives ["(รถ)"], but ["(", "รถ", ")"] is expected.

3 participants