
Question: newmm tokenizer, why not just thai characters? #855

Closed
konbraphat51 opened this issue Oct 25, 2023 · 3 comments · Fixed by #856
Labels
bug bugs in the library
Milestone

Comments

@konbraphat51
Contributor

Hi, I have a question about the implementation of the newmm tokenizer.

https://github.com/PyThaiNLP/pythainlp/blob/e3a01772f1dbe578e81119214d85226c0cbde466/pythainlp/tokenize/newmm.py#L38C1-L46C2

Here, why not simply permit only Thai characters?

I am having trouble with punctuation marks sometimes being included in tokens.
Ex. "ถ้าไม่รังเกียจสีหน้า(รถ)" -> ถ้า / ไม่รังเกียจ / สีหน้า / (รถ) // "รถ" is in the dictionary used

Also, if this is "dictionary-based maximal matching word segmentation", why didn't it take just "รถ"?

@bact bact added the bug bugs in the library label Oct 25, 2023
@bact bact added this to the Future milestone Oct 25, 2023
@wannaphong
Member

wannaphong commented Oct 26, 2023

I think it's because they are different algorithms.

  • newmm is dictionary-based maximal matching word segmentation constrained by Thai Character Cluster (TCC) boundaries.
  • mm is dictionary-based maximal matching word segmentation only.

If you want dictionary-based maximal matching word segmentation only, you can change the engine from newmm to mm: word_tokenize(text, engine="mm")

About newmm: https://github.com/PyThaiNLP/pythainlp/wiki/newmm-tokenization
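The "mm" behavior described above can be sketched as plain greedy longest-match segmentation: at each position, take the longest dictionary word, and emit any character that starts no known word as its own token. This is a minimal illustration, not PyThaiNLP's actual implementation; the tiny dictionary below is a made-up stand-in.

```python
# Toy dictionary for illustration only (PyThaiNLP uses a full Thai word list).
DICTIONARY = {"ถ้า", "ไม่", "รังเกียจ", "ไม่รังเกียจ", "สี", "หน้า", "สีหน้า", "รถ"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def maximal_match(text: str) -> list[str]:
    """Greedy dictionary-based maximal matching over raw characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i : i + length]
            if candidate in DICTIONARY:
                tokens.append(candidate)
                i += length
                break
        else:
            # No dictionary word starts here: emit the single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(maximal_match("ถ้าไม่รังเกียจสีหน้า(รถ)"))
# -> ['ถ้า', 'ไม่รังเกียจ', 'สีหน้า', '(', 'รถ', ')']
```

Under this greedy scheme the parentheses come out as separate tokens, matching the expected output discussed below; newmm's extra TCC-boundary constraint is what changes the result for the example in this issue.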

@wannaphong
Member

I'm not sure whether it is a bug or not, but this issue should be fixed.

@bact
Member

bact commented Oct 27, 2023

Currently it gives ["(รถ)"], but ["(", "รถ", ")"] is expected.

3 participants