Hi, I have a question about the implementation of the newmm tokenizer.

https://github.com/PyThaiNLP/pythainlp/blob/e3a01772f1dbe578e81119214d85226c0cbde466/pythainlp/tokenize/newmm.py#L38C1-L46C2

Why not simply permit only Thai characters here?

I am having trouble with signs (such as parentheses) sometimes being included in the tokens.

Example: "ถ้าไม่รังเกียจสีหน้า(รถ)" -> ถ้า / ไม่รังเกียจ / สีหน้า / (รถ)  // "รถ" is in the dictionary used

Also, if this is dictionary-based maximal matching word segmentation, why didn't it take just "รถ"?
newmm is dictionary-based maximal matching word segmentation constrained by Thai Character Cluster (TCC) boundaries.

mm is dictionary-based maximal matching word segmentation only.

If you want plain dictionary-based maximal matching, change the engine from newmm to mm: word_tokenize(text, engine="mm")
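For reference, here is a minimal sketch comparing the two engines with PyThaiNLP's word_tokenize. The exact segmentation depends on the installed PyThaiNLP version and the dictionary in use, so treat the output as illustrative only.

```python
# Minimal sketch: comparing the newmm and mm engines in PyThaiNLP.
# The actual tokens depend on the installed version and dictionary,
# so the results are illustrative rather than guaranteed.
from pythainlp.tokenize import word_tokenize

text = "ถ้าไม่รังเกียจสีหน้า(รถ)"

# Default engine: dictionary-based maximal matching constrained by TCC boundaries.
print(word_tokenize(text, engine="newmm"))

# Dictionary-based maximal matching only, without the TCC constraint.
print(word_tokenize(text, engine="mm"))
```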