Improve: [newmm tokenizer] Change regular expression of "non-thai-characters" #856
Before: non-Thai characters were described directly by explicit rules. After: they are simply defined as "anything except Thai characters".
It seems that this change makes the tokenization finer-grained than the test cases expect.
Can you update the pull request? 9df5a4a
Greetings PR check
I don't understand this. Is this error common in this project?
Update thai2fit tokenizer
Updated.
It seems that 9df5a4a causes a unit-test error. Can it be ignored?
Yes, it's a self-hosting issue, but I don't have time to set up a new one. The unit tests run by GitHub look good: https://github.com/PyThaiNLP/pythainlp/actions/runs/6718024461/job/18256943528
I added some rules to fix the error: konbraphat51#2
Fixed regex
To make further maintenance easier
Hello @konbraphat51! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2023-11-01 13:08:42 UTC
I merged and modified @wannaphong's PR. Please check.
OK. It look
I fixed it.
In my case, fb3e7bb showed The last
3d889f7 showed
Oh sorry, I was testing by
Intention for ` \t\r\n`
Kudos, SonarCloud Quality Gate passed!
Added comments for further maintenance.
Awesome 💯
Looks fine. Doesn't break the number grouping.
What does this change
Make the newmm tokenization more accurate; recognize more characters as "non-thai"
What was wrong
#855
It sometimes didn't recognize non-thai symbols as non-thai
"(คนไม่เอา)" -> ['(คน', 'ไม่', 'เอา', ')']
"กม/ชม" -> ['กม', '/ชม']
"สีหน้า(รถ)" -> ['สีหน้า', '(รถ)']
How this fixes it
Fixed the recognition method of "non-thai-character".
The examples above are all improved.
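A minimal sketch of the "anything except Thai characters" idea, for illustration only. It is not the regular expression merged in this PR and does not call PyThaiNLP. It assumes that "Thai character" means the Unicode Thai block (U+0E00 to U+0E7F) and that whitespace (` \t\r\n`) is kept as its own group. The names `THAI`, `WHITESPACE`, and `split_runs` are hypothetical. The sketch only separates Thai runs from non-Thai runs; it does no dictionary-based word segmentation, so Thai words stay joined.

```python
import re

# Assumption: Thai characters are those in the Unicode Thai block.
THAI = r"\u0E00-\u0E7F"
# Assumption: whitespace is grouped separately from other non-Thai characters.
WHITESPACE = r" \t\r\n"

# A run is one of: Thai characters, whitespace, or anything else.
# "Anything else" is written as a negated class rather than an explicit
# list of symbols, so no non-Thai symbol can be forgotten.
_RUNS = re.compile(rf"[{THAI}]+|[{WHITESPACE}]+|[^{THAI}{WHITESPACE}]+")


def split_runs(text):
    """Split text into maximal runs of Thai, whitespace, or other characters."""
    return _RUNS.findall(text)


print(split_runs("(คนไม่เอา)"))  # ['(', 'คนไม่เอา', ')']
print(split_runs("กม/ชม"))       # ['กม', '/', 'ชม']
print(split_runs("สีหน้า(รถ)"))   # ['สีหน้า', '(', 'รถ', ')']
```

With a negated character class, symbols such as `(`, `)` and `/` are always separated from adjacent Thai text, which is the behavior the examples above were missing.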
Your checklist for this pull request