Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve: [newmm tokenizer] Change regular expression of "non-thai-characters" #856

Merged
merged 13 commits into from
Nov 5, 2023

Conversation

konbraphat51
Copy link
Contributor

@konbraphat51 konbraphat51 commented Nov 1, 2023

What does this changes

Make the newmm tokenization more accurate; recognize more characters as "non-thai"

What was wrong

#855
It sometimes didn't recognize non-thai symbols as non-thai
"(คนไม่เอา)" -> ['(คน', 'ไม่', 'เอา', ')']
"กม/ชม" -> ['กม', '/ชม']
"สีหน้า(รถ)" -> ['สีหน้า', '(รถ)']

How this fixes it

Fixed the recognition method of "non-thai-character".
The examples above are all improved.

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit test

Before: directly descript non-thai-characters by rule-based
After: Just set as "anything except Thai-characters"
@wannaphong wannaphong linked an issue Nov 1, 2023 that may be closed by this pull request
@bact bact added the bug bugs in the library label Nov 1, 2023
@bact bact added this to the 4.1 milestone Nov 1, 2023
@konbraphat51
Copy link
Contributor Author

['👍👍👍 # ana', 'มาก', 'xxrep', 'xxwrep', 'น้อย', '.1146'] != ['xxwrep', '👍', '#', 'ana', 'มาก', 'xxrep', 'xxwrep', 'น้อย', '.', '1146']

It seems that this change makes the tokenization more minute than the test-case.
What do orginizers here think about this?

@wannaphong
Copy link
Member

['👍👍👍 # ana', 'มาก', 'xxrep', 'xxwrep', 'น้อย', '.1146'] != ['xxwrep', '👍', '#', 'ana', 'มาก', 'xxrep', 'xxwrep', 'น้อย', '.', '1146']

It seems that this change makes the tokenization more minute than the test-case. What do orginizers here think about this?

Can you update the pull request? 9df5a4a

@konbraphat51
Copy link
Contributor Author

Greetings PR check

Resource not accessible by integration

I don't understand this. It this error common in this project?

Update thai2fit tokenizer
@konbraphat51
Copy link
Contributor Author

['👍👍👍 # ana', 'มาก', 'xxrep', 'xxwrep', 'น้อย', '.1146'] != ['xxwrep', '👍', '#', 'ana', 'มาก', 'xxrep', 'xxwrep', 'น้อย', '.', '1146']

It seems that this change makes the tokenization more minute than the test-case. What do orginizers here think about this?

Can you update the pull request? 9df5a4a

Updated

@konbraphat51
Copy link
Contributor Author

konbraphat51 commented Nov 1, 2023

It seems that there is unit-test error occuring by 9df5a4a

Ignorable?

@wannaphong
Copy link
Member

Ignorable

Yes, It's self-host issues but I don't have time to new setup. The unit-test by GitHub is look good https://github.com/PyThaiNLP/pythainlp/actions/runs/6718024461/job/18256943528

@coveralls
Copy link

Coverage Status

coverage: 0.0% (-87.0%) from 86.983% when pulling a5b6db4 on konbraphat51:dev into 9df5a4a on PyThaiNLP:dev.

@wannaphong
Copy link
Member

wannaphong commented Nov 1, 2023

I add some rule to fixed the error. konbraphat51#2

@pep8speaks
Copy link

pep8speaks commented Nov 1, 2023

Hello @konbraphat51! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 42:1: E122 continuation line missing indentation or outdented

Comment last updated at 2023-11-01 13:08:42 UTC

@konbraphat51
Copy link
Contributor Author

I merged and modified @wannaphong PR. Please check.

@wannaphong
Copy link
Member

OK. It look ['เวลา', ' ', '12:12pm ', 'มี', 'โปรโมชั่น', ' ', '11.11'] is error. ['เวลา', ' ', '12:12pm',' ', 'มี', 'โปรโมชั่น', ' ', '11.11'] is correct.

@wannaphong
Copy link
Member

I fixed.

@konbraphat51
Copy link
Contributor Author

In my case, fb3e7bb showed
['เวลา', ' ', '12', ':12pm ', 'มี', 'โปรโมชั่น', ' ', '11.11']

The last [^\u0E00-\u0E7F]+ scoups up all the non-thai character until it hits Thai character

@wannaphong
Copy link
Member

In my case, fb3e7bb showed ['เวลา', ' ', '12', ':12pm ', 'มี', 'โปรโมชั่น', ' ', '11.11']

The last [^\u0E00-\u0E7F]+ scoups up all the non-thai character until it hits Thai character

3d889f7 showed

>>> from pythainlp.tokenize import word_tokenize
>>> word_tokenize("เวลา 12:12pm มีโปรโมชั่น 11.11")
['เวลา', ' ', '12:12pm', ' ', 'มี', 'โปรโมชั่น', ' ', '11.11']
>>> word_tokenize("กม/ชม")
['กม', '/', 'ชม']
>>> word_tokenize("(คนไม่เอา)")
['(', 'คน', 'ไม่', 'เอา', ')']
>>> word_tokenize("สีหน้า(รถ)")
['สีหน้า', '(', 'รถ', ')']

@konbraphat51
Copy link
Contributor Author

In my case, fb3e7bb showed ['เวลา', ' ', '12', ':12pm ', 'มี', 'โปรโมชั่น', ' ', '11.11']
The last [^\u0E00-\u0E7F]+ scoups up all the non-thai character until it hits Thai character

3d889f7 showed

>>> from pythainlp.tokenize import word_tokenize
>>> word_tokenize("เวลา 12:12pm มีโปรโมชั่น 11.11")
['เวลา', ' ', '12:12pm', ' ', 'มี', 'โปรโมชั่น', ' ', '11.11']
>>> word_tokenize("กม/ชม")
['กม', '/', 'ชม']
>>> word_tokenize("(คนไม่เอา)")
['(', 'คน', 'ไม่', 'เอา', ')']
>>> word_tokenize("สีหน้า(รถ)")
['สีหน้า', '(', 'รถ', ')']

Oh sorry, I was testing by segment() function. Didn't notice there was a post-process.
3d889f7 This seems good to me :)

Interntion for ` \t\r\n`
Copy link

sonarqubecloud bot commented Nov 1, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@konbraphat51
Copy link
Contributor Author

Added the commentation for further maintenance.

Copy link
Member

@wannaphong wannaphong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome 💯

@wannaphong wannaphong requested a review from bact November 3, 2023 17:21
Copy link
Member

@bact bact left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine. Doesn't break the number grouping.

@wannaphong wannaphong merged commit de28159 into PyThaiNLP:dev Nov 5, 2023
10 of 12 checks passed
@bact bact mentioned this pull request Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug bugs in the library
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Question: newmm tokenizer, why not just thai characters?
5 participants