Add support for list of strings as input to sent_tokenize() #927
Conversation
Hello! Thank you for your pull request! It looks like it doesn't fix the issue yet.
I think it should output a list of words too.
Example: ["ผม", "กิน", "ข้าว", " ", "เธอ", "เล่น", "เกม"] -> "ผมกินข้าว เธอเล่นเกม" (store index 0-1 as "ผม", 2-4 as "กิน", ...) -> ["ผมกินข้าว", " ", "เธอเล่นเกม"] -> [["ผม", "กิน", "ข้าว"], [" "], ["เธอ", "เล่น", "เกม"]]
Can you edit the code?
Hi @wannaphong, you did not specify those changes before. I have now made the changes as you requested. Feel free to ask if you need further changes.
Can you add these functions to sent_tokenize?

```python
def indices_words(words):
    indices = []
    start_index = 0
    for word in words:
        if len(word) > 1:
            _temp = len(word) - 1
        else:
            _temp = 1
        indices.append((start_index, start_index + _temp))
        start_index += len(word)
    return indices

list_word = ['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']
index = indices_words(list_word)  # [(0, 1), (2, 8), (9, 10), (10, 12), (13, 16), (17, 19)]
list_sent = sent_tokenize(list_word)  # ['ผมกินข้าว ', 'เธอเล่นเกม']
```
```python
import copy

def map_indices_to_words(index_list, sentences):
    result = []
    c = copy.copy(index_list)
    n_sum = 0
    for sentence in sentences:
        words = sentence
        sentence_result = []
        n = 0
        for start, end in c:
            if start > n_sum + len(words) - 1:
                break
            else:
                word = sentence[start - n_sum:end + 1 - n_sum]
                sentence_result.append(word)
                n += 1
        result.append(sentence_result)
        n_sum += len(words)
        for _ in range(n):
            del c[0]
    return result

list_sent_word = map_indices_to_words(index, list_sent)  # [['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']]
```
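The two helpers above can be exercised end to end. The sketch below is illustrative only: `split_after_spaces` is a toy sentence splitter standing in for pythainlp's `sent_tokenize`, and the English sample words are assumptions chosen so the round trip is easy to verify by eye.

```python
import copy

def indices_words(words):
    # Record a (start, end) character span for each word in the joined text.
    indices, start = [], 0
    for word in words:
        span = len(word) - 1 if len(word) > 1 else 1
        indices.append((start, start + span))
        start += len(word)
    return indices

def map_indices_to_words(index_list, sentences):
    # Re-split each sentence back into the original words using the spans.
    result, c, n_sum = [], copy.copy(index_list), 0
    for sentence in sentences:
        sentence_result, n = [], 0
        for start, end in c:
            if start > n_sum + len(sentence) - 1:
                break
            sentence_result.append(sentence[start - n_sum:end + 1 - n_sum])
            n += 1
        result.append(sentence_result)
        n_sum += len(sentence)
        del c[:n]  # drop the spans consumed by this sentence
    return result

def split_after_spaces(text):
    # Toy stand-in for sent_tokenize: cut after each space, keeping it.
    sentences, buf = [], ""
    for ch in text:
        buf += ch
        if ch == " ":
            sentences.append(buf)
            buf = ""
    if buf:
        sentences.append(buf)
    return sentences

words = ["we", "eat", " ", "you", "play"]
offsets = indices_words(words)                       # [(0, 1), (2, 4), (5, 6), (6, 8), (9, 12)]
sentences = split_after_spaces("".join(words))       # ['weeat ', 'youplay']
print(map_indices_to_words(offsets, sentences))      # [['we', 'eat', ' '], ['you', 'play']]
```

Note that this round trip only survives the off-by-one span for length-1 words when the single character sits at the end of a sentence, where Python's slice clipping hides it.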
@wannaphong, I'm a bit confused.
Yes, it can still keep the list of words. Every sentence tokenization engine has a different word tokenizer inside it, but the engine can only output a list of sentences, so we can keep the original list of words to restore the words within each sentence. Example: I used the deepcut tokenizer for my text and then the crfcut engine, which uses newmm internally. The output is a list of sentences. If I use map_indices_to_words, it can output the list of words (from deepcut) inside the list of sentences.
Sorry, but I do not have any idea about the tokenization engine and how it works. Could you please provide more details? Then maybe I can help you with this.
From https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/core.py#L325-L507. Here are the sentence tokenization engines:
crfcut, thaisum, tltk, and wtp have their own word tokenizer inside the engine, so I think it should preprocess by storing the word indices and using those indices to tokenize the words in the list of sentences. List of words (['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']) -> Text ("ผมกินข้าว เธอเล่นเกม", and store indices) -> sentence tokenizer -> List of sentences (['ผมกินข้าว ', 'เธอเล่นเกม']) -> List of sentences with the list of words inside ([['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']])
Hello @ayaan-qadri! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2024-10-28 15:29:40 UTC
@wannaphong, thanks for the explanation. Please check again; I made a mistake in a past commit too, sorry for that.
@wannaphong, check it now.
I found two bugs in two cases.
Results: Expected results:
Results: Expected results:
@wannaphong, the commit fixes the default engine 'crfcut', but it is not producing the expected result for the whitespace engine. Can you please handle it from this point?
I fixed the whitespace engine.
Thank you @ayaan-qadri for your pull request! Cheers! 🍻
What does this change
The sent_tokenize function now also supports a list of strings.
What was wrong
Before the changes, the sent_tokenize function accepted only a string as its parameter.
How this fixes it
The list of strings is joined into a single string using the join method.
Fixes #906
Your checklist for this pull request
🚨 Please review the guidelines for contributing to this repository.