
Add support for list of strings as input to sent_tokenize() #927

Merged: 11 commits, merged Oct 28, 2024

Conversation

@ayaan-qadri (Contributor) commented Oct 8, 2024

What does this change

The sent_tokenize function now also supports a list of strings as input.

What was wrong

Before this change, the sent_tokenize function accepted only a string as its parameter.

How this fixes it

The list of strings is joined into a single string using the join method.

Fixes #906
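In outline, the change amounts to something like the sketch below (illustrative only, not the merged diff; _normalize_input is a hypothetical helper name):

from typing import List, Union


def _normalize_input(text: Union[str, List[str]]) -> str:
    # Accept either a string or a list of pre-tokenized words and
    # normalize to a single string before sentence tokenization.
    if isinstance(text, list):
        return "".join(text)
    return text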

Your checklist for this pull request

🚨 Please review the guidelines for contributing to this repository.

  • Passed code style and structure checks
  • Passed code linting checks and unit tests

@wannaphong (Member) left a comment

Hello! Thank you for your pull request! It looks like it may not fix the issue.

I think it should output a list of words too.

Example:

["ผม", "กิน", "ข้าว", " ", "เธอ", "เล่น", "เกม"]
-> "ผมกินข้าว เธอเล่นเกม" (storing character indices: 0-1 for "ผม", 2-4 for "กิน", ...)
-> ["ผมกินข้าว", " ", "เธอเล่นเกม"] (using the stored indices)
-> [["ผม", "กิน", "ข้าว"], [" "], ["เธอ", "เล่น", "เกม"]]

Can you edit the code?

@ayaan-qadri (Contributor, Author)

Hi @wannaphong, you did not specify those changes before. I made the changes as you requested. Feel free to ask if you need further changes.

@wannaphong (Member) commented Oct 12, 2024

Can you add those functions to sent_tokenize?

def indices_words(words):
    # Build a (start, end) character span for each word. Note that a
    # 1-character word gets end = start + 1, a quirk that is flagged as
    # a bug later in this thread.
    indices = []
    start_index = 0

    for word in words:
        if len(word) > 1:
            _temp = len(word) - 1
        else:
            _temp = 1
        indices.append((start_index, start_index + _temp))
        start_index += len(word)

    return indices


list_word = ['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']
index = indices_words(list_word)  # [(0, 1), (2, 8), (9, 10), (10, 12), (13, 16), (17, 19)]
list_sent = sent_tokenize(list_word)  # ['ผมกินข้าว ', 'เธอเล่นเกม']

import copy


def map_indices_to_words(index_list, sentences):
    # Walk the sentences, consuming spans from index_list and slicing
    # each word back out of the sentence it falls into.
    result = []
    c = copy.copy(index_list)
    n_sum = 0  # characters consumed by previous sentences

    for sentence in sentences:
        words = sentence
        sentence_result = []
        n = 0

        for start, end in c:
            if start > n_sum + len(words) - 1:
                break  # this span belongs to a later sentence
            word = sentence[start - n_sum:end + 1 - n_sum]
            sentence_result.append(word)
            n += 1

        result.append(sentence_result)
        n_sum += len(words)
        del c[:n]  # drop the spans consumed by this sentence

    return result


list_sent_word = map_indices_to_words(index, list_sent)  # [['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']]

@ayaan-qadri (Contributor, Author)

@wannaphong, I'm a bit confused.

  1. Where do you want the map_indices_to_words and indices_words functions implemented?
  2. Are the last changes to the sent_tokenize function fine?
  3. Regarding:

     list_word = ['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']
     index = indices_words(list_word)  # [(0, 1), (2, 8), (9, 10), (10, 12), (13, 16), (17, 19)]
     list_sent = sent_tokenize(list_word)  # ['ผมกินข้าว ', 'เธอเล่นเกม']

  • The current sent_tokenize already returns that output (['ผมกินข้าว ', 'เธอเล่นเกม']) if keep_whitespace is False.

@wannaphong (Member)

> @wannaphong, I'm a bit confused.
>
> 1. Where do you want the map_indices_to_words and indices_words functions implemented?
> 2. Are the last changes to the sent_tokenize function fine?
> 3. The current sent_tokenize already returns that output (['ผมกินข้าว ', 'เธอเล่นเกม']) if keep_whitespace is False.

Yes, it can still keep the list of words. Each sentence tokenization engine has a different word tokenizer inside it, but an engine can only output a list of sentences, so we keep the input word list in order to restore the words within each sentence.

Example: I word-tokenized my text with deepcut and used crfcut, which runs newmm internally, for sentence segmentation. The output is a list of sentences. If I use map_indices_to_words, it can instead output the words (from deepcut) grouped by sentence, as in the sketch below.
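A hypothetical illustration of that workflow (deepcut and crfcut are real PyThaiNLP engine names; deepcut needs the optional deepcut package installed):

from pythainlp.tokenize import sent_tokenize, word_tokenize

# Word-tokenize with deepcut, then sentence-tokenize the rejoined text
# with crfcut, which runs its own newmm word tokenizer internally.
words = word_tokenize("ผมกินข้าว เธอเล่นเกม", engine="deepcut")
sents = sent_tokenize("".join(words), engine="crfcut")
# map_indices_to_words from the comment above would then regroup
# `words` under each sentence in `sents`.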

@ayaan-qadri (Contributor, Author)

Sorry, but I do not have any idea about the tokenization engine and how it works. Could you please provide more details? Then maybe I could help you with this.

@wannaphong (Member)

> Sorry, but I do not have any idea about the tokenization engine and how it works. Could you please provide more details? Then maybe I could help you with this.

From https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/core.py#L325-L507:

Here are the sentence tokenization engines (a brief usage sketch follows the list):

  • crfcut - (default) split by a CRF trained on the TED dataset
  • thaisum - the implementation of the sentence segmenter from Nakhun Chumpolsathien, 2020
  • tltk - split by TLTK (https://pypi.org/project/tltk/)
  • wtp - split by wtpsplit (https://github.com/bminixhofer/wtpsplit)
  • whitespace+newline - split by whitespace and newline
  • whitespace - split by whitespace, specifically with the regex pattern r" +"
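A brief usage sketch (engine selection in sent_tokenize is a keyword argument):

from pythainlp.tokenize import sent_tokenize

text = "ผมกินข้าว เธอเล่นเกม"
sent_tokenize(text)                       # default engine: crfcut
sent_tokenize(text, engine="whitespace")  # split with the regex r" +"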

crfcut, thaisum, tltk, and wtp each have their own word tokenizer inside the engine, so I think we should preprocess by storing the word indices and then use those indices to split the words back out of the list of sentences.

List of words (['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']) -> text ("ผมกินข้าว เธอเล่นเกม", with the word indices stored) -> sentence tokenizer -> list of sentences (['ผมกินข้าว ', 'เธอเล่นเกม']) -> list of word lists per sentence ([['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']])
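A minimal self-contained sketch of that round trip (words_to_sentences is an illustrative name, not PyThaiNLP API; it uses end-exclusive spans instead of the inclusive spans in the earlier comment, and the exact sentence splits depend on the engine):

from pythainlp.tokenize import sent_tokenize


def words_to_sentences(words):
    # Record the end-exclusive character span of each word in the joined text.
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w)))
        pos += len(w)

    text = "".join(words)
    sentences = sent_tokenize(text)  # assumes the engine preserves every character

    # Regroup the words by the sentence their span starts in.
    result, sent_end, i = [], 0, 0
    for sent in sentences:
        sent_end += len(sent)
        group = []
        while i < len(spans) and spans[i][0] < sent_end:
            group.append(words[i])
            i += 1
        result.append(group)
    return result


words_to_sentences(['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม'])
# e.g. [['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']]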

@pep8speaks commented Oct 13, 2024

Hello @ayaan-qadri! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-10-28 15:29:40 UTC

@ayaan-qadri (Contributor, Author)

@wannaphong, thanks for the explanation. Please check again; I made a mistake in a past commit too, sorry for that.

@wannaphong (Member)

I found a bug.

[screenshot]

The indices store whitespace, but the whitespace and whitespace+newline engines tokenize on whitespace. Can you re-create word_indices in those engines?

word_indices = indices_words(the text with whitespace removed for the whitespace engine, and with whitespace and newlines removed for whitespace+newline)
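A hypothetical sketch of that request (the filtering is illustrative; indices_words is the helper from the earlier comment):

words = ['ผม', 'กิน', 'ข้าว', ' ', '\n', 'เธอ', 'เล่น', 'เกม']

# The separators are consumed by the split, so index only the words the
# engine will actually emit.
ws_words = [w for w in words if not all(ch == ' ' for ch in w)]  # whitespace engine keeps '\n'
wsnl_words = [w for w in words if not w.isspace()]               # whitespace+newline engine

word_indices = indices_words(wsnl_words)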

@ayaan-qadri (Contributor, Author)

@wannaphong, check it now.

@wannaphong (Member)

I found two bugs in two cases.

list_word = ["ผม", "กิน", "ข้าว", " ", "\n", "เธอ", "เล่น", "เกม"]
sent_tokenize(list_word)

Result: [['ผม', 'กิน', 'ข้าว', ' \n', '\nเ', 'เธอ', 'เล่น', 'เกม']]

Expected result: [['ผม', 'กิน', 'ข้าว', ' ', '\n', 'เธอ', 'เล่น', 'เกม']]

list_word = ["ผม", "กิน", "ข้าว", " ", "\n", "เธอ", "เล่น", "เกม"]
sent_tokenize(list_word, engine="whitespace")

Result: [['ผม', 'กิน', 'ข้าว'], ['\nเธ', 'อเล่', 'นเก']]

Expected result: [['ผม', 'กิน', 'ข้าว'], ['\n', 'เธอ', 'เล่น', 'เกม']]
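A plausible reproduction of the first artifact, assuming the span quirk in indices_words from the earlier comment (a 1-character word gets the span (start, start + 1), so slicing end + 1 also grabs the next character):

text = "".join(["ผม", "กิน", "ข้าว", " ", "\n", "เธอ", "เล่น", "เกม"])
# " " sits at index 9 and "\n" at index 10; spans (9, 10) and (10, 11)
# sliced as text[start:end + 1] reproduce the reported artifacts.
print(repr(text[9:11]))   # ' \n' instead of ' '
print(repr(text[10:12]))  # '\nเ' instead of '\n'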

@ayaan-qadri
Copy link
Contributor Author

@wannaphong the commit fixes for default engine 'crfcut' but it is not producing result as mentioned for whitespace engine, Can you please handle from this point.

@bact added the enhancement label Oct 27, 2024
@wannaphong (Member)

I fixed the whitespace engine.

@wannaphong added the hacktoberfest-accepted label Oct 28, 2024

@wannaphong (Member)

Thank you @ayaan-qadri for your pull request!

Cheers! 🍻

@wannaphong merged commit b11fe00 into PyThaiNLP:dev on Oct 28, 2024
7 of 12 checks passed

@bact mentioned this pull request Nov 2, 2024
@bact changed the title from "Added list of string support to sent_tokenize" to "Allow sent_tokenize to accept a list of strings as input" Nov 3, 2024
@bact changed the title from "Allow sent_tokenize to accept a list of strings as input" to "Add support for list of strings as input to sent_tokenize()" Nov 3, 2024
Labels: enhancement, hacktoberfest-accepted

Development: successfully merging this pull request may close the issue "Get List of words to sent_tokenize".

4 participants