
Add support for list of strings as input to sent_tokenize() #927

Merged: 11 commits, merged Oct 28, 2024

Conversation

@ayaan-qadri (Contributor) commented Oct 8, 2024

What does this change

The sent_tokenize function now also supports a list of strings as input.

What was wrong

Before this change, the sent_tokenize function accepted only a string as its parameter.

How this fixes it

The list of strings is joined into a single string using the join method.

Fixes #906
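In outline, the change amounts to something like the sketch below (illustrative only, not the merged diff; _normalize_input is a hypothetical helper name):

from typing import List, Union


def _normalize_input(text: Union[str, List[str]]) -> str:
    # Accept either a string or a list of pre-tokenized words and
    # normalize to a single string before sentence tokenization.
    if isinstance(text, list):
        return "".join(text)
    return text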

Your checklist for this pull request

🚨 Please review the guidelines for contributing to this repository.

  • Passed code style and structure checks
  • Passed code linting checks and unit tests

@wannaphong (Member) left a comment

Hello! Thank you for your pull request! It looks like it may not fix the issue.

I think it should output a list of words too.

Example:

["ผม", "กิน", "ข้าว", " ", "เธอ", "เล่น", "เกม"]
-> "ผมกินข้าว เธอเล่นเกม" (storing character indices: 0-1 for "ผม", 2-4 for "กิน", ...)
-> ["ผมกินข้าว", " ", "เธอเล่นเกม"] (using the stored indices)
-> [["ผม", "กิน", "ข้าว"], [" "], ["เธอ", "เล่น", "เกม"]]

Can you edit the code?

@ayaan-qadri (Contributor, Author)

Hi @wannaphong, you did not specify those changes before. I made the changes as you requested. Feel free to ask if you need further changes.

@wannaphong (Member) commented Oct 12, 2024

Can you add those functions to sent_tokenize?

def indices_words(words):
    # Build a (start, end) character span for each word. Note that a
    # 1-character word gets end = start + 1, a quirk that is flagged as
    # a bug later in this thread.
    indices = []
    start_index = 0

    for word in words:
        if len(word) > 1:
            _temp = len(word) - 1
        else:
            _temp = 1
        indices.append((start_index, start_index + _temp))
        start_index += len(word)

    return indices


list_word = ['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']
index = indices_words(list_word)  # [(0, 1), (2, 8), (9, 10), (10, 12), (13, 16), (17, 19)]
list_sent = sent_tokenize(list_word)  # ['ผมกินข้าว ', 'เธอเล่นเกม']

import copy


def map_indices_to_words(index_list, sentences):
    # Walk the sentences, consuming spans from index_list and slicing
    # each word back out of the sentence it falls into.
    result = []
    c = copy.copy(index_list)
    n_sum = 0  # characters consumed by previous sentences

    for sentence in sentences:
        words = sentence
        sentence_result = []
        n = 0

        for start, end in c:
            if start > n_sum + len(words) - 1:
                break  # this span belongs to a later sentence
            word = sentence[start - n_sum:end + 1 - n_sum]
            sentence_result.append(word)
            n += 1

        result.append(sentence_result)
        n_sum += len(words)
        del c[:n]  # drop the spans consumed by this sentence

    return result


list_sent_word = map_indices_to_words(index, list_sent)  # [['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']]

@ayaan-qadri (Contributor, Author)

@wannaphong, I'm a bit confused.

  1. Where do you want the map_indices_to_words and indices_words functions implemented?
  2. Are the last changes to the sent_tokenize function fine?
  3. Regarding:

     list_word = ['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']
     index = indices_words(list_word)  # [(0, 1), (2, 8), (9, 10), (10, 12), (13, 16), (17, 19)]
     list_sent = sent_tokenize(list_word)  # ['ผมกินข้าว ', 'เธอเล่นเกม']

  • The current sent_tokenize already returns that output (['ผมกินข้าว ', 'เธอเล่นเกม']) if keep_whitespace is False.

@wannaphong (Member)

> @wannaphong, I'm a bit confused.
>
> 1. Where do you want the map_indices_to_words and indices_words functions implemented?
> 2. Are the last changes to the sent_tokenize function fine?
> 3. The current sent_tokenize already returns that output (['ผมกินข้าว ', 'เธอเล่นเกม']) if keep_whitespace is False.

Yes, it can still keep the list of words. Each sentence tokenization engine has a different word tokenizer inside it, but an engine can only output a list of sentences, so we keep the input word list in order to restore the words within each sentence.

Example: I word-tokenized my text with deepcut and used crfcut, which runs newmm internally, for sentence segmentation. The output is a list of sentences. If I use map_indices_to_words, it can instead output the words (from deepcut) grouped by sentence, as in the sketch below.
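A hypothetical illustration of that workflow (deepcut and crfcut are real PyThaiNLP engine names; deepcut needs the optional deepcut package installed):

from pythainlp.tokenize import sent_tokenize, word_tokenize

# Word-tokenize with deepcut, then sentence-tokenize the rejoined text
# with crfcut, which runs its own newmm word tokenizer internally.
words = word_tokenize("ผมกินข้าว เธอเล่นเกม", engine="deepcut")
sents = sent_tokenize("".join(words), engine="crfcut")
# map_indices_to_words from the comment above would then regroup
# `words` under each sentence in `sents`.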

@ayaan-qadri (Contributor, Author)

Sorry, but I do not have any idea about the tokenization engine and how it works. Could you please provide more details? Then maybe I could help you with this.

@wannaphong (Member)

> Sorry, but I do not have any idea about the tokenization engine and how it works. Could you please provide more details? Then maybe I could help you with this.

From https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/tokenize/core.py#L325-L507:

Here are the sentence tokenization engines (a brief usage sketch follows the list):

  • crfcut - (default) split by a CRF trained on the TED dataset
  • thaisum - the implementation of the sentence segmenter from Nakhun Chumpolsathien, 2020
  • tltk - split by TLTK (https://pypi.org/project/tltk/)
  • wtp - split by wtpsplit (https://github.com/bminixhofer/wtpsplit)
  • whitespace+newline - split by whitespace and newline
  • whitespace - split by whitespace, specifically with the regex pattern r" +"
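A brief usage sketch (engine selection in sent_tokenize is a keyword argument):

from pythainlp.tokenize import sent_tokenize

text = "ผมกินข้าว เธอเล่นเกม"
sent_tokenize(text)                       # default engine: crfcut
sent_tokenize(text, engine="whitespace")  # split with the regex r" +"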

crfcut, thaisum, tltk, and wtp each have their own word tokenizer inside the engine, so I think we should preprocess by storing the word indices and then use those indices to split the words back out of the list of sentences.

List of words (['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม']) -> text ("ผมกินข้าว เธอเล่นเกม", with the word indices stored) -> sentence tokenizer -> list of sentences (['ผมกินข้าว ', 'เธอเล่นเกม']) -> list of word lists per sentence ([['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']])
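A minimal self-contained sketch of that round trip (words_to_sentences is an illustrative name, not PyThaiNLP API; it uses end-exclusive spans instead of the inclusive spans in the earlier comment, and the exact sentence splits depend on the engine):

from pythainlp.tokenize import sent_tokenize


def words_to_sentences(words):
    # Record the end-exclusive character span of each word in the joined text.
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w)))
        pos += len(w)

    text = "".join(words)
    sentences = sent_tokenize(text)  # assumes the engine preserves every character

    # Regroup the words by the sentence their span starts in.
    result, sent_end, i = [], 0, 0
    for sent in sentences:
        sent_end += len(sent)
        group = []
        while i < len(spans) and spans[i][0] < sent_end:
            group.append(words[i])
            i += 1
        result.append(group)
    return result


words_to_sentences(['ผม', 'กินข้าว', ' ', 'เธอ', 'เล่น', 'เกม'])
# e.g. [['ผม', 'กินข้าว', ' '], ['เธอ', 'เล่น', 'เกม']]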

@pep8speaks commented Oct 13, 2024

Hello @ayaan-qadri! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-10-28 15:29:40 UTC

@ayaan-qadri (Contributor, Author)

@wannaphong, thanks for the explanation. Please check again; I made a mistake in a past commit too, sorry for that.

@wannaphong (Member)

I found a bug.

[screenshot]

The indices store whitespace, but the whitespace and whitespace+newline engines tokenize on whitespace. Can you re-create word_indices in those engines?

word_indices = indices_words(the text with whitespace removed for the whitespace engine, and with whitespace and newlines removed for whitespace+newline)
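A hypothetical sketch of that request (the filtering is illustrative; indices_words is the helper from the earlier comment):

words = ['ผม', 'กิน', 'ข้าว', ' ', '\n', 'เธอ', 'เล่น', 'เกม']

# The separators are consumed by the split, so index only the words the
# engine will actually emit.
ws_words = [w for w in words if not all(ch == ' ' for ch in w)]  # whitespace engine keeps '\n'
wsnl_words = [w for w in words if not w.isspace()]               # whitespace+newline engine

word_indices = indices_words(wsnl_words)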

@ayaan-qadri (Contributor, Author)

@wannaphong, check it now.

@wannaphong (Member)

I found two bugs in two cases.

list_word = ["ผม", "กิน", "ข้าว", " ", "\n", "เธอ", "เล่น", "เกม"]
sent_tokenize(list_word)

Result: [['ผม', 'กิน', 'ข้าว', ' \n', '\nเ', 'เธอ', 'เล่น', 'เกม']]

Expected result: [['ผม', 'กิน', 'ข้าว', ' ', '\n', 'เธอ', 'เล่น', 'เกม']]

list_word = ["ผม", "กิน", "ข้าว", " ", "\n", "เธอ", "เล่น", "เกม"]
sent_tokenize(list_word, engine="whitespace")

Result: [['ผม', 'กิน', 'ข้าว'], ['\nเธ', 'อเล่', 'นเก']]

Expected result: [['ผม', 'กิน', 'ข้าว'], ['\n', 'เธอ', 'เล่น', 'เกม']]
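A plausible reproduction of the first artifact, assuming the span quirk in indices_words from the earlier comment (a 1-character word gets the span (start, start + 1), so slicing end + 1 also grabs the next character):

text = "".join(["ผม", "กิน", "ข้าว", " ", "\n", "เธอ", "เล่น", "เกม"])
# " " sits at index 9 and "\n" at index 10; spans (9, 10) and (10, 11)
# sliced as text[start:end + 1] reproduce the reported artifacts.
print(repr(text[9:11]))   # ' \n' instead of ' '
print(repr(text[10:12]))  # '\nเ' instead of '\n'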

@ayaan-qadri
Copy link
Contributor Author

@wannaphong the commit fixes for default engine 'crfcut' but it is not producing result as mentioned for whitespace engine, Can you please handle from this point.

@bact added the enhancement label Oct 27, 2024
@wannaphong (Member)

I fixed the whitespace engine.

@wannaphong added the hacktoberfest-accepted label Oct 28, 2024

@wannaphong (Member)

Thank you @ayaan-qadri for your pull request!

Cheers! 🍻

@wannaphong merged commit b11fe00 into PyThaiNLP:dev on Oct 28, 2024
7 of 12 checks passed

@bact mentioned this pull request Nov 2, 2024
@bact changed the title from "Added list of string support to sent_tokenize" to "Allow sent_tokenize to accept a list of strings as input" Nov 3, 2024
@bact changed the title from "Allow sent_tokenize to accept a list of strings as input" to "Add support for list of strings as input to sent_tokenize()" Nov 3, 2024
Labels: enhancement, hacktoberfest-accepted

Development: successfully merging this pull request may close the issue "Get List of words to sent_tokenize".

4 participants