[Fix] #37: Incorrect indexing when repetition is present in the text #87

bhavnicksm · 2024-12-11T22:16:28Z

This pull request changes the src/chonkie/chunker module to fix chunk indexing for repetitive text and to refactor the code for readability and maintainability. The most important changes are commenting out the _prepare_sentences method in sentence.py, introducing the _create_chunks method in token.py, and updating the chunking logic in both token.py and word.py.

Changes to chunking process:

src/chonkie/chunker/sentence.py: Commented out the _prepare_sentences method and its associated logic, which prepared sentences with estimated or accurate token counts.
src/chonkie/chunker/token.py: Introduced a new _create_chunks method that packages texts as Chunk objects and returns the result. This method is used to simplify the chunk method.

Code refactoring:

src/chonkie/chunker/word.py: Updated the _create_chunk method to include a current_index parameter, allowing for more accurate chunk creation by finding the start index from the current index.
src/chonkie/chunker/word.py: Modified the chunk method to use the updated _create_chunk method with the current_index parameter, ensuring correct chunk overlaps and maintaining the current index throughout the process.
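
The current_index idea described above can be illustrated with a minimal sketch. This is not the code from word.py: the function names locate_naive and locate_with_cursor are hypothetical, chunks are represented as plain (start, end) tuples rather than Chunk objects, and the chunks are assumed not to overlap. It only shows why searching from a running index gives correct offsets when the same text repeats.

```python
from typing import List, Tuple


def locate_naive(text: str, pieces: List[str]) -> List[Tuple[int, int]]:
    """Hypothetical buggy version: always searches from the start of the text."""
    spans = []
    for piece in pieces:
        start = text.find(piece)  # repeated pieces all resolve to the first match
        spans.append((start, start + len(piece)))
    return spans


def locate_with_cursor(text: str, pieces: List[str]) -> List[Tuple[int, int]]:
    """Hypothetical fixed version: searches from a running current_index."""
    spans = []
    current_index = 0
    for piece in pieces:
        start = text.find(piece, current_index)  # skip matches before the cursor
        if start == -1:
            raise ValueError(f"piece {piece!r} not found after index {current_index}")
        end = start + len(piece)
        spans.append((start, end))
        # Assumption: chunks do not overlap, so the cursor moves to the chunk end.
        current_index = end
    return spans


text = "la la la la la la"
pieces = ["la la", "la la", "la la"]

print(locate_naive(text, pieces))        # every span points at the first match
print(locate_with_cursor(text, pieces))  # spans advance through the text
```

For overlapping chunks, the cursor would presumably need to advance by less than the full chunk length so that the next search can begin inside the overlap; the sketch leaves that case out.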

These changes collectively improve the accuracy and efficiency of the chunking process while making the codebase easier to maintain.

- Updated the TokenChunker class to replace the _process_batch method with _create_chunks for improved clarity and functionality.
- This change enhances the overall structure of the code and aligns with recent refactoring efforts in the chunking classes.

- Updated the _create_chunk method to include current_index as a parameter for better control over chunk starting index.
- Adjusted the logic in the chunking process to utilize the new current_index parameter, enhancing the accuracy of chunk creation.
- This refactor improves code clarity and maintains consistency with recent changes in other chunking classes.

- Removed unnecessary space adjustment in position calculation for sentences, as they are already separated by spaces.
- Commented out the _prepare_sentences method to streamline the class and focus on the essential functionality.
- This change enhances code clarity and prepares for future improvements in sentence processing.

bhavnicksm added 4 commits December 12, 2024 03:13

[Fix] indexing logic for TokenChunker for fn chunk

b62d557

bhavnicksm merged commit d35e755 into development Dec 11, 2024
1 check failed

bhavnicksm mentioned this pull request Dec 11, 2024

[BUG] start_index and end_index inaccurate for repetitive text chunks #37

Closed
