[Fix] #37: Incorrect indexing when repetition is present in the text #87

Merged 4 commits on Dec 11, 2024

bhavnicksm
Collaborator

This pull request fixes incorrect chunk indices when the input text contains repeated passages (#37) and refactors parts of the src/chonkie/chunker module for readability and maintainability. The most important changes are: the _prepare_sentences method in sentence.py is commented out, a new _create_chunks method is introduced in token.py, and the chunking logic in word.py now tracks a current_index.

Changes to chunking process:

  • src/chonkie/chunker/sentence.py: Commented out the _prepare_sentences method and its associated logic, which prepared sentences with estimated or accurate token counts. Also removed the extra space adjustment from the sentence position calculation, since the sentences are already separated by spaces.

  • src/chonkie/chunker/token.py: Introduced a new _create_chunks method that packages texts as Chunk objects and returns them. It replaces _process_batch and simplifies the chunk method.
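
To illustrate what "packaging texts as Chunk objects" can look like, here is a minimal sketch. It is not the code from this PR: the `Chunk` dataclass, the `create_chunks` function, and its parameters are hypothetical stand-ins, and it assumes the chunk texts are consecutive and non-overlapping.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    # Hypothetical stand-in for the library's Chunk type.
    text: str
    start_index: int
    end_index: int
    token_count: int


def create_chunks(chunk_texts: List[str], token_counts: List[int]) -> List[Chunk]:
    """Package consecutive, non-overlapping texts as Chunk objects.

    Assumption: the texts, concatenated in order, reproduce the source text,
    so each start index is the running total of the previous lengths.
    """
    chunks = []
    offset = 0
    for text, count in zip(chunk_texts, token_counts):
        chunks.append(Chunk(text, offset, offset + len(text), count))
        offset += len(text)
    return chunks


# Two identical chunk texts still receive distinct index ranges.
chunks = create_chunks(["the cat ", "the cat ", "sat"], [2, 2, 1])
print([(c.start_index, c.end_index) for c in chunks])
# prints [(0, 8), (8, 16), (16, 19)]
```

Keeping the packaging in one helper means the chunk method only has to decide where to split, not how to build the result objects.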

Code refactoring:

  • src/chonkie/chunker/word.py: Updated the _create_chunk method to take a current_index parameter, so the chunk's start index is searched for from the current index onward rather than from the beginning of the text. This keeps indices correct when the same text occurs more than once.

  • src/chonkie/chunker/word.py: Modified the chunk method to use the updated _create_chunk method with the current_index parameter, ensuring correct chunk overlaps and maintaining the current index throughout the process.
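
The effect of the current_index parameter can be shown with a small sketch. This is an illustration of the general technique, not the code from word.py: `naive_spans` and `tracked_spans` are hypothetical helpers, and the example assumes non-overlapping chunks.

```python
from typing import List, Tuple


def naive_spans(text: str, chunk_texts: List[str]) -> List[Tuple[int, int]]:
    # Searching from the start of the text always returns the first
    # occurrence, so repeated chunks get the same (wrong) indices.
    return [(text.find(c), text.find(c) + len(c)) for c in chunk_texts]


def tracked_spans(text: str, chunk_texts: List[str]) -> List[Tuple[int, int]]:
    # Searching from current_index skips occurrences that earlier
    # chunks have already consumed.
    spans = []
    current_index = 0
    for c in chunk_texts:
        start = text.find(c, current_index)
        end = start + len(c)
        spans.append((start, end))
        current_index = end  # assumption: no overlap between chunks
    return spans


text = "go home go home go"
pieces = ["go home", "go home", "go"]

print(naive_spans(text, pieces))    # prints [(0, 7), (0, 7), (0, 2)]
print(tracked_spans(text, pieces))  # prints [(0, 7), (8, 15), (16, 18)]
```

With overlapping chunks, the search position would presumably need to advance only to the start of the overlap rather than to the end of the previous chunk.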

These changes collectively improve the accuracy and efficiency of the chunking process while making the codebase easier to maintain.

- Updated the TokenChunker class to replace the _process_batch method with _create_chunks for improved clarity and functionality.
- This change enhances the overall structure of the code and aligns with recent refactoring efforts in the chunking classes.
- Updated the _create_chunk method to include current_index as a parameter for better control over chunk starting index.
- Adjusted the logic in the chunking process to utilize the new current_index parameter, enhancing the accuracy of chunk creation.
- This refactor improves code clarity and maintains consistency with recent changes in other chunking classes.
- Removed unnecessary space adjustment in position calculation for sentences, as they are already separated by spaces.
- Commented out the _prepare_sentences method to streamline the class and focus on the essential functionality.
- This change enhances code clarity and prepares for future improvements in sentence processing.
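
The point about the space adjustment can be sketched as follows. This assumes each sentence string keeps its own trailing whitespace, which is one reading of "already separated by spaces"; `sentence_starts` is a hypothetical helper, not a method from sentence.py.

```python
from typing import List


def sentence_starts(sentences: List[str]) -> List[int]:
    """Return the start offset of each sentence in the joined text.

    Assumption: every sentence keeps its trailing space, so the offsets
    are plain running totals of the lengths with no extra "+ 1".
    """
    starts = []
    position = 0
    for sentence in sentences:
        starts.append(position)
        position += len(sentence)  # separator is already part of the sentence
    return starts


sentences = ["One. ", "Two. ", "One. ", "Two."]
text = "".join(sentences)

print(sentence_starts(sentences))  # prints [0, 5, 10, 15]
```

Adding one per sentence on top of these lengths would shift every later sentence further to the right, which is the kind of drift the commit message describes removing.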
@bhavnicksm bhavnicksm merged commit d35e755 into development Dec 11, 2024
1 check failed