[Fix] #37: Incorrect indexing when repetition is present in the text #87
This pull request includes several changes to the src/chonkie/chunker module, focusing on improving the chunking process and refactoring the code for better readability and maintainability. The most important changes include modifications to the _prepare_sentences method, the introduction of the _create_chunks method, and updates to the chunking logic in both token.py and word.py.

Changes to chunking process:
src/chonkie/chunker/sentence.py: Disabled the _prepare_sentences method and its associated logic, which was responsible for preparing sentences with estimated or accurate token counts. The method is now commented out rather than deleted.
src/chonkie/chunker/token.py: Introduced a new _create_chunks method that packages texts as Chunk objects and returns the result. This method is used to simplify the chunk method. [1] [2]

Code refactoring:
src/chonkie/chunker/word.py: Updated the _create_chunk method to include a current_index parameter, allowing for more accurate chunk creation by finding the start index from the current index.
src/chonkie/chunker/word.py: Modified the chunk method to use the updated _create_chunk method with the current_index parameter, ensuring correct chunk overlaps and maintaining the current index throughout the process.

These changes collectively improve the accuracy and efficiency of the chunking process while making the codebase easier to maintain.
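
The current_index change can be illustrated with a minimal sketch. This is not the code from the pull request: the Chunk dataclass, create_chunk, and chunk_words below are hypothetical stand-ins, and the sketch assumes non-overlapping chunks of space-separated words. It only shows why searching from a running index, rather than from the start of the text, yields correct offsets when the same text repeats.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for the library's Chunk type."""
    text: str
    start_index: int
    end_index: int


def create_chunk(text: str, chunk_text: str, current_index: int = 0) -> Chunk:
    """Locate chunk_text in text, searching only from current_index onward."""
    # str.find(sub, start) skips earlier occurrences of a repeated substring.
    start = text.find(chunk_text, current_index)
    if start == -1:
        raise ValueError("chunk text not found at or after current_index")
    return Chunk(chunk_text, start, start + len(chunk_text))


def chunk_words(text: str, chunk_size: int) -> List[Chunk]:
    """Split text into chunks of chunk_size space-separated words (no overlap)."""
    words = text.split(" ")
    chunks: List[Chunk] = []
    current_index = 0
    for i in range(0, len(words), chunk_size):
        chunk_text = " ".join(words[i:i + chunk_size])
        chunk = create_chunk(text, chunk_text, current_index)
        chunks.append(chunk)
        # Assumption: without overlap, the next search starts where this chunk ends.
        current_index = chunk.end_index
    return chunks
```

For the input "the cat the cat the cat" with chunk_size=2, a search that always starts at position 0 would report start index 0 for all three chunks. The sketch reports start indices 0, 8, and 16. With overlapping chunks, the running index would need to advance to the start of the overlap instead of the end of the chunk.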