Releases: benbrandt/text-splitter
v0.6.0
Breaking Changes
- Chunk behavior should now be the same as prior to v0.5.0. Once binary search finds the optimal chunk, we now check the next few sections as long as the chunk size doesn't change. This should result in the same behavior as before, but with the performance improvements of binary search. @benbrandt in #81
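The two-step idea described above can be illustrated with a small, self-contained sketch. This is not the crate's implementation: `choose_end`, `non_whitespace_len`, and the list of candidate end offsets are hypothetical names invented for illustration, and the sizer is a toy stand-in for a tokenizer that, like many tokenizers, often reports the same size with or without trailing whitespace.

```rust
use std::cmp::Ordering;

/// Toy stand-in for a tokenizer (hypothetical): counts non-whitespace chars,
/// so trailing whitespace never changes the reported size.
fn non_whitespace_len(chunk: &str) -> usize {
    chunk.chars().filter(|c| !c.is_whitespace()).count()
}

/// Returns the index into `ends` of the chosen chunk end, or `None` if no
/// candidate fits. `ends` are ascending byte offsets on char boundaries.
fn choose_end<F>(text: &str, ends: &[usize], capacity: usize, size_of: F) -> Option<usize>
where
    F: Fn(&str) -> usize,
{
    let (mut low, mut high) = (0, ends.len());
    let mut best: Option<(usize, usize)> = None; // (index, size)

    // Step 1: binary search, assuming size never shrinks as the chunk grows.
    while low < high {
        let mid = low + (high - low) / 2;
        let size = size_of(&text[..ends[mid]]);
        match size.cmp(&capacity) {
            Ordering::Greater => high = mid,
            Ordering::Less => {
                best = Some((mid, size));
                low = mid + 1;
            }
            Ordering::Equal => {
                // Stopping here may land on the shorter of two equal-size candidates.
                best = Some((mid, size));
                break;
            }
        }
    }

    // Step 2: keep taking the next section while the size does not change.
    let (mut index, size) = best?;
    while index + 1 < ends.len() && size_of(&text[..ends[index + 1]]) == size {
        index += 1;
    }
    Some(index)
}
```

With the text `"aa bb cc"`, candidate ends `[2, 3, 5, 6, 8]`, and a capacity of 4, step 1 stops at `"aa bb"` and step 2 extends the chunk to `"aa bb "`, since the trailing space does not change the measured size.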
Full Changelog: v0.5.1...v0.6.0
v0.5.1
What's New
- Python bindings and Rust crate now have the same version number.
Rust
- Constructors for `ChunkSize` are now public, so you can more easily create your own `ChunkSize` structs for your own custom `ChunkSizer` implementation.
Python
- New `CustomTextSplitter` that accepts a custom callback with the signature of `(str) -> int`. Allows for custom chunk sizing on the Python side. @benbrandt in #80
Full Changelog: v0.5.0...v0.5.1
v0.5.0
What's New
- Significant performance improvements for generating chunks with the `tokenizers` or `tiktoken-rs` crates by applying binary search when attempting to find the next matching chunk size. @benbrandt and @bradfier in #71
Breaking Changes
- Minimum required version of `tokenizers` is now `0.15.0`
- Minimum required version of `tiktoken-rs` is now `0.5.6`
- Due to using binary search, there are some slight differences at the edges of chunks where the algorithm was a little greedier before. If two candidates would tokenize to the same number of tokens that fit within the capacity, it will now choose the shorter text. Due to the nature of tokenizers, this happens more often with whitespace at the end of a chunk, and rarely affects users who have set `with_trim_chunks(true)`. It is a tradeoff, but keeping the exact same behavior would have made the binary search code much more complicated.
- The `chunk_size` method on `ChunkSizer` now needs to accept a `ChunkCapacity` argument, and return a `ChunkSize` struct instead of a `usize`. This was to help support the new binary search method in chunking, and should only affect users who implemented custom `ChunkSizer`s and weren't using one of the provided ones.
  - New signature: `fn chunk_size(&self, chunk: &str, capacity: &impl ChunkCapacity) -> ChunkSize;`
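To show the shape of a custom sizer under the new signature, here is a minimal sketch that compiles on its own. The `ChunkCapacity`, `ChunkSize`, and `ChunkSizer` definitions below are simplified local stand-ins, not the crate's definitions: the `limit` method and the `size` and `fits` fields are assumptions made for illustration, and `WordCount` is a hypothetical sizer. Only the `chunk_size` signature is taken from the release notes.

```rust
use std::cmp::Ordering;

// Local stand-in (NOT the crate's trait): something that knows its upper bound.
trait ChunkCapacity {
    fn limit(&self) -> usize; // hypothetical method
}

impl ChunkCapacity for usize {
    fn limit(&self) -> usize {
        *self
    }
}

// Local stand-in (NOT the crate's struct): a size plus how it compares to capacity.
#[derive(Debug, PartialEq)]
struct ChunkSize {
    size: usize,
    fits: Ordering,
}

// Local stand-in trait carrying the signature quoted in the release notes.
trait ChunkSizer {
    fn chunk_size(&self, chunk: &str, capacity: &impl ChunkCapacity) -> ChunkSize;
}

// Hypothetical custom sizer: measures a chunk in whitespace-separated words.
struct WordCount;

impl ChunkSizer for WordCount {
    fn chunk_size(&self, chunk: &str, capacity: &impl ChunkCapacity) -> ChunkSize {
        let size = chunk.split_whitespace().count();
        // Report the comparison along with the size, so a caller running a
        // binary search knows which direction to move next.
        ChunkSize {
            size,
            fits: size.cmp(&capacity.limit()),
        }
    }
}
```

Returning a struct rather than a bare `usize` lets the sizer tell the caller whether the chunk is under, at, or over capacity, which is the information a binary search needs at each step. When migrating a real implementation, use the crate's own `ChunkSize` constructors rather than these stand-ins.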
Full Changelog: v0.4.5...v0.5.0
Python: v0.3.1
Fix broken release
Python: v0.3.0
What's New
- Update to `v0.5.0` of `text-splitter` for significant performance improvements for generating chunks with the `tokenizers` or `tiktoken-rs` crates by applying binary search when attempting to find the next matching chunk size.
Breaking Changes
- Minimum Python version is now 3.8.
- Due to using binary search, there are some slight differences at the edges of chunks where the algorithm was a little greedier before. If two candidates would tokenize to the same number of tokens that fit within the capacity, it will now choose the shorter text. Due to the nature of tokenizers, this happens more often with whitespace at the end of a chunk, and rarely affects users who have set `trim_chunks=True`. It is a tradeoff, but keeping the exact same behavior would have made the binary search code much more complicated.
Full Changelog: python-v0.2.4...python-v0.3.0
v0.4.5
What's Changed
- Support `tokenizers` crate v0.15.0
- Minimum Supported Rust Version is now 1.65.0
New Contributors
- @FullMetalMeowchemist made their first contribution in #53
Full Changelog: v0.4.4...v0.4.5
Python: v0.2.4 - Update to latest tokenizer crates
What's Changed
- Update to v0.4.5 of `text-splitter` to support `tokenizers@0.15.0`
- Update `tokenizers` and `tiktoken-rs` to latest versions
v0.4.4
What's New
- Support `tokenizers` crate v0.14.0
- Minimum Supported Rust Version is now 1.61.0
Full Changelog: v0.4.3...v0.4.4
Python: v0.2.3 - Update to latest tokenizer crates
What's New
- Update to v0.4.4 of `text-splitter` to support `tokenizers@0.14.0`
- Update `tokenizers` and `tiktoken-rs` to latest versions
Full Changelog: python-v0.2.2...python-v0.2.3
v0.4.3
What's Changed
- Support `impl ChunkSizer` for `&Tokenizer` and `&CoreBPE`, allowing for generating chunks based off of a reference to a tokenizer as well, instead of requiring ownership. by @benbrandt in #37
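The benefit of implementing a sizing trait for references can be sketched with standard-library Rust only. Everything below is a hypothetical stand-in: `Sizer`, `CharCounter`, and `Splitter` are not the crate's types, and `CharCounter` merely plays the role of an expensive-to-build tokenizer.

```rust
// Hypothetical sizing trait, standing in for the crate's `ChunkSizer`.
trait Sizer {
    fn size(&self, chunk: &str) -> usize;
}

// Stand-in for a tokenizer that you would rather not clone or move.
struct CharCounter;

impl Sizer for CharCounter {
    fn size(&self, chunk: &str) -> usize {
        chunk.chars().count()
    }
}

// Implementing the trait for a reference lets callers lend the sizer
// instead of giving it away.
impl Sizer for &CharCounter {
    fn size(&self, chunk: &str) -> usize {
        // Delegate explicitly to the owned type's implementation.
        <CharCounter as Sizer>::size(*self, chunk)
    }
}

// Hypothetical splitter that is generic over any sizer, owned or borrowed.
struct Splitter<S: Sizer> {
    sizer: S,
}

impl<S: Sizer> Splitter<S> {
    fn fits(&self, chunk: &str, capacity: usize) -> bool {
        self.sizer.size(chunk) <= capacity
    }
}
```

With the reference implementation in place, one `CharCounter` value can back several `Splitter { sizer: &counter }` instances at once, while the owning form `Splitter { sizer: CharCounter }` continues to work.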
Full Changelog: v0.4.2...v0.4.3