Releases: benbrandt/text-splitter

v0.6.0

14 Jan 07:19

Breaking Changes

  • Chunk behavior should now be the same as prior to v0.5.0. Once binary search finds the optimal chunk, we now check the next few sections as long as the chunk size doesn't change. This should result in the same behavior as before, but with the performance improvements of binary search. @benbrandt in #81
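A minimal sketch of the idea described above, not the crate's actual implementation. It assumes candidate chunks are prefixes of the text ending at sorted offsets, that chunk size never shrinks as the prefix grows, and that the search stops early on an exact fit. The names find_chunk_end, ends, size_of and extend are hypothetical.

```python
from typing import Callable, List, Optional


def find_chunk_end(
    text: str,
    ends: List[int],
    size_of: Callable[[str], int],
    capacity: int,
    extend: bool = True,
) -> Optional[int]:
    """Return the end offset of the chosen chunk text[:end], or None if nothing fits.

    `ends` holds sorted candidate end offsets (hypothetical section boundaries).
    """
    lo, hi = 0, len(ends) - 1
    best = None  # index of the best fitting candidate seen so far
    while lo <= hi:
        mid = (lo + hi) // 2
        size = size_of(text[: ends[mid]])
        if size > capacity:
            hi = mid - 1
        else:
            best = mid
            if size == capacity:
                break  # exact fit: the search stops here
            lo = mid + 1
    if best is None:
        return None
    if extend:
        # Keep taking the next sections for as long as the size does not change.
        best_size = size_of(text[: ends[best]])
        while best + 1 < len(ends) and size_of(text[: ends[best + 1]]) == best_size:
            best += 1
    return ends[best]


# Stand-in sizer: counts whitespace-separated words instead of real tokens.
def words(chunk: str) -> int:
    return len(chunk.split())


text = "one two   three four five six"
ends = list(range(1, len(text) + 1))  # every character offset is a candidate

plain = find_chunk_end(text, ends, words, capacity=2, extend=False)
greedy = find_chunk_end(text, ends, words, capacity=2, extend=True)
print(repr(text[:plain]))   # 'one two'
print(repr(text[:greedy]))  # 'one two   '
```

With extend=False the search settles on the shorter of several equally sized candidates; with extend=True it keeps the trailing whitespace as well. Trimming both results gives the same chunk.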

Full Changelog: v0.5.1...v0.6.0

v0.5.1

13 Jan 14:23
53cc041

What's New

  • Python bindings and Rust crate now have the same version number.

Rust

  • Constructors for ChunkSize are now public, so you can more easily create your own ChunkSize structs for your own custom ChunkSizer implementation.

Python

  • New CustomTextSplitter that accepts a custom callback with the signature of (str) -> int. Allows for custom chunk sizing on the Python side. @benbrandt in #80
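As a small illustration of the callback shape, here is a function with the signature (str) -> int that measures a chunk in whitespace-separated words. The name word_count is hypothetical, and the splitter's constructor arguments are not shown because these notes do not state them.

```python
def word_count(text: str) -> int:
    """Measure a chunk's size as its number of whitespace-separated words."""
    return len(text.split())


# The callback is an ordinary function, so it can be checked on its own
# before it is handed to a splitter.
print(word_count("Chunk sizing can be anything you can count."))
```

Any measure works as long as it returns an int for a given string, for example characters, words, or tokens from a tokenizer of your choice.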

Full Changelog: v0.5.0...v0.5.1

v0.5.0

27 Dec 19:26
e716aa9

What's New

  • Significant performance improvements for generating chunks with the tokenizers or tiktoken-rs crates by applying binary search when attempting to find the next matching chunk size. @benbrandt and @bradfier in #71

Breaking Changes

  • Minimum required version of tokenizers is now 0.15.0
  • Minimum required version of tiktoken-rs is now 0.5.6
  • Due to using binary search, there are some slight differences at the edges of chunks where the algorithm was a little greedier before. If two candidates would tokenize to the same number of tokens that fit within the capacity, it will now choose the shorter text. Due to the nature of tokenizers, this happens more often with whitespace at the end of a chunk, and rarely affects users who have set with_trim_chunks(true). This is a tradeoff: keeping the exact same behavior would have made the binary search code much more complicated.
  • The chunk_size method on ChunkSizer now needs to accept a ChunkCapacity argument, and return a ChunkSize struct instead of a usize. This was to help support the new binary search method in chunking, and should only affect users who implemented custom ChunkSizers and weren't using one of the provided ones.
    • New signature: fn chunk_size(&self, chunk: &str, capacity: &impl ChunkCapacity) -> ChunkSize;
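One way to read the signature change: a binary search needs to know not only how large a chunk is, but also whether to look at shorter or longer candidates next, so the sizer reports the size relative to the capacity. The following is a rough Python analogue of that idea, not the crate's API. SizedChunk, the (low, high) capacity tuple, and the -1/0/1 encoding are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class SizedChunk:
    """Hypothetical stand-in for what a sizer reports about one chunk."""

    size: int  # measured size, e.g. a token count
    fits: int  # -1: below capacity, 0: within capacity, 1: above capacity


def chunk_size(
    chunk: str,
    capacity: Tuple[int, int],
    count: Callable[[str], int],
) -> SizedChunk:
    """Measure `chunk` and compare it against an inclusive (low, high) capacity."""
    low, high = capacity
    size = count(chunk)
    if size < low:
        fits = -1
    elif size > high:
        fits = 1
    else:
        fits = 0
    return SizedChunk(size=size, fits=fits)


# Stand-in sizer: counts whitespace-separated words instead of real tokens.
def words(chunk: str) -> int:
    return len(chunk.split())


print(chunk_size("one", (2, 4), words))                   # too small
print(chunk_size("one two three", (2, 4), words))         # fits
print(chunk_size("one two three four five", (2, 4), words))  # too large
```

A search loop can then branch on fits alone: move to longer candidates on -1, shorter ones on 1, and stop or refine on 0.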

Full Changelog: v0.4.5...v0.5.0

Python: v0.3.1

27 Dec 20:06

Fix broken release

Python: v0.3.0

27 Dec 19:56
e82517a

What's New

  • Update to v0.5.0 of text-splitter for significant performance improvements for generating chunks with the tokenizers or tiktoken-rs crates by applying binary search when attempting to find the next matching chunk size.

Breaking Changes

  • Minimum Python version is now 3.8.
  • Due to using binary search, there are some slight differences at the edges of chunks where the algorithm was a little greedier before. If two candidates would tokenize to the same number of tokens that fit within the capacity, it will now choose the shorter text. Due to the nature of tokenizers, this happens more often with whitespace at the end of a chunk, and rarely affects users who have set trim_chunks=True. This is a tradeoff: keeping the exact same behavior would have made the binary search code much more complicated.

Full Changelog: python-v0.2.4...python-v0.3.0

v0.4.5

15 Nov 15:07

What's Changed

  • Support tokenizers crate v0.15.0
  • Minimum Supported Rust Version is now 1.65.0

Full Changelog: v0.4.4...v0.4.5

Python: v0.2.4 - Update to latest tokenizer crates

15 Nov 15:21
58169b1

What's Changed

  • Update to v0.4.5 of text-splitter to support tokenizers@0.15.0
  • Update tokenizers and tiktoken-rs to latest versions

v0.4.4

11 Sep 09:12

What's New

  • Support tokenizers crate v0.14.0
  • Minimum Supported Rust Version is now 1.61.0

Full Changelog: v0.4.3...v0.4.4

Python: v0.2.3 - Update to latest tokenizer crates

11 Sep 09:51

What's New

  • Update to v0.4.4 of text-splitter to support tokenizers@0.14.0
  • Update tokenizers and tiktoken-rs to latest versions

Full Changelog: python-v0.2.2...python-v0.2.3

v0.4.3

10 Aug 19:27
2af8bd8

What's Changed

  • ChunkSizer is now implemented for &Tokenizer and &CoreBPE, so chunks can be generated from a reference to a tokenizer instead of requiring ownership. @benbrandt in #37

Full Changelog: v0.4.2...v0.4.3