v0.5.0

@benbrandt benbrandt released this 27 Dec 19:26
e716aa9

What's New

  • Significant performance improvements for generating chunks with the tokenizers or tiktoken-rs crates by applying binary search when attempting to find the next matching chunk size. @benbrandt and @bradfier in #71
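To give a feel for why this helps, here is a minimal sketch (not the crate's actual code) that compares a linear scan with a binary search over candidate chunk ends. Everything in it is an assumption made for illustration: `count_tokens` is a whitespace word counter standing in for a real tokenizer, and `candidate_ends`, `linear_fit` and `binary_fit` are hypothetical helper names. It also relies on the token count never shrinking as the prefix grows, and it keeps the longest fitting prefix, so it does not reproduce the crate's tie-breaking.

```rust
/// Stand-in for a real tokenizer (assumption): counts whitespace-separated words.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Hypothetical helper: every byte offset at which a chunk starting at 0 may end.
/// Here that is simply the end of each character, in ascending order.
fn candidate_ends(text: &str) -> Vec<usize> {
    text.char_indices().map(|(i, c)| i + c.len_utf8()).collect()
}

/// Linear scan: tokenize one candidate after another until one is too big.
/// Returns the end of the longest fitting prefix and the number of tokenizer calls.
fn linear_fit(text: &str, ends: &[usize], capacity: usize) -> (Option<usize>, usize) {
    let mut calls = 0;
    let mut best = None;
    for &end in ends {
        calls += 1;
        if count_tokens(&text[..end]) > capacity {
            break;
        }
        best = Some(end);
    }
    (best, calls)
}

/// Binary search over the same candidates. This assumes the token count is
/// non-decreasing as the prefix gets longer.
fn binary_fit(text: &str, ends: &[usize], capacity: usize) -> (Option<usize>, usize) {
    let mut calls = 0;
    // Invariant: every candidate in ends[..lo] fits, none in ends[hi..] does.
    let (mut lo, mut hi) = (0, ends.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        calls += 1;
        if count_tokens(&text[..ends[mid]]) <= capacity {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    (lo.checked_sub(1).map(|i| ends[i]), calls)
}
```

Both functions return the same end offset, but the binary search tokenizes only a logarithmic number of candidates, which matters when each tokenizer call is expensive.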

Breaking Changes

  • Minimum required version of tokenizers is now 0.15.0
  • Minimum required version of tiktoken-rs is now 0.5.6
  • Because of the binary search, chunk boundaries can differ slightly from earlier versions, where the algorithm was a little greedier. If two candidates tokenize to the same number of tokens and both fit within the capacity, the shorter text is now chosen. Due to the nature of tokenizers, this happens most often with whitespace at the end of a chunk, and it rarely affects users who have set with_trim_chunks(true). This is a tradeoff: keeping the exact previous behavior would have made the binary search code much more complicated.
  • The chunk_size method on ChunkSizer now accepts a ChunkCapacity argument and returns a ChunkSize struct instead of a usize. This change supports the new binary search method in chunking, and should only affect users who implemented a custom ChunkSizer rather than using one of the provided ones.
    • New signature: fn chunk_size(&self, chunk: &str, capacity: &impl ChunkCapacity) -> ChunkSize;
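
As a rough illustration of the new signature, the sketch below implements a custom sizer against local stand-in types so that it compiles without the crate. Only the `chunk_size` signature is taken from this release note. The trait body of `ChunkCapacity`, its impl for `usize`, the fields of `ChunkSize`, and the `WordCount` sizer are all assumptions and may not match the crate's real definitions.

```rust
use std::cmp::Ordering;

// Local stand-ins (assumptions), NOT the crate's definitions.
trait ChunkCapacity {
    /// Less: below the capacity, Equal: exactly at it, Greater: too big.
    fn fits(&self, size: usize) -> Ordering;
}

// Assumed for illustration: a plain usize acts as an exact target size.
impl ChunkCapacity for usize {
    fn fits(&self, size: usize) -> Ordering {
        size.cmp(self)
    }
}

// Assumed shape: the measured size plus how it compares to the capacity.
#[derive(Debug, PartialEq)]
struct ChunkSize {
    size: usize,
    fits: Ordering,
}

trait ChunkSizer {
    // Signature as given in the release note.
    fn chunk_size(&self, chunk: &str, capacity: &impl ChunkCapacity) -> ChunkSize;
}

/// Hypothetical custom sizer: counts whitespace-separated words.
struct WordCount;

impl ChunkSizer for WordCount {
    fn chunk_size(&self, chunk: &str, capacity: &impl ChunkCapacity) -> ChunkSize {
        let size = chunk.split_whitespace().count();
        ChunkSize { size, fits: capacity.fits(size) }
    }
}
```

With this stand-in sizer, "hello world" and "hello world " (with a trailing space) both report a size of 2 against a capacity of 2. A search keyed only on the size cannot tell such candidates apart, which is the kind of tie described above where the shorter text is now chosen.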

Full Changelog: v0.4.5...v0.5.0