Releases: benbrandt/text-splitter

Python: v0.2.3 - Update to latest tokenizer crates

11 Sep 09:51

What's New

  • Update to v0.4.4 of text-splitter to support tokenizers@0.14.0
  • Update tokenizers and tiktoken-rs to latest versions

Full Changelog: python-v0.2.2...python-v0.2.3

v0.4.3

10 Aug 19:27
2af8bd8

What's Changed

  • Implement ChunkSizer for &Tokenizer and &CoreBPE, so chunks can be generated from a reference to a tokenizer instead of requiring ownership. by @benbrandt in #37

Full Changelog: v0.4.2...v0.4.3

v0.4.2

02 Jul 20:30
2f5f718

What's Changed

Full Changelog: v0.4.1...v0.4.2

Python v0.2.2

02 Jul 20:41
fed9dde

What's Changed

Full Changelog: python-v0.2.1...python-v0.2.2

Python v0.2.1 - OpenAI Tiktoken Support

13 Jun 19:59

What's Changed

  • Support OpenAI tiktoken tokenizers. You can now give an OpenAI model name, and the text is tokenized for that model when calculating chunk sizes. by @benbrandt in #23
from semantic_text_splitter import TiktokenTextSplitter

# Maximum number of tokens in a chunk
max_tokens = 1000
# Optionally, have the splitter leave whitespace untrimmed
splitter = TiktokenTextSplitter("gpt-3.5-turbo", trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)

Full Changelog: python-v0.2.0...python-v0.2.1

Python: v0.2.0 - Hugging Face Tokenizer support

12 Jun 08:37
fc3709a

What's New

  • New HuggingFaceTextSplitter, which uses Hugging Face's tokenizers package to measure chunk size in tokens, with a tokenizer of your choice.
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Optionally, have the splitter leave whitespace untrimmed
splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)
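
To make the difference between character-based and token-based sizing concrete, here is a minimal, dependency-free sketch. It is not how the library splits text: `greedy_chunks` and `whitespace_token_count` are hypothetical names introduced for illustration, and the whitespace counter is only a stand-in for a real tokenizer. The point is that the same splitting loop can use either measure, so long as it is handed a function that reports the size of a candidate chunk.

```python
from typing import Callable, List


def greedy_chunks(text: str, capacity: int, size_of: Callable[[str], int]) -> List[str]:
    """Pack whitespace-separated words into chunks whose size stays within capacity.

    Hypothetical helper: size_of decides what "size" means (characters, tokens, ...).
    A single word larger than capacity is still emitted as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and size_of(candidate) > capacity:
            # The candidate would overflow, so close the current chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def whitespace_token_count(chunk: str) -> int:
    # Assumption: stands in for a real tokenizer's token count.
    return len(chunk.split())


text = "the quick brown fox jumps over the lazy dog"
by_chars = greedy_chunks(text, 15, len)                      # at most 15 characters
by_tokens = greedy_chunks(text, 4, whitespace_token_count)   # at most 4 "tokens"
```

With a real tokenizer the token count generally differs from the word count, which is why a limit expressed in tokens is the safer one when the chunks feed a model with a token budget.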

Breaking Changes

  • trim_chunks now defaults to True instead of False. For most use cases, this is the desired behavior, especially with chunk ranges.
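
As a rough illustration of what trimming changes, the sketch below uses a hypothetical fixed-width splitter, `naive_chunks`, which is not part of the package and does not reflect its splitting logic. It only shows the effect of stripping whitespace from each chunk, under the assumption that chunks left empty after stripping are dropped.

```python
from typing import List


def naive_chunks(text: str, max_characters: int, trim_chunks: bool = True) -> List[str]:
    """Hypothetical fixed-width splitter, used only to show what trimming changes."""
    pieces = [text[i:i + max_characters] for i in range(0, len(text), max_characters)]
    if trim_chunks:
        # Strip leading/trailing whitespace, then drop chunks that became empty
        # (an assumption of this sketch, not a documented library behavior).
        pieces = [p.strip() for p in pieces]
        pieces = [p for p in pieces if p]
    return pieces


text = "hello   world  "
untrimmed = naive_chunks(text, 8, trim_chunks=False)
trimmed = naive_chunks(text, 8, trim_chunks=True)
```

Untrimmed chunks concatenate back to the original text, while trimmed chunks carry no padding and are usually what you want to pass downstream.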

Full Changelog: python-v0.1.4...python-v0.2.0

v0.4.1 - Remove unneeded `tokenizers` features

11 Jun 05:50

What's Changed

Full Changelog: v0.4.0...v0.4.1

Python: v0.1.4 - Fifth time is the charm?

09 Jun 05:01

Python: v0.1.3 - New package name

09 Jun 04:44

Had to adjust the package name so that it could be uploaded to PyPI.

from text_splitter import CharacterTextSplitter

# Maximum number of characters in a chunk
max_characters = 1000
# Optionally, have the splitter trim whitespace for you
splitter = CharacterTextSplitter(trim_chunks=True)

chunks = splitter.chunks("your document text", max_characters)

Full Changelog: python-v0.1.2...python-v0.1.3

Python: v0.1.2 - Fix bad release

08 Jun 21:09

Apologies... first time publishing a Python package...

Full Changelog: python-v0.1.1...python-v0.1.2