Releases: benbrandt/text-splitter
Python: v0.2.3 - Update to latest tokenizer crates
What's New
- Update to v0.4.4 of text-splitter to support tokenizers@0.14.0
- Update tokenizers and tiktoken-rs to latest versions
Full Changelog: python-v0.2.2...python-v0.2.3
v0.4.3
What's Changed
- Support impl ChunkSizer for &Tokenizer and &CoreBPE, so chunks can also be generated from a reference to a tokenizer instead of requiring ownership. by @benbrandt in #37
Full Changelog: v0.4.2...v0.4.3
v0.4.2
What's Changed
- Loosen tiktoken-rs version requirements by @benbrandt in #28
Full Changelog: v0.4.1...v0.4.2
Python v0.2.2
What's Changed
- Python: Update to text-splitter 0.4.2 by @benbrandt in #31
Full Changelog: python-v0.2.1...python-v0.2.2
Python v0.2.1 - OpenAI Tiktoken Support
What's Changed
- Support OpenAI Tiktoken tokenizers. You can now pass an OpenAI model name, and the matching tokenizer is used to count tokens when calculating chunk sizes. by @benbrandt in #23
from semantic_text_splitter import TiktokenTextSplitter
# Maximum number of tokens in a chunk
max_tokens = 1000
# Optionally can also have the splitter not trim whitespace for you
splitter = TiktokenTextSplitter("gpt-3.5-turbo", trim_chunks=False)
chunks = splitter.chunks("your document text", max_tokens)
Full Changelog: python-v0.2.0...python-v0.2.1
Python: v0.2.0 - Hugging Face Tokenizer support
What's New
- New HuggingFaceTextSplitter, which uses Hugging Face's tokenizers package to measure chunk sizes in tokens, with a tokenizer of your choice.
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer
# Maximum number of tokens in a chunk
max_tokens = 1000
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Optionally can also have the splitter not trim whitespace for you
splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)
chunks = splitter.chunks("your document text", max_tokens)
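As a rough illustration of why the unit of measurement matters, here is a minimal, library-independent sketch. The `greedy_chunks` helper and the whitespace-based `count_tokens` function are hypothetical stand-ins invented for this example. They are not part of `semantic_text_splitter`, and the real splitter uses an actual tokenizer and its own splitting strategy.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def greedy_chunks(text: str, max_size: int, sizer: Callable[[str], int]) -> List[str]:
    """Pack whole words into chunks whose size, as measured by `sizer`, stays <= max_size.

    A single word that exceeds max_size on its own is emitted as an oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and sizer(candidate) > max_size:
            # Adding this word would overflow the chunk, so close it and start a new one
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "the quick brown fox jumps over the lazy dog"
by_tokens = greedy_chunks(text, 3, count_tokens)  # limit of 3 (pseudo-)tokens
by_chars = greedy_chunks(text, 10, len)           # limit of 10 characters
```

The same text yields different chunk boundaries depending on the sizer passed in, which is the idea behind choosing a tokenizer that matches the model you intend to feed the chunks to.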
Breaking Changes
trim_chunks now defaults to True instead of False. For most use cases, this is the desired behavior, especially with chunk ranges.
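To make the visible effect of this flag concrete, the following sketch mimics it with a deliberately naive fixed-width splitter. The `fixed_width_chunks` function is a hypothetical helper written for this example, not the library's implementation. It only shows how the returned strings differ, and it makes no claim about how the real splitter accounts for whitespace when measuring chunk size.

```python
from typing import List


def fixed_width_chunks(text: str, max_characters: int, trim_chunks: bool = True) -> List[str]:
    """Hypothetical helper: cut text every max_characters characters.

    When trim_chunks is True, surrounding whitespace is stripped from each
    piece and pieces that end up empty are dropped.
    """
    pieces = [text[i:i + max_characters] for i in range(0, len(text), max_characters)]
    if trim_chunks:
        pieces = [p.strip() for p in pieces]
        pieces = [p for p in pieces if p]
    return pieces


text = "alpha  beta  gamma"
untrimmed = fixed_width_chunks(text, 6, trim_chunks=False)
trimmed = fixed_width_chunks(text, 6, trim_chunks=True)
```

With trimming off, the chunks keep the leading and trailing spaces from the source text. With trimming on, only the words remain.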
Full Changelog: python-v0.1.4...python-v0.2.0
v0.4.1 - Remove unneeded `tokenizers` features
What's Changed
- Remove unnecessary tokenizer features by @benbrandt in #20
Full Changelog: v0.4.0...v0.4.1
Python: v0.1.4 - Fifth time is the charm?
Python: v0.1.3 - New package name
Had to adjust the package name so that it could be uploaded to PyPI.
from text_splitter import CharacterTextSplitter
# Maximum number of characters in a chunk
max_characters = 1000
# Optionally can also have the splitter trim whitespace for you
splitter = CharacterTextSplitter(trim_chunks=True)
chunks = splitter.chunks("your document text", max_characters)
Full Changelog: python-v0.1.2...python-v0.1.3
Python: v0.1.2 - Fix bad release
Apologies... first time publishing a Python package...
Full Changelog: python-v0.1.1...python-v0.1.2