v0.2.2
Highlights
- Added Token Estimate-Validate Loops (TEVL) inside the SentenceChunker, making chunking up to ~5x faster in some cases
- Added `auto` thresholding mode for SemanticChunkers, removing the hard requirement on `similarity_threshold`. SemanticChunkers can now decide their own threshold based on the minimum and maximum similarity (usage sketch below)
- Added `OverlapRefinery` for adding overlap context to the chunks. The `chunk_overlap` parameter will be deprecated in the future in favor of `OverlapRefinery` (usage sketch below)
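
A minimal sketch of the new `auto` thresholding mode. The exact parameter name, default embedding model, and chunk attributes are assumptions based on this release's description; check docs.chonkie.ai for the actual signature.

```python
from chonkie import SemanticChunker

text = "Chonkie chunks text. It can now pick a similarity threshold on its own. No more manual tuning."

# Passing "auto" instead of a float lets the chunker derive the threshold
# statistically from the observed similarity range (assumed parameter name).
chunker = SemanticChunker(similarity_threshold="auto")

for chunk in chunker.chunk(text):
    print(chunk.text, chunk.token_count)
```

And a sketch of adding overlap context as a post-processing step with `OverlapRefinery` instead of setting `chunk_overlap` on the chunker. The `context_size` parameter and `refine()` method are assumptions, not a verified signature.

```python
from chonkie import SentenceChunker, OverlapRefinery

long_document = "First sentence. Second sentence. Third sentence. " * 100

# Chunk first, without relying on chunk_overlap (slated for deprecation).
chunker = SentenceChunker(chunk_size=512)
chunks = chunker.chunk(long_document)

# Then refine the finished chunks to attach overlap context to each one.
refinery = OverlapRefinery(context_size=64)
chunks_with_context = refinery.refine(chunks)
```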
What's Changed
- [Fix] AutoEmbeddings not loading `all-minilm-l6-v2` but loads `All-MiniLM-L6-V2` by @bhavnicksm in #57
- [Fix] Add fix for #55 by @bhavnicksm in #58
- [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by @bhavnicksm in #60
- [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by @bhavnicksm in #62
- [Update] Change default embedding model in SemanticChunkers by @bhavnicksm in #63
- Add `min_chunk_size` to SDPMChunker + Lint codebase with ruff + minor changes by @bhavnicksm in #68
- Added automated testing using Github Actions by @pratyushmittal in #66
- Add support for automated testing with Github Actions by @bhavnicksm in #69
- [Fix] Allow for functions as token_counters in BaseChunkers by @bhavnicksm in #70
- Add TEVL to speed up sentence chunker by @bhavnicksm in #71
- Add TEVL to speed up sentence chunking by @bhavnicksm in #72
- Update the docs path to docs.chonkie.ai by @bhavnicksm in #75
- [FEAT] Add BaseRefinery and OverlapRefinery support by @bhavnicksm in #77
- Add support for BaseRefinery and OverlapRefinery + minor changes by @bhavnicksm in #78
- [FEAT] Add "auto" threshold configuration via Statistical analysis in SemanticChunker + minor fixes by @bhavnicksm in #79
- [Fix] Unify dataclasses under a types.py for ease by @bhavnicksm in #80
- Expose the separation delimiter for simple multilingual chunking by @bhavnicksm in #81
- Bump version to v0.2.2 for release by @bhavnicksm in #82
New Contributors
- @pratyushmittal made their first contribution in #66
Full Changelog: v0.2.1...v0.2.2