Releases · marijnkoolen/fuzzy-search

07 Jan 09:46

v2.4.4 Latest

Improve tokenizers (introduced in 2.4.1)

Improve efficiency of default tokenizer
Add an option to RegExTokenizer to use either split_pattern (the pattern that separates tokens; its matches are removed) or token_pattern (the pattern that matches tokens; its matches are retained)
Make boundary tokens zero-length so that the character indexes of text tokens correspond to positions in the original text
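
The difference between the two pattern options can be sketched with the standard `re` module alone. This is a minimal illustration, not the RegExTokenizer implementation: the function names, default patterns, and the `(token, offset)` return shape are assumptions made for the example.

```python
import re


def tokens_from_token_pattern(text, token_pattern=r"\w+"):
    """Keep every match of token_pattern, paired with its character offset."""
    return [(m.group(), m.start()) for m in re.finditer(token_pattern, text)]


def tokens_from_split_pattern(text, split_pattern=r"\s+"):
    """Drop every match of split_pattern and keep the text in between."""
    tokens = []
    start = 0
    for m in re.finditer(split_pattern, text):
        if m.start() > start:
            # Text between the previous separator and this one is a token.
            tokens.append((text[start:m.start()], start))
        start = m.end()
    if start < len(text):
        # Trailing text after the last separator.
        tokens.append((text[start:], start))
    return tokens


text = "Hello, fuzzy  world!"
by_token = tokens_from_token_pattern(text)
by_split = tokens_from_split_pattern(text)
print(by_token)
print(by_split)
```

With these hypothetical defaults, the token pattern discards punctuation while the split pattern keeps it attached to the neighbouring word. In both cases each offset points into the unmodified input, which is the property that zero-length boundary tokens are meant to preserve.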

Full Changelog: 2.4.3...v2.4.4

20 Dec 13:53

v2.4.0

Add a vocabulary that allows setting distractor pairs (common text terms that match phrase terms), so that pairs of text tokens and phrase tokens can be pruned early.
Switch to python-Levenshtein for faster Levenshtein distance computation and early stopping.
Add an option to pad text and phrase tokens with boundary characters (#). This increases matches when the first or last character matches, and for very short words (shorter than the ngram size).
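
The effect of boundary padding on character n-grams can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper `char_ngrams`, its parameters, and the choice of `n - 1` padding characters on each side are hypothetical.

```python
def char_ngrams(term, n=2, pad=False, pad_char="#"):
    """Return the character n-grams of term, optionally padded at both ends."""
    if pad:
        # Assumption: pad with n - 1 boundary characters on each side.
        term = pad_char * (n - 1) + term + pad_char * (n - 1)
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def shared_ngrams(a, b, n=2, pad=False):
    """Count the distinct n-grams that two terms have in common."""
    return len(set(char_ngrams(a, n, pad)) & set(char_ngrams(b, n, pad)))


# A word shorter than the ngram size yields no n-grams unless it is padded.
short_plain = char_ngrams("a", n=2)
short_padded = char_ngrams("a", n=2, pad=True)

# A matching first character contributes an extra shared n-gram when padded.
overlap_plain = shared_ngrams("cat", "car")
overlap_padded = shared_ngrams("cat", "car", pad=True)
print(short_plain, short_padded, overlap_plain, overlap_padded)
```

In this sketch, "cat" and "car" share one bigram without padding and two with padding, because the shared first letter produces the common bigram `#c`.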

Full Changelog: v2.3.0...v2.4.0
