Improve tokenizers (introduced in 2.4.1)
- Improve efficiency of default tokenizer
- Add option to RegExTokenizer to use split_pattern (the pattern that separates tokens, and that will be removed) or token_pattern (the pattern for tokens, and that will be retained)
- Make boundary tokens have length zero so that char indexes of text tokens correspond to original text
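
The difference between the two pattern options, and the effect of zero-length boundary tokens, can be illustrated with a minimal sketch. It uses only Python's standard `re` module and is not the library's RegExTokenizer; the function names and the `<s>` / `</s>` boundary markers are hypothetical, chosen only for illustration.

```python
import re

text = "Hello, world! Bye."

def spans_by_token_pattern(text, token_pattern):
    # Each match IS a token and is retained, with its char offsets.
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(token_pattern, text)]

def spans_by_split_pattern(text, split_pattern):
    # Each match is a SEPARATOR and is removed; the text between
    # separators becomes the tokens.
    spans, pos = [], 0
    for m in re.finditer(split_pattern, text):
        if m.start() > pos:
            spans.append((text[pos:m.start()], pos, m.start()))
        pos = m.end()
    if pos < len(text):
        spans.append((text[pos:], pos, len(text)))
    return spans

def with_boundaries(spans, text_len):
    # Hypothetical boundary markers. They span zero characters, so the
    # offsets of the real tokens still index into the original text.
    return [("<s>", 0, 0)] + spans + [("</s>", text_len, text_len)]

by_token = spans_by_token_pattern(text, r"\w+")
by_split = spans_by_split_pattern(text, r"\W+")
tokens = with_boundaries(by_token, len(text))
```

For this input, both approaches produce the same three tokens, and slicing the original text with each token's offsets returns the token itself. The boundary markers have equal start and end offsets, so they do not shift the offsets of the text tokens.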
Full Changelog: 2.4.3...v2.4.4