v2.4.4

@marijnkoolen marijnkoolen released this 07 Jan 09:46

Improve tokenizers (introduced in 2.4.1)

  • Improve efficiency of default tokenizer
  • Add an option to RegExTokenizer to use either split_pattern (the pattern that separates tokens; the matched text is removed) or token_pattern (the pattern that matches the tokens themselves; the matched text is retained)
  • Make boundary tokens have length zero so that the character indexes of text tokens correspond to their positions in the original text
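
The difference between the two pattern options, and the effect of zero-length boundary tokens, can be sketched with Python's standard `re` module. This is an illustration only and does not use the library's RegExTokenizer: the function names `tokens_by_split`, `tokens_by_match` and `with_boundaries`, the default patterns, and the `(token, char_index)` tuple representation are all assumptions made for this example.

```python
import re


def tokens_by_split(text, split_pattern=r"\s+"):
    """Return (token, char_index) pairs; text matched by split_pattern is dropped."""
    tokens = []
    start = 0
    for sep in re.finditer(split_pattern, text):
        if sep.start() > start:
            tokens.append((text[start:sep.start()], start))
        start = sep.end()
    if start < len(text):
        tokens.append((text[start:], start))
    return tokens


def tokens_by_match(text, token_pattern=r"\w+"):
    """Return (token, char_index) pairs; only text matched by token_pattern is kept."""
    return [(m.group(), m.start()) for m in re.finditer(token_pattern, text)]


def with_boundaries(tokens, text):
    """Add hypothetical zero-length boundary tokens at the start and end.

    Because the boundary tokens are empty strings, they do not shift the
    character indexes of the real tokens.
    """
    return [("", 0)] + list(tokens) + [("", len(text))]


text = "Hello, world!"
split_tokens = tokens_by_split(text)   # punctuation stays attached to tokens
match_tokens = tokens_by_match(text)   # punctuation is not part of any token
bounded = with_boundaries(match_tokens, text)

# Every token, including the boundary tokens, maps back to the original text.
for token, index in bounded:
    assert text[index:index + len(token)] == token
```

With the whitespace split pattern, punctuation remains part of the tokens ("Hello," and "world!"), whereas the word-character token pattern keeps only "Hello" and "world". In both cases the recorded indexes point at the token's position in the original string.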

Full Changelog: 2.4.3...v2.4.4