v2.0.3: Improvements to tokenizer caching and serialization, plus various bug fixes

@ines released this 15 Nov 15:50

✨ New features and improvements

  • Require Thinc v6.10.1 to fix GPU installation and beam parsing.
  • Improve Turkish stop words.
  • Improve Hindi stop words.

🔴 Bug fixes

  • Fix issue #1248: Update English tokenizer and norm exceptions for "-in" and "-in'" verbs.
  • Fix issue #1506: Fix KeyError from cleaning up strings during Language.pipe (work in progress).
  • Fix issue #1521: Ensure path in Doc.to_disk and Doc.from_disk.
  • Fix issue #1525, #1582: Update fastText example to accommodate whitespace.
  • Fix issue #1541: Remove broken link from documentation.
  • Fix issue #1546: Add missing import to make util.minibatch work correctly.
  • Fix issue #1557: Add dummy serialization methods to Japanese tokenizer to allow saving and loading models.
  • Fix caching in Tokenizer (partially addresses performance regression in #1371 and #1508).
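
For readers unfamiliar with what util.minibatch is for, here is a minimal, standalone sketch of batching an iterable into fixed-size chunks. It is not spaCy's implementation: the name `minibatch_sketch` and its `size` parameter are hypothetical and chosen only for illustration, and the sketch uses nothing beyond the standard library.

```python
import itertools


def minibatch_sketch(items, size=8):
    """Yield successive lists of up to `size` items from any iterable.

    Hypothetical helper for illustration only; not spaCy's util.minibatch.
    """
    iterator = iter(items)
    while True:
        # islice consumes at most `size` items from the shared iterator.
        batch = list(itertools.islice(iterator, size))
        if not batch:
            # The iterator is exhausted, so stop generating batches.
            break
        yield batch


if __name__ == "__main__":
    for batch in minibatch_sketch(["a", "b", "c", "d", "e"], size=2):
        print(batch)
```

Because the helper relies on `itertools.islice`, leaving out the `import itertools` line makes it fail with a NameError on first use, which is the general kind of failure a missing import produces.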

📖 Documentation and examples

👥 Contributors

Thanks to @MathiasDesch, @mcsalgado, @Wahib, @ligser, @abhi18av, @DuyguA, @KMLDS and @yogendrasoni for the pull requests and contributions.