v2.0.3: Improvements to tokenizer caching and serialization, plus various bug fixes
✨ New features and improvements
- Require Thinc v6.10.1 to fix GPU installation and beam parsing.
- Improve Turkish stop words.
- Improve Hindi stop words.
🔴 Bug fixes
- Fix issue #1248: Update English tokenizer and norm exceptions for "-in" and "-in'" verbs.
- Fix issue #1506: Fix KeyError from cleaning up strings during Language.pipe (work in progress).
- Fix issue #1521: Ensure path in Doc.to_disk and Doc.from_disk.
- Fix issue #1525, #1582: Update fastText example to accommodate whitespace.
- Fix issue #1541: Remove broken link from documentation.
- Fix issue #1546: Add missing import to make util.minibatch work correctly.
- Fix issue #1557: Add dummy serialization methods to Japanese tokenizer to allow saving and loading models.
- Fix caching in Tokenizer (partially addresses performance regression in #1371 and #1508).
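
For readers unfamiliar with the minibatching helper mentioned in #1546, the general idea can be sketched in plain Python. This is a stand-in for illustration only, not spaCy's implementation: the name `minibatch_sketch` and its `size` default are assumptions, and it only shows how an iterable can be grouped into fixed-size batches.

```python
from itertools import islice


def minibatch_sketch(items, size=8):
    """Yield successive lists of at most `size` items from any iterable.

    Hypothetical helper for illustration; not spaCy's util.minibatch.
    """
    iterator = iter(items)
    while True:
        # Pull up to `size` items; an empty list means the input is exhausted.
        batch = list(islice(iterator, size))
        if not batch:
            break
        yield batch


if __name__ == "__main__":
    for batch in minibatch_sketch(range(10), size=4):
        print(batch)
    # [0, 1, 2, 3]
    # [4, 5, 6, 7]
    # [8, 9]
```

The last batch may be shorter than `size`, which is usually acceptable when batching training examples.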
📖 Documentation and examples
- Add "Videos" section to resources.
- Update training tips and advice section.
- Re-add python -m to CLI commands to ensure cross-platform compatibility.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @MathiasDesch, @mcsalgado, @Wahib, @ligser, @abhi18av, @DuyguA, @KMLDS and @yogendrasoni for the pull requests and contributions.