Releases: huggingface/tokenizers
Python v0.11.5
[#895] Add wheel support for Python 3.10
Rust v0.11.1
Python v0.11.3
Node v0.8.2
[#884] Fixing bad deserialization following inclusion of a default for Punctuation
Node v0.8.1
Fixing various backward compatibility bugs (old serialized files couldn't be deserialized anymore).
Python v0.11.4
[#884] Fixing bad deserialization following inclusion of a default for Punctuation
Python v0.11.2
Fixes #868
Python v0.11.1
[#860] Adding `TruncationSide` to `TruncationParams`.
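To make the effect of a truncation side concrete, here is a minimal plain-Python sketch. It does not use or reproduce the library's code: the `truncate` helper and its `side` parameter are hypothetical names chosen for illustration, assuming that "right" drops items from the end of the sequence and "left" drops them from the start.

```python
from typing import List


def truncate(ids: List[int], max_length: int, side: str = "right") -> List[int]:
    """Keep at most `max_length` items, dropping the excess from `side`.

    Hypothetical helper for illustration only, not the library's API.
    """
    if side not in ("left", "right"):
        raise ValueError(f"unknown side: {side!r}")
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    if len(ids) <= max_length:
        return list(ids)
    if side == "right":
        # Drop the tail and keep the first `max_length` items.
        return ids[:max_length]
    # Drop the head and keep the last `max_length` items.
    return ids[len(ids) - max_length:]


if __name__ == "__main__":
    token_ids = [1, 2, 3, 4, 5]
    print(truncate(token_ids, 3, side="right"))  # [1, 2, 3]
    print(truncate(token_ids, 3, side="left"))   # [3, 4, 5]
```

Left-side truncation is typically useful when the most recent part of the input matters most, for example the end of a long conversation history.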
Python v0.11.0
Fixed
- [#585] Conda version should now work on old CentOS
- [#844] Fixing interaction between `is_pretokenized` and `trim_offsets`.
- [#851] Doc links
Added
- [#657]: Add SplitDelimiterBehavior customization to Punctuation constructor
- [#845]: Documentation for `Decoders`.
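As a rough illustration of what a configurable delimiter behavior on a punctuation splitter means (#657), the plain-Python sketch below splits text on ASCII punctuation in three different ways. It is not the library's implementation: the function `split_on_punctuation` is hypothetical, the behavior names are assumptions chosen for this example, and only ASCII punctuation from `string.punctuation` is handled.

```python
import string
from typing import List

# Behavior names are assumptions made for this sketch.
BEHAVIORS = ("isolated", "removed", "merged_with_previous")


def split_on_punctuation(text: str, behavior: str = "isolated") -> List[str]:
    """Split `text` at ASCII punctuation according to `behavior`.

    Hypothetical helper for illustration only.
    """
    if behavior not in BEHAVIORS:
        raise ValueError(f"unknown behavior: {behavior!r}")

    pieces: List[str] = []
    current = ""
    for ch in text:
        if ch not in string.punctuation:
            current += ch
            continue
        if behavior == "merged_with_previous":
            # The delimiter is glued to the piece that precedes it.
            pieces.append(current + ch)
        else:
            if current:
                pieces.append(current)
            if behavior == "isolated":
                # The delimiter becomes a piece of its own.
                pieces.append(ch)
            # For "removed", the delimiter is simply dropped.
        current = ""
    if current:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    for behavior in BEHAVIORS:
        print(behavior, split_on_punctuation("Hi, you!", behavior))
```

Note that whitespace is left untouched here, so the space before "you" stays attached to its piece.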
Changed
- [#850]: Added a feature gate to enable disabling `http` features
- [#718]: Fix `WordLevel` tokenizer determinism during training
- [#762]: Add a way to specify the unknown token in `SentencePieceUnigramTokenizer`
- [#770]: Improved documentation for `UnigramTrainer`
- [#780]: Add `Tokenizer.from_pretrained` to load tokenizers from the Hugging Face Hub
- [#793]: Saving a pretty JSON file by default when saving a tokenizer
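The difference between a pretty and a compact JSON file (#793) can be sketched with Python's standard `json` module. This is only an illustration of the two output styles, not the library's serialization code: the `serialize` helper, its `pretty` flag, and the example `config` dictionary are all hypothetical.

```python
import json


def serialize(config: dict, pretty: bool = True) -> str:
    """Return `config` as JSON, indented when `pretty` is true.

    Hypothetical helper for illustration only.
    """
    if pretty:
        # Multi-line, indented output that is easy to read and diff.
        return json.dumps(config, indent=2, ensure_ascii=False)
    # Single-line output with no extra whitespace.
    return json.dumps(config, separators=(",", ":"), ensure_ascii=False)


if __name__ == "__main__":
    # Made-up configuration, not a real tokenizer file.
    config = {"version": "1.0", "model": {"type": "WordLevel", "unk_token": "[UNK]"}}
    print(serialize(config))
    print(serialize(config, pretty=False))
```

Both forms parse back to the same data; the pretty form trades a larger file for readability.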
Node v0.8.0
BREAKING CHANGES
- Many improvements on the Trainer (#519).
The files must now be provided first when calling `tokenizer.train(files, trainer)`.
Features
- Adding the `TemplateProcessing`
- Add `WordLevel` and `Unigram` models (#490)
- Add `nmtNormalizer` and `precompiledNormalizer` normalizers (#490)
- Add `templateProcessing` post-processor (#490)
- Add `digitsPreTokenizer` pre-tokenizer (#490)
- Add support for mapping to sequences (#506)
- Add `splitPreTokenizer` pre-tokenizer (#542)
- Add `behavior` option to the `punctuationPreTokenizer` (#657)
- Add the ability to load tokenizers from the Hugging Face Hub using `fromPretrained` (#780)