Releases: huggingface/tokenizers

Python v0.11.5

16 Feb 12:08

[#895] Add wheel support for Python 3.10

Rust v0.11.1

17 Jan 09:00
  • [#882] Fixing Punctuation deserialize without argument.
  • [#868] Fixing missing direction in TruncationParams
  • [#860] Adding TruncationSide to TruncationParams

Python v0.11.3

17 Jan 09:30
  • [#882] Fixing Punctuation deserialize without argument.
  • [#868] Fixing missing direction in TruncationParams
  • [#860] Adding TruncationSide to TruncationParams

Node v0.8.2

17 Jan 21:33

[#884] Fixing bad deserialization following inclusion of a default for Punctuation

Node v0.8.1

17 Jan 08:59

Fixing various backward compatibility bugs (old serialized files couldn't be deserialized anymore).

Python v0.11.4

17 Jan 21:32
Pre-release

[#884] Fixing bad deserialization following inclusion of a default for Punctuation

Python v0.11.2

04 Jan 13:59
Pre-release

Fixes #868

Python v0.11.1

28 Dec 13:06

[#860] Adding TruncationSide to TruncationParams.
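As a rough illustration of what a truncation side means, the plain-Python sketch below cuts a list of token ids from either end. It is a conceptual example only, not the library's implementation: the function name `truncate_ids` and its `side` parameter are hypothetical and chosen just for this sketch.

```python
from typing import List


def truncate_ids(ids: List[int], max_length: int, side: str = "right") -> List[int]:
    """Keep at most `max_length` ids, dropping the excess from the given side.

    Hypothetical helper for illustration; not part of the tokenizers API.
    """
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    if len(ids) <= max_length:
        return list(ids)
    if side == "right":
        # Drop the excess from the end: the beginning of the sequence is kept.
        return list(ids[:max_length])
    if side == "left":
        # Drop the excess from the start: the end of the sequence is kept.
        return list(ids[len(ids) - max_length:])
    raise ValueError(f"unknown side: {side!r}")


ids = [1, 2, 3, 4, 5]
right = truncate_ids(ids, 3, side="right")
left = truncate_ids(ids, 3, side="left")
```

Truncating on the left can be useful when the end of a sequence carries the most relevant context, for example the latest turns of a dialogue.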

Python v0.11.0

24 Dec 09:15

Fixed

  • [#585] Conda version should now work on old CentOS
  • [#844] Fixing interaction between is_pretokenized and trim_offsets.
  • [#851] Doc links

Added

  • [#657]: Add SplitDelimiterBehavior customization to Punctuation constructor
  • [#845]: Documentation for Decoders.
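To give a feel for what customizing the split behavior in #657 can change, here is a simplified plain-Python sketch of three ways to treat punctuation when splitting. The behavior names used below (`isolated`, `removed`, `merged_with_previous`) and the helper `split_punctuation` are assumptions made for this illustration. The sketch only handles ASCII punctuation and does not try to reproduce the library's handling of edge cases such as consecutive punctuation marks.

```python
import re
import string

# Character class matching one ASCII punctuation mark; the capture group
# makes re.split keep the delimiters in its output.
PUNCT = re.compile("([" + re.escape(string.punctuation) + "])")


def split_punctuation(text: str, behavior: str = "isolated") -> list:
    """Split `text` on punctuation (hypothetical helper, illustration only)."""
    parts = [p for p in PUNCT.split(text) if p]
    if behavior == "isolated":
        # Each punctuation mark becomes its own piece.
        return parts
    if behavior == "removed":
        # Punctuation marks are dropped from the output.
        return [p for p in parts if not PUNCT.fullmatch(p)]
    if behavior == "merged_with_previous":
        # Each punctuation mark is appended to the piece before it.
        merged = []
        for p in parts:
            if PUNCT.fullmatch(p) and merged:
                merged[-1] += p
            else:
                merged.append(p)
        return merged
    raise ValueError(f"unknown behavior: {behavior!r}")


isolated = split_punctuation("Hey, you!", "isolated")
removed = split_punctuation("Hey, you!", "removed")
merged = split_punctuation("Hey, you!", "merged_with_previous")
```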

Changed

  • [#850]: Added a feature gate to enable disabling http features
  • [#718]: Fix WordLevel tokenizer determinism during training
  • [#762]: Add a way to specify the unknown token in SentencePieceUnigramTokenizer
  • [#770]: Improved documentation for UnigramTrainer
  • [#780]: Add Tokenizer.from_pretrained to load tokenizers from the Hugging Face Hub
  • [#793]: Saving a pretty JSON file by default when saving a tokenizer
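For #793, the difference between a compact and a pretty JSON file can be sketched with the Python standard library alone. The `config` dictionary below is a made-up stand-in, not the actual layout of a saved tokenizer file, and the indentation width is an arbitrary choice for this example.

```python
import json

# Hypothetical, heavily simplified stand-in for a serialized tokenizer.
config = {
    "version": "1.0",
    "model": {"type": "WordLevel", "vocab": {"[UNK]": 0, "hello": 1}},
}

# Compact form: a single line with no extra whitespace.
compact = json.dumps(config, separators=(",", ":"))

# Pretty form: one key per line, indented, easier to read and to diff.
pretty = json.dumps(config, indent=2)

# Both forms decode to the same data; only the whitespace differs.
same_data = json.loads(compact) == json.loads(pretty)
```

The pretty form is larger on disk, but it produces readable line-by-line diffs when the file is kept under version control.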

Node v0.8.0

02 Sep 18:12

BREAKING CHANGES

  • Many improvements on the Trainer (#519).
    The files must now be provided first when calling tokenizer.train(files, trainer).

Features

  • Adding the TemplateProcessing
  • Add WordLevel and Unigram models (#490)
  • Add nmtNormalizer and precompiledNormalizer normalizers (#490)
  • Add templateProcessing post-processor (#490)
  • Add digitsPreTokenizer pre-tokenizer (#490)
  • Add support for mapping to sequences (#506)
  • Add splitPreTokenizer pre-tokenizer (#542)
  • Add behavior option to the punctuationPreTokenizer (#657)
  • Add the ability to load tokenizers from the Hugging Face Hub using fromPretrained (#780)

Fixes

  • Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
  • Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)