Releases: georg-jung/FastBertTokenizer
v1.0.28
Highlights
- The API surface is considered stable now (except the parts explicitly marked as experimental).
- Support for `netstandard2.0` was added.
- Native AOT is fully supported.
- FastBertTokenizer is now almost allocation-free. That makes single-threaded encoding a bit faster and leads to larger improvements when encoding on multiple threads.
Breaking Changes
- Method signature changed: the `Encode` overload that returned `ReadOnlyMemory` now returns `Memory` instead. The old design made sense as the memory points to a buffer internal to FastBertTokenizer. onnxruntime requires `Memory` instead of `ReadOnlyMemory` though. Writing to the buffer from outside doesn't break FastBertTokenizer, so it's okay to expose the buffer as `Memory` instead of `ReadOnlyMemory` to simplify usage with onnxruntime.

```diff
- public (ReadOnlyMemory<long> InputIds, ReadOnlyMemory<long> AttentionMask, ReadOnlyMemory<long> TokenTypeIds) Encode(string input, int maximumTokens = 512, int? padTo = null)
+ public (Memory<long> InputIds, Memory<long> AttentionMask, Memory<long> TokenTypeIds) Encode(string input, int maximumTokens = 512, int? padTo = null)
```
- Some APIs are now marked as experimental. None were before, so if you use them you might need to add

```xml
<NoWarn>FBERTTOK001</NoWarn> <!-- Experimental FastBertTokenizer features -->
```

to your `csproj`.
- The methods named `Tokenize` that were previously marked as obsolete and had just been renamed to `Encode` are removed.
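Because `DenseTensor<T>` (used to build onnxruntime inputs) requires `Memory<T>` rather than `ReadOnlyMemory<T>`, the new signature lets the encoded buffers be wrapped without copying. A minimal sketch, assuming the usage shown in the project's README (the Hugging Face model id is a placeholder):

```csharp
using FastBertTokenizer;
using Microsoft.ML.OnnxRuntime.Tensors;

var tokenizer = new BertTokenizer();
await tokenizer.LoadFromHuggingFaceAsync("bert-base-uncased");

var (inputIds, attentionMask, tokenTypeIds) = tokenizer.Encode("Lorem ipsum dolor sit amet.");

// DenseTensor<T> only accepts Memory<T>, not ReadOnlyMemory<T> —
// which is why the Encode return type changed.
var shape = new[] { 1, inputIds.Length };
var inputIdsTensor = new DenseTensor<long>(inputIds, shape);
var attentionMaskTensor = new DenseTensor<long>(attentionMask, shape);
```

The tensors can then be passed to an onnxruntime `InferenceSession` as model inputs; check the repository's README for the current recommended usage.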
Other
- Fixed #39: add `Decode` support for input_id sequences that don't start at a word prefix.
v0.4.67
- The `Tokenize` methods are now called `Encode`, as that better expresses what they do.
  - The old methods still exist for now as redirects, marked as `Obsolete`.
- Add new `CreateBatchEnumerator` and `CreateAsyncBatchEnumerator` APIs that support encoding of inputs that are longer than what fits in one model input (overlap/stride).
- Make the `Encode` overload that returns `ReadOnlyMemory`s re-use its internal buffers.
- Use `FrozenDictionary` on .NET 8.
- Add support for reading configuration from `tokenizer.json` files.
- Add a `LoadFromHuggingFaceAsync` method to ease getting started.
- `Decode` produces cleaner outputs (e.g. "hello, world" instead of "hello , world").
- Produce results more similar to Hugging Face (don't unicode normalize token lookup keys).
- Greatly improve test coverage and thus correctness verification.
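The `LoadFromHuggingFaceAsync` and `Decode` additions above combine into a short round trip. A sketch, assuming a `BertTokenizer` instance and a `Decode` overload accepting the encoded ids (verify exact signatures against the current API docs):

```csharp
using FastBertTokenizer;

var tokenizer = new BertTokenizer();
// Downloads vocab/config for the given model id from Hugging Face.
await tokenizer.LoadFromHuggingFaceAsync("bert-base-uncased");

var (inputIds, attentionMask, tokenTypeIds) = tokenizer.Encode("hello, world");

// As of this release, Decode no longer inserts spaces before punctuation,
// so the output resembles "hello, world" rather than "hello , world".
string decoded = tokenizer.Decode(inputIds.Span);
```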
v0.3.29
Breaking Changes
- `PreTokenizer` is now internal instead of public, as it should have been before too.
- The publicly visible API now uses `string` instead of `ReadOnlySpan/Memory<char>`.
  - This enables better unicode normalization handling without having to create a string based on all inputs first.
New Features
- Automatically test correctness of tokenization against Hugging Face tokenizers using unit tests
- Added support for multi-threaded tokenization
  - On an 8-core notebook CPU, multi-threaded tokenization is 3x faster than single-threaded tokenization
  - In a GitHub Actions runner it is about 2x faster
- Inputs are unicode normalized prior to tokenization if (and only if) required
v0.2.7
- Deterministic builds
- Icon and readme in .nupkg
v0.1.29
Fix NuGet release notes