
Tokenizer's Interfaces Cleanup #7001

Merged: 3 commits merged into dotnet:main on Feb 16, 2024

Conversation

tarekgh (Member) commented Feb 15, 2024

This update encompasses the following:

  • Introduced the Tokenizer.GetEncodedIdsCount API, essential for supporting key scenarios, and implemented it in all supported tokenizers.
  • Customized the implementation of EncodeToIds and GetEncodedIdsCount in other tokenizer models such as Bpe and EnglishRoberta, to improve the performance of these APIs when invoked through those tokenizers.
  • Removed a couple of APIs that did not align well with our interfaces.
  • Performed cleanup and performance optimizations in various sections across the supported tokenizers.
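As a rough illustration of the new counting API, here is a hedged usage sketch. The tokenizer construction and exact signatures below are assumptions for illustration, not taken from this PR:

```csharp
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Hypothetical usage sketch; the vocab/merges paths are placeholders.
Tokenizer tokenizer = new Tokenizer(new Bpe("vocab.json", "merges.txt"));

string text = "Hello world";

// New API from this PR: counts the encoded ids without the cost of
// building and returning the id list itself.
int count = tokenizer.GetEncodedIdsCount(text);

// The pre-existing path that materializes the ids.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);

// Both paths should agree on the number of ids.
System.Diagnostics.Debug.Assert(count == ids.Count);
```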

tarekgh (Member, Author) commented Feb 15, 2024

@michaelgsharp I'd appreciate it if you could review the changes. I removed a couple of APIs you introduced earlier and provided a workaround for their usage. Thank you!


codecov bot commented Feb 15, 2024

Codecov Report

Attention: 104 lines in your changes are missing coverage. Please review.

Comparison is base (64523e8) 68.81% compared to head (7c61933) 68.80%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7001      +/-   ##
==========================================
- Coverage   68.81%   68.80%   -0.02%     
==========================================
  Files        1258     1258              
  Lines      250477   250652     +175     
  Branches    25576    25602      +26     
==========================================
+ Hits       172377   172468      +91     
- Misses      71473    71553      +80     
- Partials     6627     6631       +4     
| Flag | Coverage Δ |
|---|---|
| Debug | 68.80% <65.21%> (-0.02%) ⬇️ |
| production | 63.27% <59.84%> (-0.02%) ⬇️ |
| test | 88.44% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| src/Microsoft.ML.Tokenizers/Model/Word.cs | 84.37% <100.00%> (+0.62%) ⬆️ |
| test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs | 100.00% <100.00%> (ø) |
| ...crosoft.ML.Tokenizers.Tests/EnglishRobertaTests.cs | 100.00% <100.00%> (ø) |
| test/Microsoft.ML.Tokenizers.Tests/TitokenTests.cs | 100.00% <100.00%> (ø) |
| src/Microsoft.ML.TorchSharp/Roberta/QATrainer.cs | 78.37% <66.66%> (+0.07%) ⬆️ |
| src/Microsoft.ML.Tokenizers/Model/Cache.cs | 40.98% <50.00%> (-3.64%) ⬇️ |
| src/Microsoft.ML.Tokenizers/Model/Model.cs | 0.00% <0.00%> (-7.70%) ⬇️ |
| src/Microsoft.ML.Tokenizers/Model/BPE.cs | 75.00% <81.25%> (+4.31%) ⬆️ |
| src/Microsoft.ML.Tokenizers/Tokenizer.cs | 83.40% <75.00%> (+1.93%) ⬆️ |
| src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs | 47.46% <59.09%> (+1.65%) ⬆️ |
| ... and 1 more | |

... and 2 files with indirect coverage changes

Comment on lines +423 to +441
```csharp
internal List<Token> TokenizeWithCache(string sequence)
{
    Word word;

    if (Cache is not null)
    {
        if (Cache.TryGet(sequence, out word))
        {
            return WordToTokens(ref word);
        }

        word = MergeWord(sequence);
        Cache.Set(sequence, word);
    }
    else
    {
        word = MergeWord(sequence);
    }

    return WordToTokens(ref word);
}
```
Member commented:

Can this whole method just be:

```csharp
List<Token> result = new();
TokenizeToIdsWithCache(sequence, result);
return result;
```

?

tarekgh (Member, Author) replied Feb 15, 2024:
No. TokenizeToIdsWithCache(sequence, result) fills only the ids, removing the overhead of materializing the full Token objects. Note the difference between the implementations of TokenizeToIdsWithCache and TokenizeWithCache: the first calls WordToIds while the second calls WordToTokens.
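For contrast, here is a hedged sketch of what the id-only cached path could look like, mirroring the TokenizeWithCache shape in the diff above. The signature and helper names are assumptions inferred from this thread, not verified against BPE.cs:

```csharp
// Sketch only: fills accumulatedIds directly via WordToIds, skipping
// the allocation of Token objects that WordToTokens would perform.
internal void TokenizeToIdsWithCache(string sequence, IList<int> accumulatedIds)
{
    Word word;

    if (Cache is not null)
    {
        if (!Cache.TryGet(sequence, out word))
        {
            word = MergeWord(sequence);
            Cache.Set(sequence, word);
        }
    }
    else
    {
        word = MergeWord(sequence);
    }

    WordToIds(ref word, accumulatedIds);
}
```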

```csharp
/// <param name="sequence">The sequence to split.</param>
/// <param name="isSpecialToken">Indicate if the token is a special token.</param>
/// <param name="accumulatedIds">The list of accumulated tokenized Ids.</param>
public override void TokenizeToIds(string sequence, bool isSpecialToken, IList<int> accumulatedIds) => TokenizeToIds(sequence, accumulatedIds);
```
Member commented:
When do we use the word Tokenize vs the word Encode?

tarekgh (Member, Author) replied:

We use the Encode name at the Tokenizer class level, which processes the entire input text. At the level of the tokenizer's models, we use the Tokenize name, which operates on smaller, pre-tokenized text units.
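The naming split described above can be sketched roughly as follows. This is an illustrative assumption about the call flow; the actual Tokenizer and Model member signatures may differ:

```csharp
// Tokenizer level: Encode handles the entire input text.
public IReadOnlyList<Token> Encode(string text)
{
    var tokens = new List<Token>();

    // The text is first split into smaller units...
    foreach (Split split in PreTokenizer.PreTokenize(text))
    {
        // ...and the model-level Tokenize operates on each
        // pre-tokenized unit.
        tokens.AddRange(Model.Tokenize(split.TokenString));
    }

    return tokens;
}
```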

@stephentoub stephentoub (Member) left a comment:

There are a few things that can be tweaked further, but that can be done in a follow-up.

@michaelgsharp michaelgsharp merged commit 4635a86 into dotnet:main Feb 16, 2024
25 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 18, 2024