
Tokenizer's Interfaces Cleanup #7001

Merged: 3 commits merged into dotnet:main on Feb 16, 2024

Conversation

tarekgh (Member) commented Feb 15, 2024

This update encompasses the following:

  • Introduced the Tokenizer.GetEncodedIdsCount API, essential for supporting key scenarios, and implemented it in all supported tokenizers.
  • Customized the implementation of EncodeToIds and GetEncodedIdsCount in other tokenizer models such as Bpe and EnglishRoberta, to improve the performance of these APIs when invoked through those tokenizers.
  • Removed a couple of APIs that did not align well with our interfaces.
  • Performed cleanup and performance optimizations in various sections across the supported tokenizers.
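As a rough illustration of the new counting API, here is a hedged usage sketch. The tokenizer construction and exact signatures below are assumptions for illustration, not taken from this PR:

```csharp
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Hypothetical usage sketch; the vocab/merges paths are placeholders.
Tokenizer tokenizer = new Tokenizer(new Bpe("vocab.json", "merges.txt"));

string text = "Hello world";

// New API from this PR: counts the encoded ids without the cost of
// building and returning the id list itself.
int count = tokenizer.GetEncodedIdsCount(text);

// The pre-existing path that materializes the ids.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);

// Both paths should agree on the number of ids.
System.Diagnostics.Debug.Assert(count == ids.Count);
```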

tarekgh (Member, Author) commented Feb 15, 2024

@michaelgsharp I'd appreciate it if you could review the changes. I removed a couple of APIs you introduced earlier and provided a workaround for their usage. Thank you!


codecov bot commented Feb 15, 2024

Codecov Report

Attention: 104 lines in your changes are missing coverage. Please review.

Comparison is base (64523e8) 68.81% compared to head (7c61933) 68.80%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7001      +/-   ##
==========================================
- Coverage   68.81%   68.80%   -0.02%     
==========================================
  Files        1258     1258              
  Lines      250477   250652     +175     
  Branches    25576    25602      +26     
==========================================
+ Hits       172377   172468      +91     
- Misses      71473    71553      +80     
- Partials     6627     6631       +4     
| Flag | Coverage Δ |
|---|---|
| Debug | 68.80% <65.21%> (-0.02%) ⬇️ |
| production | 63.27% <59.84%> (-0.02%) ⬇️ |
| test | 88.44% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| src/Microsoft.ML.Tokenizers/Model/Word.cs | 84.37% <100.00%> (+0.62%) ⬆️ |
| test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs | 100.00% <100.00%> (ø) |
| ...crosoft.ML.Tokenizers.Tests/EnglishRobertaTests.cs | 100.00% <100.00%> (ø) |
| test/Microsoft.ML.Tokenizers.Tests/TitokenTests.cs | 100.00% <100.00%> (ø) |
| src/Microsoft.ML.TorchSharp/Roberta/QATrainer.cs | 78.37% <66.66%> (+0.07%) ⬆️ |
| src/Microsoft.ML.Tokenizers/Model/Cache.cs | 40.98% <50.00%> (-3.64%) ⬇️ |
| src/Microsoft.ML.Tokenizers/Model/Model.cs | 0.00% <0.00%> (-7.70%) ⬇️ |
| src/Microsoft.ML.Tokenizers/Model/BPE.cs | 75.00% <81.25%> (+4.31%) ⬆️ |
| src/Microsoft.ML.Tokenizers/Tokenizer.cs | 83.40% <75.00%> (+1.93%) ⬆️ |
| src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs | 47.46% <59.09%> (+1.65%) ⬆️ |
| ... and 1 more | |

... and 2 files with indirect coverage changes

Comment on lines +423 to +441
```csharp
internal List<Token> TokenizeWithCache(string sequence)
{
    Word word;

    if (Cache is not null)
    {
        if (Cache.TryGet(sequence, out word))
        {
            return WordToTokens(ref word);
        }

        word = MergeWord(sequence);
        Cache.Set(sequence, word);
    }
    else
    {
        word = MergeWord(sequence);
    }

    return WordToTokens(ref word);
}
```
Member commented:

Can this whole method just be:

```csharp
List<Token> result = new();
TokenizeToIdsWithCache(sequence, result);
return result;
```

?

tarekgh (Member, Author) replied Feb 15, 2024:
No. TokenizeToIdsWithCache(sequence, result) fills only the ids, removing the overhead of materializing the full Token objects. Note the difference between the implementations of TokenizeToIdsWithCache and TokenizeWithCache: the first calls WordToIds while the second calls WordToTokens.
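For contrast, here is a hedged sketch of what the id-only cached path could look like, mirroring the TokenizeWithCache shape in the diff above. The signature and helper names are assumptions inferred from this thread, not verified against BPE.cs:

```csharp
// Sketch only: fills accumulatedIds directly via WordToIds, skipping
// the allocation of Token objects that WordToTokens would perform.
internal void TokenizeToIdsWithCache(string sequence, IList<int> accumulatedIds)
{
    Word word;

    if (Cache is not null)
    {
        if (!Cache.TryGet(sequence, out word))
        {
            word = MergeWord(sequence);
            Cache.Set(sequence, word);
        }
    }
    else
    {
        word = MergeWord(sequence);
    }

    WordToIds(ref word, accumulatedIds);
}
```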

```csharp
/// <param name="sequence">The sequence to split.</param>
/// <param name="isSpecialToken">Indicate if the token is a special token.</param>
/// <param name="accumulatedIds">The list of accumulated tokenized Ids.</param>
public override void TokenizeToIds(string sequence, bool isSpecialToken, IList<int> accumulatedIds) => TokenizeToIds(sequence, accumulatedIds);
```
Member commented:
When do we use the word Tokenize vs the word Encode?

tarekgh (Member, Author) replied:

We use the Encode name at the Tokenizer class level, which processes the entire input text. At the level of the tokenizer's models, we use the Tokenize name, which operates on smaller, pre-tokenized text units.
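The naming split described above can be sketched roughly as follows. This is an illustrative assumption about the call flow; the actual Tokenizer and Model member signatures may differ:

```csharp
// Tokenizer level: Encode handles the entire input text.
public IReadOnlyList<Token> Encode(string text)
{
    var tokens = new List<Token>();

    // The text is first split into smaller units...
    foreach (Split split in PreTokenizer.PreTokenize(text))
    {
        // ...and the model-level Tokenize operates on each
        // pre-tokenized unit.
        tokens.AddRange(Model.Tokenize(split.TokenString));
    }

    return tokens;
}
```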

@stephentoub stephentoub (Member) left a comment:

There are a few things that can be tweaked further, but that can be done in a follow-up.

@michaelgsharp michaelgsharp merged commit 4635a86 into dotnet:main Feb 16, 2024
25 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 18, 2024