Tokenizer's Interfaces Cleanup #7001
Conversation
@michaelgsharp I would appreciate it if you could review the changes. I have removed a couple of APIs you introduced earlier and provided a workaround for their usage. Thank you!
Codecov Report

Additional details and impacted files:

@@ Coverage Diff @@
## main #7001 +/- ##
==========================================
- Coverage 68.81% 68.80% -0.02%
==========================================
Files 1258 1258
Lines 250477 250652 +175
Branches 25576 25602 +26
==========================================
+ Hits 172377 172468 +91
- Misses 71473 71553 +80
- Partials 6627 6631 +4
Flags with carried forward coverage won't be shown.
internal List<Token> TokenizeWithCache(string sequence)
{
    Word word;

    if (Cache is not null)
    {
        if (Cache.TryGet(sequence, out word))
        {
            return WordToTokens(ref word);
        }

        word = MergeWord(sequence);
        Cache.Set(sequence, word);
    }
    else
    {
        word = MergeWord(sequence);
    }

    return WordToTokens(ref word);
}

internal List<Token> WordToTokens(ref Word word)
{
    List<Token> tokens = new(word.SymbolsCount);

    foreach (Token token in word.GetIterator(VocabReverse))
    {
        tokens.Add(token);
    }

    return tokens;
}
Can this whole method just be:
List<Token> result = new();
TokenizeToIdsWithCache(sequence, result);
return result;
?
No. TokenizeToIdsWithCache(sequence, result); fills only the Ids, avoiding the overhead of materializing full tokens. Note the difference in implementation between TokenizeToIdsWithCache and TokenizeWithCache: the first calls WordToIds while the second calls WordToTokens.
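To make the distinction concrete, here is a minimal, self-contained sketch of the two cache paths. MiniBpe, Token, and the char-to-id "merge" below are hypothetical stand-ins, not the actual ML.NET types: the point is that the Ids path appends raw integers into a caller-supplied list, while the tokens path allocates a Token object per symbol.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token shape: an id plus its surface form.
record Token(int Id, string Value);

class MiniBpe
{
    // Stand-in for the merge cache: maps a sequence to its merged ids.
    private readonly Dictionary<string, int[]> _cache = new();

    // Stand-in for MergeWord: maps each char to a fake id (its code point).
    private int[] MergeWord(string sequence)
    {
        var ids = new int[sequence.Length];
        for (int i = 0; i < sequence.Length; i++) ids[i] = sequence[i];
        return ids;
    }

    // Tokens path: materializes a Token object per symbol.
    public List<Token> TokenizeWithCache(string sequence)
    {
        if (!_cache.TryGetValue(sequence, out int[]? word) || word is null)
        {
            word = MergeWord(sequence);
            _cache[sequence] = word;
        }

        var tokens = new List<Token>(word.Length);
        for (int i = 0; i < word.Length; i++)
            tokens.Add(new Token(word[i], sequence[i].ToString()));
        return tokens;
    }

    // Ids path: appends raw ids only, no per-token allocations.
    public void TokenizeToIdsWithCache(string sequence, IList<int> accumulatedIds)
    {
        if (!_cache.TryGetValue(sequence, out int[]? word) || word is null)
        {
            word = MergeWord(sequence);
            _cache[sequence] = word;
        }

        foreach (int id in word) accumulatedIds.Add(id);
    }
}
```

Both paths share the same cache lookup; they differ only in how the cached word is projected, which is why one cannot simply be defined in terms of the other without paying the token-materialization cost.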
/// <param name="sequence">The sequence to split.</param>
/// <param name="isSpecialToken">Indicate if the token is a special token.</param>
/// <param name="accumulatedIds">The list of accumulated tokenized Ids.</param>
public override void TokenizeToIds(string sequence, bool isSpecialToken, IList<int> accumulatedIds) => TokenizeToIds(sequence, accumulatedIds);
When do we use the word Tokenize vs the word Encode?
We use the Encode name at the Tokenizer class level to process the entire input text. At the level of the tokenizer's models, we use the Tokenize name, which operates on smaller, pre-tokenized text units.
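The layering described above can be sketched as follows. MiniTokenizer, IModel, and the whitespace pre-tokenization are illustrative assumptions, not the real library types: Encode lives at the tokenizer level and handles the whole input, while Tokenize lives at the model level and sees one pre-tokenized unit at a time.

```csharp
using System;
using System.Collections.Generic;

// Model-level contract: operates on a single pre-tokenized unit.
interface IModel
{
    List<int> Tokenize(string unit);
}

// Toy model: maps each char of a unit to its code point.
class CharModel : IModel
{
    public List<int> Tokenize(string unit)
    {
        var ids = new List<int>(unit.Length);
        foreach (char c in unit) ids.Add(c);
        return ids;
    }
}

class MiniTokenizer
{
    private readonly IModel _model;
    public MiniTokenizer(IModel model) => _model = model;

    // Tokenizer-level API: pre-tokenizes the entire input text
    // (here: naive whitespace split), then delegates each unit
    // to the model's Tokenize.
    public List<int> Encode(string text)
    {
        var ids = new List<int>();
        foreach (string unit in text.Split(' ', StringSplitOptions.RemoveEmptyEntries))
            ids.AddRange(_model.Tokenize(unit));
        return ids;
    }
}
```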
There are a few things that can be tweaked further, but that can be done in a follow-up.
This update encompasses the following:

- Added the Tokenizer.GetEncodedIdsCount API, essential for supporting crucial scenarios, and implemented it in all supported tokenizers.
- EncodeToIds and GetEncodedIdsCount have been customized for other tokenizer models like Bpe and EnglishRoberta. This adaptation aims to enhance the performance of these APIs specifically when invoked from those respective tokenizers.
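The performance motivation for a count-only API can be illustrated with a small sketch. CountingTokenizer and its whitespace/char encoding are hypothetical, not the ML.NET implementation: the idea is that GetEncodedIdsCount reports how many ids the input would encode to without allocating the result list, which is useful for pre-sizing buffers or checking model input limits.

```csharp
using System;
using System.Collections.Generic;

class CountingTokenizer
{
    // Full path: materializes the list of ids (one per char here).
    public List<int> EncodeToIds(string text)
    {
        var ids = new List<int>();
        foreach (string unit in text.Split(' ', StringSplitOptions.RemoveEmptyEntries))
            foreach (char c in unit)
                ids.Add(c);
        return ids;
    }

    // Count-only path: same traversal, but no list allocation.
    public int GetEncodedIdsCount(string text)
    {
        int count = 0;
        foreach (string unit in text.Split(' ', StringSplitOptions.RemoveEmptyEntries))
            count += unit.Length;
        return count;
    }
}
```

The two methods must stay in lockstep over the same traversal, which is why giving each tokenizer model its own specialized implementation (rather than one generic fallback) pays off.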