Skip to content

More API feedback on Tokenizer and friends #7013

@stephentoub

Description

@stephentoub

Tokenizer:

  • Tokenizer is mutable after creation? e.g. you can set its PreTokenizer, Normalizer, and Decoder after it's created. If the intent is for a Tokenizer to be created once for a process and shared, is that mutability warranted / wise? Should the constructor accept a Decoder (it doesn't today) and then the properties made get-only? Or at a minimum should the properties be made init instead of set?
  • There are a bunch of parameters named "sequence", with XML comments then describing them as "text". Would "text" be a more obvious name for the parameter?
  • I find the double-negative of "bool skipSpecialTokens = false" confusing: that being false means special tokens should be considered. Would it be better as "bool considerSpecialTokens = true" or something similar? I also don't know what it means to "skip special tokens"... they're not actually being skipped, right? They might still be returned as tokens, they're just not treated any differently from all other tokens. If we were actually "skipping special tokens", I'd expect that the implementation would always be looking for them and just not yielding / counting / etc. them when discovered, and I don't believe that's what's happening.
  • Tokenizer supports both Encode and Decode, but the result from Encode is a TokenizerResult type. Shouldn't that type be named something specific to encoding, since it's not relevant to other operations Tokenizer exposes?
  • What is the purpose of TrainFromFiles? It seems to produce an IReadOnlyList that's then entirely ignored. Should this method be deleted entirely?

Token:

  • Why is this mutable?
  • Typically in .NET APIs that deal with slices of things specify an offset and a length, typically as separate properties. These APIs are specifying starting and ending indices, as a tuple, named Index and End. That feels very inconsistent. I'd have expected an Offset/Index property and a Length property. Same feedback for Split and any other similar types.

PreTokenizer:

  • The parameter to PreTokenizer is called "sentence". Why is it called "sentence" here, "sequence" in other places, "text" or "input" in various comments, etc. Must this actually be a "sentence"?

NormalizedString:

  • It's not clear to me what the int[] mapping is. I see the description in the comment, but it's not clear enough. I'm guessing that mapping[i] for i being a position in original is the corresponding index into normalizedString?
  • The constructor parameters' names don't match the property names, e.g. ctor parameter is normalizedString but the property is Normalized, ctor parameter is mapping but property is NormalizedToOriginalMapping, etc.

Trainer:

  • Does anyone use this Trainer stuff? Can it be deleted?
  • All the Progress stuff seems only relevant to that. Can it be deleted, too?

Model:

  • Methods should start with a verb. TokenToId and IdToToken don't follow the typical guidance.
  • Why is GetVocabSize needed? Can't someone just use GetVocab().Count? Seems like that's what all the implementations do.
  • Why is GetVocab a method rather than a Vocab (or Vocabulary) property?
  • Are Save and GetTrainer needed?

TikToken:

  • tikTokenBpeFile parameter should be bpeFilePath or something like that. If it's just called file, folks might infer that to mean it's the contents of the file.
  • specialTokensEncoder is a strange name. It's just a dictionary and doesn't do any encoding. Should it just be specialTokens?
  • tikTokenBpeFileStream parameter suggests this must be a "FileStream"... it needn't be. This should just be bpeStream or something like that.

I suggest we take the surface area of Microsoft.ML.Tokenizers through an API review to get more eyes on it.
cc: @terrajobst, @tarekgh

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions