Add the Tokenizer object logic #1874
Conversation
def decode(self, request: DecodeRequest) -> DecodeRequestResult:
No action: we should have an eventual plan to remove `tokenize` and `decode` from this class, e.g. by introducing `AutoTokenizer` or `TokenizerFactory`.
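For illustration only, a registry-based factory along these lines could take over the lookup; every name below is an assumption, not code from this PR:

```python
from typing import Callable, Dict


class TokenizerFactory:
    """Hypothetical factory mapping tokenizer names to constructors,
    so that Client would no longer need tokenize/decode at all."""

    _registry: Dict[str, Callable[[], "Tokenizer"]] = {}

    @classmethod
    def register(cls, name: str, constructor: Callable[[], "Tokenizer"]) -> None:
        cls._registry[name] = constructor

    @classmethod
    def get_tokenizer(cls, name: str) -> "Tokenizer":
        if name not in cls._registry:
            raise ValueError(f"Unknown tokenizer: {name}")
        return cls._registry[name]()
```

Callers would then resolve a tokenizer by name instead of going through a client.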
I added a TODO for this.
Looks mostly good! I have just one high level restructure request - see the comment on `CachableTokenizer`. I can re-review once that's done.
Looks good at a high level. Will re-review tokenizers after the structure is finalized.
scripts/cache/fix_anthropic_cache.py
No action required: We should probably delete some of these scripts eventually so that we don't have to keep maintaining them...
I agree
I assume the tokenizer files are just copy-pasted, but let me know if there is anything specific I should look at.
Most of them, yes. I added a Cache for LitGPT and made a few syntax changes in most of them, but they should behave the same.
Could you help me with that? This might be related to the error I faced below.
… and `_decode_raw_response_to_text`
Cache Test Results: I tried to check if we were reusing the existing Cache. Here are the results: everything seems to work except for the …
Cache test looks good, thanks!
handle_module_not_found_error(e)

class LitGPTTokenizer(CachingTokenizer):
Note that the Lit-GPT tokenizer was a singleton before but is not a singleton here.
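If the singleton behavior matters (e.g. because the underlying tokenizer is expensive to construct), a module-level cache is one way to restore it. This is just a sketch: the accessor name is made up, and it assumes the `LitGPTTokenizer` class from this diff is importable with a no-argument constructor:

```python
import threading

_lit_gpt_tokenizer = None
_lit_gpt_lock = threading.Lock()


def get_lit_gpt_tokenizer() -> "LitGPTTokenizer":
    """Build the tokenizer once and reuse it across calls (the old behavior)."""
    global _lit_gpt_tokenizer
    with _lit_gpt_lock:
        if _lit_gpt_tokenizer is None:
            _lit_gpt_tokenizer = LitGPTTokenizer()  # constructor args elided in this sketch
        return _lit_gpt_tokenizer
```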
This PR introduces the `Tokenizer` object and deprecates the `tokenize` and `decode` methods of `Client`. Here are the main changes:

- `Client` is now pure abstract and there is a new `CachableClient` which requires a `cache_config` as well as a `tokenizer`. `CachableClient` still implements the `tokenize` and `decode` methods by calling its tokenizer and raising a warning (see the first sketch after this list).
- `Tokenizer` is introduced. It's pure abstract and should implement the two previous methods. Most tokenizers actually inherit from `CachableTokenizer`, which handles all the caching and formatting of the requests and responses (see the second sketch after this list). A few more details:
  - `use_encode_in_cache_key` is an attribute that controls the cache key. As some tokenizers return both the ids and the strings, we keep both to make the Cache more powerful.
  - Some tokenizers do not use the `encode` argument in the request, which is a problem (`AI21Tokenizer`, `HTTPModelTokenizer`, `LitGPTTokenizer`, `SimpleTokenizer`).
  - `TiktokenTokenizer` supports `decode` for `tokens` being a `List[str]` while it should be a `List[int]`. This is needed: I remember we were having issues without it, but it's something that should be fixed (see the third sketch after this list).
- `AutoClient` is now the only client that inherits directly from `Client` and not `CachableClient`. The `_get_tokenizer_client() -> Client` method is deleted and replaced by a simpler `_get_tokenizer() -> Tokenizer`. It still supports the `tokenize` and `decode` methods for now (which is not ideal and should be removed in a future PR).