Allow tokenizer to customize stop_tokens #84
442972c to af8ae17
jetstream/engine/token_utils.py
Outdated
@@ -349,6 +349,11 @@ def bos_id(self) -> int:
    """ID of the BOS token."""
    return self.vocab.bos_id

  @property
Unless we want to return different stop_tokens, we don't need this function in the child class.
Removing the stop_tokens function here does not affect behavior; by default, the lookup resolves to the parent's stop_tokens.
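To illustrate the point, here is a minimal sketch of how the property resolves through inheritance. The class names `Tokenizer`, `InheritingTokenizer`, and `CustomTokenizer`, and the `extra_stop_ids` parameter, are hypothetical stand-ins and not the classes in token_utils.py; only the `stop_tokens` body mirrors the code under review.

```python
class Tokenizer:
  """Hypothetical parent class; stands in for the base tokenizer."""

  def __init__(self, eos_id: int, pad_id: int):
    self.eos_id = eos_id
    self.pad_id = pad_id

  @property
  def stop_tokens(self) -> set[int]:
    """IDs of the stop tokens."""
    return {self.eos_id, self.pad_id}


class InheritingTokenizer(Tokenizer):
  """Defines no stop_tokens; attribute lookup falls through to the parent."""


class CustomTokenizer(Tokenizer):
  """Overrides stop_tokens only because it returns a different set."""

  def __init__(self, eos_id: int, pad_id: int, extra_stop_ids):
    super().__init__(eos_id, pad_id)
    # extra_stop_ids is an assumed parameter, used here for illustration only.
    self.extra_stop_ids = set(extra_stop_ids)

  @property
  def stop_tokens(self) -> set[int]:
    # Extend the parent's set instead of duplicating its logic.
    return super().stop_tokens | self.extra_stop_ids
```

With this layout, an override that only repeats `{self.eos_id, self.pad_id}` adds nothing, which is why it can be dropped from the child class.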
done.
@@ -395,6 +400,11 @@ def decode(self, token_ids: list[int]) -> str:
    """
    return self.tokenizer.decode(token_ids)

  @property
Same here: removing this function from the child class will not affect behavior.
jetstream/engine/token_utils.py
Outdated
  def stop_tokens(self) -> set[int]:
    """ID of the stop token."""
    return {self.eos_id, self.pad_id}
@JoeZijunZhou What is TikToken? (A tokenizer from TikTok, or something else?)
It's Llama 3's tokenizer (apparently made by OpenAI).