Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow tokenizer to customize stop_tokens #84

Merged
merged 1 commit into from
May 17, 2024
Merged

Allow tokenizer to customize stop_tokens #84

merged 1 commit into from
May 17, 2024

Conversation

qihqi
Copy link
Collaborator

@qihqi qihqi commented May 17, 2024

No description provided.

@qihqi qihqi requested review from bhavya01 and JoeZijunZhou May 17, 2024 05:07
@qihqi qihqi requested a review from vipannalla as a code owner May 17, 2024 05:07
@qihqi qihqi force-pushed the hanq_dev branch 3 times, most recently from 442972c to af8ae17 Compare May 17, 2024 16:59
@@ -349,6 +349,11 @@ def bos_id(self) -> int:
"""ID of the BOS token."""
return self.vocab.bos_id

@property
Copy link
Contributor

@FanhaiLu1 FanhaiLu1 May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we want to return different stop_tokens, we don't need this function in child class.

Remove the stop_tokens function here does not affect behavior. By default, it will call parent stop_tokens.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -395,6 +400,11 @@ def decode(self, token_ids: list[int]) -> str:
"""
return self.tokenizer.decode(token_ids)

@property
Copy link
Contributor

@FanhaiLu1 FanhaiLu1 May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, remove this function in child class will not affect behavior.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def stop_tokens(self) -> set[int]:
"""ID of the stop token."""
return {self.eos_id, self.pad_id}


Copy link
Contributor

@FanhaiLu1 FanhaiLu1 May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JoeZijunZhou What is TikToken? (some tokenizer from tiktok or? )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's llama3's tokenizer (apperently made by OpenAI)

@qihqi qihqi requested a review from FanhaiLu1 May 17, 2024 17:46
@FanhaiLu1 FanhaiLu1 merged commit 8128c8a into main May 17, 2024
3 checks passed
@FanhaiLu1 FanhaiLu1 deleted the hanq_dev branch May 17, 2024 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants