This discussion was converted from issue #1213 on October 23, 2024 21:30.
Hello,
I noticed that Megatron-LM now supports TikTokenizer through the --tokenizer-type argument, but I do not know what I should pass to --tokenizer-model. I checked the source code and found that it expects a JSON file, which the function referenced below converts to Tiktoken format.
Megatron-LM/megatron/training/tokenizer/tokenizer.py
Line 581 in 772faca
The comment says "Reload our tokenizer JSON file and convert it to Tiktoken format." What does "our tokenizer JSON" mean? What format should the JSON file have?