How do I use a different tokenizer? #1048
-
I have expanded the tokenizer vocabulary in order to train Sundanese into Llama models, but I'm not sure how to use the custom tokenizer with Axolotl. I understand that there are these options:
What exactly do I have to put for these options in order to use my custom tokenizer? And where do I put the special_tokens_map.json, tokenizer.model, and config file for Axolotl to use them? As far as I understand, I need to change the word embeddings and language model head for a different tokenizer to work, as described in the Chinese-LLaMA-Alpaca project's arXiv paper:
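The embedding/LM-head resizing step mentioned above can be sketched with plain NumPy. This is a minimal illustration, not Axolotl's actual code: it grows an embedding matrix and initializes the new rows with the mean of the existing embeddings, a common heuristic for added tokens (the exact initialization in the paper may differ). In practice with Hugging Face transformers, `model.resize_token_embeddings(len(tokenizer))` handles both the input embeddings and a tied LM head for you.

```python
import numpy as np

def resize_embeddings(embed: np.ndarray, new_vocab_size: int) -> np.ndarray:
    """Grow an embedding matrix to new_vocab_size rows.

    New rows are initialized to the mean of the existing embeddings,
    a common heuristic for newly added tokens. (Illustrative sketch;
    not Axolotl's implementation.)
    """
    old_vocab_size, dim = embed.shape
    if new_vocab_size <= old_vocab_size:
        return embed
    mean_row = embed.mean(axis=0)
    new_rows = np.tile(mean_row, (new_vocab_size - old_vocab_size, 1))
    return np.vstack([embed, new_rows])

# Toy example: 4-token vocab with 8-dim embeddings, expanded to 6 tokens.
embed = np.random.randn(4, 8)
resized = resize_embeddings(embed, 6)
print(resized.shape)  # (6, 8)
```

The same resize has to be applied to the language-model head (or happens implicitly when the head is weight-tied to the embeddings), otherwise the model cannot produce logits for the new token IDs.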
Does Axolotl not support custom tokenizers? Do I need to use my own Python code for this? Any help is appreciated. Thank you!

UPDATE: I figured it out, I just didn't understand the explanation of the YAML options. You're supposed to put the path to the tokenizer folder containing the config and model file in the
My suggestion for Axolotl is to change the explanation: instead of saying to put the tokenizer config in that option, say to put the path to the tokenizer folder.
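For reference, a minimal sketch of what the relevant part of the YAML might look like. The option name `tokenizer_config` and the model/path values here are assumptions for illustration; check the current Axolotl documentation for the exact key.

```yaml
base_model: meta-llama/Llama-2-7b-hf   # illustrative base model

# Point this at the *folder* that contains tokenizer.model,
# special_tokens_map.json, and tokenizer_config.json --
# not at an individual file. (Option name assumed; verify
# against the current Axolotl docs.)
tokenizer_config: ./my-expanded-tokenizer
```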
-
@Nero10578 Do you see good results from using a different/custom tokenizer for Llama models?
-
Thanks. I've made a PR to address this: #1323