How do I use a different tokenizer? #1048
-
I have expanded the tokenizer vocabulary in order to train Sundanese into Llama models, but I'm not sure how to use the custom tokenizer with Axolotl. I understand that there are these options:
What exactly do I have to put for these options in order to use my custom tokenizer? And where do I put the special_tokens_map.json, tokenizer.model, and config file for Axolotl to use them? As far as I understand, I need to change the word embeddings and language model head for a different tokenizer to work, as described in the Chinese-LLaMA-Alpaca project's arXiv paper:
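The embedding/LM-head resizing step mentioned above can be sketched with plain NumPy. This is a minimal illustration, not Axolotl's actual code: it grows an embedding matrix and initializes the new rows with the mean of the existing embeddings, a common heuristic for added tokens (the exact initialization in the paper may differ). In practice with Hugging Face transformers, `model.resize_token_embeddings(len(tokenizer))` handles both the input embeddings and a tied LM head for you.

```python
import numpy as np

def resize_embeddings(embed: np.ndarray, new_vocab_size: int) -> np.ndarray:
    """Grow an embedding matrix to new_vocab_size rows.

    New rows are initialized to the mean of the existing embeddings,
    a common heuristic for newly added tokens. (Illustrative sketch;
    not Axolotl's implementation.)
    """
    old_vocab_size, dim = embed.shape
    if new_vocab_size <= old_vocab_size:
        return embed
    mean_row = embed.mean(axis=0)
    new_rows = np.tile(mean_row, (new_vocab_size - old_vocab_size, 1))
    return np.vstack([embed, new_rows])

# Toy example: 4-token vocab with 8-dim embeddings, expanded to 6 tokens.
embed = np.random.randn(4, 8)
resized = resize_embeddings(embed, 6)
print(resized.shape)  # (6, 8)
```

The same resize has to be applied to the language-model head (or happens implicitly when the head is weight-tied to the embeddings), otherwise the model cannot produce logits for the new token IDs.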
Does Axolotl not support custom tokenizers? Do I need to use my own Python code for this? Any help is appreciated. Thank you!

UPDATE: I figured it out, I just didn't understand the explanation of the YAML options. You're supposed to put the path to the tokenizer folder containing the config and model file in the
My suggestion for Axolotl is to change the explanation: instead of saying to put the tokenizer config in that option, say to put the path to the tokenizer folder.
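For reference, a minimal sketch of what the relevant part of the YAML might look like. The option name `tokenizer_config` and the model/path values here are assumptions for illustration; check the current Axolotl documentation for the exact key.

```yaml
base_model: meta-llama/Llama-2-7b-hf   # illustrative base model

# Point this at the *folder* that contains tokenizer.model,
# special_tokens_map.json, and tokenizer_config.json --
# not at an individual file. (Option name assumed; verify
# against the current Axolotl docs.)
tokenizer_config: ./my-expanded-tokenizer
```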
-
@Nero10578 Do you see good results from using a different/custom tokenizer for Llama models?
-
Thanks. I've made a PR to address this: #1323