fix(tokenizer): handle fast tokenizer properly for bos/eos #914

NanoCode012 · 2023-12-04T16:19:46Z

The LlamaTokenizerFast docs state:

If you want to change the bos_token or the eos_token, make sure to specify them when initializing the model, or call tokenizer.update_post_processor() to make sure that the post-processing is correctly done (otherwise the values of the first token and final token of an encoded sequence will not be correct). For more details, checkout [post-processors](https://huggingface.co/docs/tokenizers/api/post-processors) documentation.

Ref: https://huggingface.co/docs/transformers/main/en/model_doc/llama#transformers.LlamaTokenizerFast
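As a rough illustration of the behaviour the docs describe, here is a minimal sketch. The helper name `set_special_tokens` and its return value are hypothetical and are not taken from this PR's diff; the only library calls assumed are `add_special_tokens()` (available on Hugging Face tokenizers) and `update_post_processor()` (which the quoted docs mention for `LlamaTokenizerFast`). The helper is duck-typed, so it skips the refresh for tokenizers that do not expose that method.

```python
def set_special_tokens(tokenizer, special_tokens):
    """Hypothetical helper (not the PR's code): apply special tokens, then
    refresh the post-processor if bos/eos changed.

    `tokenizer` is assumed to be a Hugging Face tokenizer; `special_tokens`
    is a dict such as {"bos_token": "<s>", "eos_token": "</s>"}.
    Returns True if the post-processor was refreshed, False otherwise.
    """
    tokenizer.add_special_tokens(special_tokens)

    # Only bos/eos changes affect the first/last token of an encoded sequence.
    changes_bos_or_eos = any(
        key in special_tokens for key in ("bos_token", "eos_token")
    )

    # Slow tokenizers have no update_post_processor(), so look it up defensively.
    refresh = getattr(tokenizer, "update_post_processor", None)
    if changes_bos_or_eos and callable(refresh):
        refresh()
        return True
    return False
```

With a real fast Llama tokenizer loaded via `AutoTokenizer.from_pretrained(...)`, one would expect the call to return `True` when `bos_token` or `eos_token` is passed, and encoded sequences to then begin with the new BOS token; that expectation follows from the docs quoted above rather than from anything measured here.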

fix(tokenizer): handle fast tokenizer properly for bos/eos

54fca29

winglian approved these changes Dec 7, 2023

NanoCode012 merged commit fde091c into axolotl-ai-cloud:main Dec 8, 2023
4 checks passed

NanoCode012 deleted the fix/fasttokenizer branch December 8, 2023 02:31

mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023

fix(tokenizer): handle fast tokenizer properly for bos/eos (axolotl-a…

bfe1425

…i-cloud#914)
