Improve HF tokenization hack to cover multiple special tokens #649
Like some others (#620, #609, #581, #556, #555, #494, #479), I have been running into an assertion failure when my prompt contains special tokens. In particular, the existing workaround doesn't account for multiple special tokens being present in the prompt, which is necessary, for example, if you want to use the chat template provided with models such as HuggingFaceH4/zephyr-7b-beta, whose template itself contains special tokens.

I don't fully understand the nature of the workaround or why it is necessary, but after some experimentation, the changes here made it work for my use case. I am posting them here in case they help anyone experiencing the same issue, and in the hope of starting a discussion that might lead to a more robust solution. Perhaps @slundberg, who I believe implemented the original workaround, can chime in and let me know whether I am on the right track with this change.
Thanks, Shawn