FastTokenizer for LLaMa #22114

theblackcat102 · 2023-03-12T15:14:09Z

Feature request

FastTokenizer support for LLaMa sentencepiece tokenizer.

Motivation

The offset_mapping is only available in FastTokenizer, so it would be useful to have support for this.

Your contribution

I have tried using an existing sentencepiece-based model as a replacement. However, the HF conversion code warns that byte fallback support is missing:

The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers

This means out-of-vocabulary tokens are simply mapped to the unknown token instead of using the byte mapping inside the vocab.
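To illustrate the difference, here is a minimal sketch of byte fallback over a toy vocabulary. It is not the sentencepiece or tokenizers implementation; the function names, the toy vocab and the ids are made up for this example. The only thing taken from sentencepiece is the `<0xNN>` naming of the byte tokens.

```python
def encode_without_fallback(pieces, vocab, unk_id=0):
    """Map each piece to its id; unknown pieces collapse to the unk id."""
    return [vocab.get(piece, unk_id) for piece in pieces]


def encode_with_byte_fallback(pieces, vocab, unk_id=0):
    """Map each piece to its id; unknown pieces are split into UTF-8 byte tokens."""
    ids = []
    for piece in pieces:
        if piece in vocab:
            ids.append(vocab[piece])
            continue
        # Fall back to one <0xNN> token per UTF-8 byte of the unknown piece.
        for byte in piece.encode("utf-8"):
            ids.append(vocab.get(f"<0x{byte:02X}>", unk_id))
    return ids


# Toy vocab (hypothetical ids): <unk>, then 256 byte tokens, then two pieces.
vocab = {"<unk>": 0}
for b in range(256):
    vocab[f"<0x{b:02X}>"] = len(vocab)
for piece in ["▁hello", "▁world"]:
    vocab[piece] = len(vocab)

pieces = ["▁hello", "猫"]  # "猫" is not in the toy vocab
print(encode_without_fallback(pieces, vocab))    # [257, 0]
print(encode_with_byte_fallback(pieces, vocab))  # [257, 232, 141, 172]
```

With the fallback, the unknown character is kept as its three UTF-8 bytes (0xE7, 0x8C, 0xAB), so it can be decoded back. Without it, the information is lost in a single unk id.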

sgugger · 2023-03-13T15:12:33Z

Let's maybe wait for the LLaMa PR to be merged first?

dongs0104 · 2023-03-17T15:11:57Z

It is fixed in tokenizers:

huggingface/tokenizers#1183

github-actions · 2023-04-30T15:02:08Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed May 8, 2023
