Adding ByteFallback support for `tokenizers`. #1183

Narsil · 2023-03-17T14:51:51Z

Two items added:

- A flag `byte_fallback` for the `BPE` model. This will be in charge of using byte tokens such as `<0x61>` instead of unk on unknown tokens.
- A `ByteFallback` decoder, which will be in charge of putting everything back into a string whenever possible, showing � when the byte decoding fails (behavior checked against `LlamaTokenizer` in `transformers`).

Fixes #929
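As a rough illustration of the decoder side described above, the sketch below collects consecutive `<0xNN>` tokens, decodes them as UTF-8, and emits � when decoding fails. It is plain Python, not the library's Rust implementation; the function name `byte_fallback_decode`, the regex, and the choice of one � per undecodable byte are assumptions made for this example.

```python
import re

# Matches byte tokens such as "<0x61>" (exactly two hex digits).
BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def byte_fallback_decode(tokens):
    """Turn runs of <0xNN> tokens back into text; other tokens pass through."""
    out = []
    pending = bytearray()  # bytes collected from consecutive byte tokens

    def flush():
        if not pending:
            return
        try:
            out.append(pending.decode("utf-8"))
        except UnicodeDecodeError:
            # Assumption: one replacement character per undecodable byte.
            out.append("\ufffd" * len(pending))
        pending.clear()

    for token in tokens:
        match = BYTE_TOKEN.fullmatch(token)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(token)
    flush()
    return "".join(out)


print(byte_fallback_decode(["Hi", "<0xE5>", "<0x95>", "<0x8A>"]))  # valid UTF-8 run
print(byte_fallback_decode(["<0xE5>", "!"]))  # truncated UTF-8 sequence
```

The first call prints text because the three bytes form a complete UTF-8 sequence; the second prints a replacement character followed by `!` because the lone byte cannot be decoded.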

sgugger

Thanks for doing this so quickly!

HuggingFaceDocBuilderDev · 2023-03-17T15:06:34Z

The documentation is not available anymore as the PR was closed or merged.

theblackcat102 · 2023-03-21T14:45:11Z

Can confirm this works with byte_fallback set to true for the Llama SentencePiece tokenizer.

>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("theblackcat102/llama-7b-test")

without byte fallback

>>> tokenizer.encode("你好啊測試測試").ids
[29871, 30919, 31076, 0]

with byte fallback

>>> tokenizer.encode("你好啊測試測試").ids
[29871, 30919, 31076, 232, 152, 141, 233, 187, 175, 235, 172, 169, 233, 187, 175, 235, 172, 169]
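The trailing ids in the second output can be checked by hand. Assuming the byte tokens `<0x00>` to `<0xFF>` sit at ids 3 to 258 in this vocabulary (an assumption about the Llama vocab layout, not something stated in this thread), subtracting 3 from each id gives the raw UTF-8 bytes:

```python
# Ids copied from the "with byte fallback" output above, without the first
# three ids, which are shared with the run without byte fallback.
fallback_ids = [232, 152, 141, 233, 187, 175, 235, 172, 169,
                233, 187, 175, 235, 172, 169]

# Assumption: byte token <0xNN> has id NN + 3 in this vocabulary.
BYTE_ID_OFFSET = 3

raw = bytes(i - BYTE_ID_OFFSET for i in fallback_ids)
recovered = raw.decode("utf-8")
print(recovered)
```

Under that assumption the 15 ids decode to 啊測試測試, which is the part of the input that collapsed to the single id 0 in the run without byte fallback.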

Narsil · 2023-03-22T10:51:38Z

The byte fallback is working as advertised.

However, I still find some differences in tokenization between llama spm and tokenizers after converting one to the other.

It's usually a simple matter of proper configuration, but this particular model is showcasing odd issues, which require much lower-level investigation than usual. Hang tight.

* Adding Llama FastTokenizer support.

- Requires huggingface/tokenizers#1183 version
- Only support byte_fallback for llama, raise otherwise (safety net).
- Lots of questions are special tokens

How to test:

```python
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from transformers import AutoTokenizer
from tokenizers import Tokenizer

tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b")

if False:
    new_tokenizer = Tokenizer.from_file("tok.json")
else:
    new_tokenizer = convert_slow_tokenizer(tokenizer)
    new_tokenizer.save("tok.json")

strings = [
    "This is a test",
    "生活的真谛是",
    "生活的真谛是[MASK]。",
    # XXX: This one is problematic because of special tokens
    # "<s> Something something",
]

for string in strings:
    encoded = tokenizer(string)["input_ids"]
    encoded2 = new_tokenizer.encode(string).ids

    assert encoded == encoded2, f"{encoded} != {encoded2}"

    decoded = tokenizer.decode(encoded)
    decoded2 = new_tokenizer.decode(encoded2)

    assert decoded.strip() == decoded2, f"{repr(decoded)} != {repr(decoded2)}"
```

The converter + some test script. The test script. Tmp save. Adding Fast tokenizer + tests. Adding the tokenization tests. Correct combination. Small fix. Fixing tests. Fixing with latest update. Rebased. fix copies + normalized added tokens + copies. Adding doc. TMP. Doc + split files. Doc. Versions + try import. Fix Camembert + warnings -> Error. Fix by ArthurZucker. Not a decorator.

* Fixing comments.
* Adding more to docstring.
* Doc rewriting.

kjtaed · 2023-04-11T09:34:18Z

The "byte_fallback" option does not decompose unknown UTF-8 characters into bytes.

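Is there an example of training code using byte_fallback?

from tokenizers import Tokenizer, decoders, pre_tokenizers, trainers
from tokenizers.models import BPE

# vocab_size, min_frequency, special_tokens and files are defined elsewhere
tokenizer = Tokenizer(BPE(byte_fallback=True))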
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.Metaspace(),
])
tokenizer.decoder = decoders.Sequence([
    decoders.Metaspace(),
    decoders.ByteFallback(),
])
trainer = trainers.BpeTrainer(
    vocab_size=vocab_size,
    min_frequency=min_frequency,
    special_tokens=special_tokens,
)
tokenizer.train(files, trainer=trainer)

Narsil · 2023-04-11T09:39:31Z

> The "byte_fallback" option does not decompose unknown UTF-8 characters into bytes.

No, it transforms unknown tokens (unk) into their byte tokens such as <0x0A> (one for each byte in the unk token), if those exist in the vocab.
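A minimal sketch of that rule, in plain Python rather than the library's code. The toy vocab, the helper name `encode_unknown`, and the uppercase `<0xNN>` spelling are assumptions for illustration: an unknown piece becomes one byte token per UTF-8 byte only if every one of those byte tokens is in the vocab, otherwise it stays unk.

```python
def encode_unknown(piece, vocab, unk_token="<unk>"):
    """Map an out-of-vocab piece to byte-token ids, or to unk if any is missing."""
    byte_tokens = [f"<0x{b:02X}>" for b in piece.encode("utf-8")]
    if all(tok in vocab for tok in byte_tokens):
        return [vocab[tok] for tok in byte_tokens]
    return [vocab[unk_token]]


# Toy vocab: unk at id 0, then all 256 byte tokens (hypothetical layout).
full_vocab = {"<unk>": 0}
full_vocab.update({f"<0x{b:02X}>": b + 1 for b in range(256)})

# Same vocab without byte tokens: there is nothing to fall back to.
bare_vocab = {"<unk>": 0}

print(encode_unknown("\n", full_vocab))  # one id, for <0x0A>
print(encode_unknown("啊", full_vocab))  # three ids, one per UTF-8 byte
print(encode_unknown("啊", bare_vocab))  # unk only
```

The last call shows the situation discussed in this thread: with `byte_fallback` enabled but no byte tokens in the vocab, the output is still unk.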

kjtaed · 2023-04-11T10:42:59Z

How do I add byte tokens to vocab?

Narsil · 2023-04-11T11:41:17Z

https://huggingface.co/docs/tokenizers/api/trainers

initial_alphabet=[f'<0x{x:02X}>' for x in range(256)], for instance? (You may need to add more depending on the rest of your setup.)

kjtaed · 2023-04-11T23:29:09Z

I tried that earlier, but the byte tokens are not added.

According to the docs, if the strings contain more than one character, only the first one is kept.
(https://huggingface.co/docs/tokenizers/api/trainers#tokenizers.trainers.UnigramTrainer.initial_alphabet)

I was able to add byte tokens by modifying the model file (the JSON file).
However, it was not possible to add byte tokens using the tokenizers API.

Narsil · 2023-04-12T08:32:35Z

You are linking the UnigramTrainer here, not the BPE trainer.

Which one are you using? Unigram byte fallback is not yet supported.

kjtaed · 2023-04-13T00:13:34Z

Sorry.

initial_alphabet (List[str]) — A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
(https://huggingface.co/docs/tokenizers/api/trainers#tokenizers.trainers.BpeTrainer.initial_alphabet)

Byte tokens (['<0x00>', '<0x01>', ...]) cannot be added with 'initial_alphabet'.

If you have example code for 'training BPE with byte_fallback', can you share it?

DouglasOrr · 2023-04-21T05:54:36Z

If it helps, setting special_tokens=[f"<0x{i:02X}>" for i in range(256)] seems to work.
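For reference, this is what that list looks like, together with a small check of why the `initial_alphabet` route fails. The truncation below only models the documented rule that the first character of each string is kept; it does not call the trainer.

```python
# The 256 byte tokens, spelled the way the comment above suggests.
byte_tokens = [f"<0x{i:02X}>" for i in range(256)]
print(len(byte_tokens), byte_tokens[0], byte_tokens[10], byte_tokens[-1])

# Modelling the documented initial_alphabet rule ("only the first one is kept"):
# every byte token starts with "<", so the whole list collapses to one character.
kept_as_alphabet = {token[0] for token in byte_tokens}
print(kept_as_alphabet)
```

Passed as `special_tokens=` to `trainers.BpeTrainer`, the strings are kept whole, which is consistent with the report above that this seems to work.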

chris-ha458 · 2023-04-23T01:06:39Z

Setting them as special tokens does not allow them to be trained in certain situations.
This can be changed by hand later, but there should be one of the following:

- a way to change flags for many special tokens at once (e.g. setting special=false on all 256 of them)
- a way to add tokens that won't be "lost" (if we use the add-token method before training, they may or may not be lost)
- a change to the byte_fallback behavior so that it deals with b'\x80'-like notation, since each is technically "one character"
- when byte_fallback is set, internally adding the <0x00> ~ <0xFF> tokens to the initial alphabet automatically (what sentencepiece does)

I can think of more, but these seem to be the most effective methods.

Narsil requested review from McPatate and sgugger March 17, 2023 14:52

This was referenced Mar 17, 2023

Good ideas from llama.cpp rustformers/llm#15

Closed

Implement the Byte->char hack of SPM within BPE #929

Closed

Update rustdoc.

db11669

sgugger approved these changes Mar 17, 2023

dongs0104 mentioned this pull request Mar 17, 2023

FastTokenizer for LLaMa huggingface/transformers#22114

Closed

Narsil added 5 commits March 17, 2023 16:22

Clippy + Add BPE(byte_fallback) into bindings.

f6aea71

Stupid file.

cae0261

Test artifacts removed.

8237c56

Update stub.

c75574c

Fix.

80f630e

McPatate approved these changes Mar 17, 2023

Bad file.

8e3c595

setzer22 mentioned this pull request Mar 18, 2023

Use the HuggingFace llama Tokenizer rustformers/llm#35

Closed

Narsil mentioned this pull request Mar 20, 2023

CVE-2023-22466 vuln of the component tokio which is used to compiled in tokenizers. #1184

Closed

CRITICAL FIX: wrapper order because of untagged....

a480f9f

Narsil mentioned this pull request Mar 20, 2023

Adding Llama FastTokenizer support. huggingface/transformers#22264

Merged

Remove prints.

3584036

ollmer mentioned this pull request Mar 21, 2023

added llama huggingface/text-generation-inference#130

Closed

Fixing <16 byte fallback.

a788ef8

Narsil merged commit 73637a0 into main Mar 23, 2023

Narsil deleted the byte_fallback branch March 23, 2023 15:04

xenova mentioned this pull request Apr 28, 2023

[bug] BPE roberta-large-mnli saved with .save_pretrained() incorrectly sets byte_fallback to false (should be true) #1234

Closed

chris-ha458 mentioned this pull request May 1, 2023

Add unigram bytefallback #1217

Merged
