⚠️⚠️ [T5Tokenize] Fix T5 family tokenizers ⚠️⚠️ #24565
Conversation
The documentation is not available anymore as the PR was closed or merged.
This can also be made non-breaking with a flag. Up for debate, since it is a bug fix.
Thanks for the fix! Let's roll with it since it's a bug fix and if people complain about the breaking change we will see if we add a flag to enable the buggy behavior.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Edit: just to make sure, I did more testing and unfortunately there is one bug:

```python
>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁', '▁Hello', '<extra_id_0>']
```

instead of

```python
>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁Hello', '<extra_id_0>']
```

This is because we have to prepend a prefix space before tokenizing.
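For anyone who wants to compare the two behaviours directly, a sketch along these lines should work once the `legacy` flag from this PR is available; the `t5-base` checkpoint name is only an example and the printed token lists are not reproduced here:

```python
from transformers import T5Tokenizer

# legacy=True keeps the previous behaviour, legacy=False opts into the fix from this PR
legacy_tok = T5Tokenizer.from_pretrained("t5-base", legacy=True)
fixed_tok = T5Tokenizer.from_pretrained("t5-base", legacy=False)

print(legacy_tok.tokenize("Hello <extra_id_0>"))
print(fixed_tok.tokenize("Hello <extra_id_0>"))
```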
I'm getting this legacy behaviour warning when simply loading a T5 tokenizer - it appears even before the tokenizer is used. Is there an updated way to load the tokenizer? The warning appears as soon as I run `from transformers import AutoTokenizer` and load the tokenizer.
Yep, just set `legacy=False` when loading the tokenizer.
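Concretely, something like the following should silence the warning by opting into the fixed behaviour (the checkpoint name here is only an example):

```python
from transformers import AutoTokenizer

# Explicitly passing legacy=False opts into the fixed tokenization added in this PR
# and silences the "default legacy behaviour" warning.
tokenizer = AutoTokenizer.from_pretrained("t5-base", legacy=False)
```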
This PR was later referenced from an instructlab/instructlab pull request (updating the adapter-file path split to use the last occurrence of "." on macOS, resolving #2356 in that repo); the training logs attached there show the warning that points back to this PR:

> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
What does this PR do?
Fixes the `T5Tokenizer` (not the fast one yet). At the same time, this addresses part of #11531.

When converting `UMT5` I created a reproduction snippet for any t5x model from the original repo. I realized that a very, very small variation in the input completely changes the output for non-finetuned models. The issue lies in the way we process `<extra_id_xx>`.

Example:
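The original example snippet is not preserved here; the sketch below illustrates the kind of discrepancy described, where the checkpoint name, the input string, and the token lists in the comments are assumptions rather than the original output:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # assumed checkpoint

# With the buggy behaviour, an extra "▁" token appears after the special token,
# roughly: ['▁Hey', '<extra_id_0>', '▁', '.', '▁how', '▁are', '▁you']
# whereas one would expect: ['▁Hey', '<extra_id_0>', '.', '▁how', '▁are', '▁you']
print(tokenizer.tokenize("Hey <extra_id_0>. how are you"))
```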
The reason is that t5x wraps around `sentencepiece` and adds the extra ids to the vocab, but they are not saved that way. We don't add them to the vocab, so when we tokenize we split on special tokens, and thus the sentencepiece model only sees:
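The fragment that was shown at this point is not preserved; the sketch below illustrates the idea by calling the raw sentencepiece model directly (the model path, the fragment, and the resulting tokens are assumptions):

```python
import sentencepiece as spm

# T5's sentencepiece model file; the local path here is an assumption.
sp = spm.SentencePieceProcessor(model_file="spiece.model")

# After splitting "Hey <extra_id_0>. how are you" on the special token,
# sentencepiece is handed only the remaining fragment on its own:
print(sp.encode(". how are you", out_type=str))
# Because sentencepiece adds a dummy prefix, a fragment starting with "."
# comes back with an extra leading "▁" - the extra space described above.
```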
While the original model never sees a `.` (or a lot of other characters) alone, and thus we add an extra space... This is a bug fix with regard to training; it is breaking in the sense that it should remove the space.
TODO:
- `tokenizer.encode(". Hello")`, as it removes the prefix space that is normally added (see the sketch below).
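A hedged illustration of that remaining edge case (the checkpoint name is an example and the exact output is not asserted here):

```python
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base", legacy=False)

# The TODO above: for an input that starts with a character such as ".",
# the new code path can drop the prefix space that sentencepiece would
# normally add in front of the text.
print(tok.tokenize(". Hello"))
print(tok.encode(". Hello"))
```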