⚠️⚠️ [T5Tokenize] Fix T5 family tokenizers ⚠️⚠️ #24565
Conversation
The documentation is not available anymore as the PR was closed or merged.
This can also be made non-breaking with a flag. Up for debate, since it is a bug fix.
Thanks for the fix! Let's roll with it since it's a bug fix and if people complain about the breaking change we will see if we add a flag to enable the buggy behavior.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Edit: just to make sure, I did more testing and unfortunately there is one bug:

```python
>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁', '▁Hello', '<extra_id_0>']
```

instead of

```python
>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁Hello', '<extra_id_0>']
```

This is because we have to prepend a prefix space before tokenizing.
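For anyone who wants to compare the two behaviours directly, a sketch along these lines should work once the `legacy` flag from this PR is available; the `t5-base` checkpoint name is only an example and the printed token lists are not reproduced here:

```python
from transformers import T5Tokenizer

# legacy=True keeps the previous behaviour, legacy=False opts into the fix from this PR
legacy_tok = T5Tokenizer.from_pretrained("t5-base", legacy=True)
fixed_tok = T5Tokenizer.from_pretrained("t5-base", legacy=False)

print(legacy_tok.tokenize("Hello <extra_id_0>"))
print(fixed_tok.tokenize("Hello <extra_id_0>"))
```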
I'm getting this legacy behaviour warning when simply loading a T5 tokenizer - it appears even before the tokenizer is used. Is there an updated way to load the tokenizer? The warning appears as soon as I run `from transformers import AutoTokenizer` and load the tokenizer.
Yep, just set `legacy=False` when loading the tokenizer.
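Concretely, something like the following should silence the warning by opting into the fixed behaviour (the checkpoint name here is only an example):

```python
from transformers import AutoTokenizer

# Explicitly passing legacy=False opts into the fixed tokenization added in this PR
# and silences the "default legacy behaviour" warning.
tokenizer = AutoTokenizer.from_pretrained("t5-base", legacy=False)
```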
This PR was later referenced from an instructlab/instructlab pull request (updating the adapter-file path split to use the last occurrence of "." on macOS, resolving #2356 in that repo); the training logs attached there show the warning that points back to this PR:

> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
What does this PR do?
Fixes the `T5Tokenizer` (not the fast one yet). At the same time, this addresses part of #11531.

When converting `UMT5` I created a reproduction snippet for any t5x model from the original repo. I realized that a very, very small variation in the input completely changes the output for non-finetuned models. The issue lies in the way we process `<extra_id_xx>`.

Example:
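The original example snippet is not preserved here; the sketch below illustrates the kind of discrepancy described, where the checkpoint name, the input string, and the token lists in the comments are assumptions rather than the original output:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # assumed checkpoint

# With the buggy behaviour, an extra "▁" token appears after the special token,
# roughly: ['▁Hey', '<extra_id_0>', '▁', '.', '▁how', '▁are', '▁you']
# whereas one would expect: ['▁Hey', '<extra_id_0>', '.', '▁how', '▁are', '▁you']
print(tokenizer.tokenize("Hey <extra_id_0>. how are you"))
```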
The reason is that t5x wraps around `sentencepiece` and adds the extra ids to the vocab, but they are not saved that way. We don't add them to the vocab, so when we tokenize we split on special tokens, and thus the sentencepiece model only sees:
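The fragment that was shown at this point is not preserved; the sketch below illustrates the idea by calling the raw sentencepiece model directly (the model path, the fragment, and the resulting tokens are assumptions):

```python
import sentencepiece as spm

# T5's sentencepiece model file; the local path here is an assumption.
sp = spm.SentencePieceProcessor(model_file="spiece.model")

# After splitting "Hey <extra_id_0>. how are you" on the special token,
# sentencepiece is handed only the remaining fragment on its own:
print(sp.encode(". how are you", out_type=str))
# Because sentencepiece adds a dummy prefix, a fragment starting with "."
# comes back with an extra leading "▁" - the extra space described above.
```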
While the original model never sees a `.` (or a lot of other characters) alone, and thus we add an extra space... This is a bug fix with regard to training; it is breaking in the sense that it should remove the space.
TODO:
- `tokenizer.encode(". Hello")`, as it removes the prefix space that is normally added (see the sketch below).
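A hedged illustration of that remaining edge case (the checkpoint name is an example and the exact output is not asserted here):

```python
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base", legacy=False)

# The TODO above: for an input that starts with a character such as ".",
# the new code path can drop the prefix space that sentencepiece would
# normally add in front of the text.
print(tok.tokenize(". Hello"))
print(tok.encode(". Hello"))
```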