Fix BytePair special tokens tokenization #1447

Closed
wants to merge 70 commits

Conversation

abuelnasr0
Contributor

BytePairTokenizer already tokenizes special tokens, but it had a small nit, explained in #1435.
This PR fixes it.
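
For context, a rough usage sketch of the behavior this concerns (the vocabulary and merges here are hypothetical, made up purely for illustration; `unsplittable_tokens` is the existing argument for keeping special tokens whole):

```python
import keras_nlp

# Hypothetical vocabulary/merges for illustration; real usage would load
# them from a preset or a training run.
vocab = {"<|endoftext|>": 0, "a": 1, "Ġquick": 2, "Ġfox": 3}
merges = ["Ġ q", "u i", "c k", "ui ck", "Ġq uick", "Ġ f", "o x", "Ġf ox"]

tokenizer = keras_nlp.tokenizers.BytePairTokenizer(
    vocabulary=vocab,
    merges=merges,
    unsplittable_tokens=["<|endoftext|>"],  # kept whole during splitting
)
# "<|endoftext|>" should map to a single id rather than being split
# into sub-word pieces.
tokenizer("a quick fox<|endoftext|>")
```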

@mattdangerw
Member

Thanks very much @abuelnasr0! Finally freeing up from our Gemma release. I'll try to review #1447, #1445, and #1397 as a set, but just a heads up that I'll probably post feedback next week.

In the meantime, if you are looking for something to do, we still need BloomCausalLM. I'm hoping to do some refactoring (#1425) that will make adding generative classes way easier, but no need to block on that.

@abuelnasr0
Contributor Author

@mattdangerw no problem, take your time. The Gemma release was awesome work from you and the team.
BloomCausalLM is already in my plans, but I have been a little busy. I started adding it a few days ago and will continue today. Maybe I will open a PR today.

mattdangerw and others added 18 commits April 2, 2024 22:16
We will update our samplers in the near future to push the backend-specific
compilation details out: keras-team#1425

Also in general, we want our documentation to reflect the main usage of
our classes, which is using them with Seq2SeqLM and CausalLM classes.

So with that in mind, this updates our sampler docs to show the
practical usage of the sampling classes with our modeling classes. For
the base class, we show the main use case of overriding the
`get_next_token()` function.
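
For instance, a minimal sketch of a custom sampler in that style (not the canonical docstring example; the preset name is illustrative):

```python
from keras import ops
import keras_nlp


class ArgmaxSampler(keras_nlp.samplers.Sampler):
    """A sampler that always takes the most likely token."""

    def get_next_token(self, probabilities):
        # `probabilities` is (batch_size, vocab_size); pick the argmax.
        return ops.argmax(probabilities, axis=-1)


causal_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en")
causal_lm.compile(sampler=ArgmaxSampler())
causal_lm.generate("The quick brown fox")
```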
The Keras implementation of the Gemma model was the effort of a number of
contributors:

- Initial architecture: Gabriel Rasskin, Francois Chollet, Matt Watson
- Model parallelism: Qianli Scott Zhu
- Model export for inference: Neel Kovelamudi
- Lora implementation: Francois Chollet, Samaneh Saadat
- Benchmarking: Haifeng Jin
- Interpretability extensions: Ryan Mullins
- Testing infrastructure: Ramesh Sampath

Many more helped with documentation and Kaggle integration.

Co-authored-by: Francois Chollet <francois.chollet@gmail.com>
Co-authored-by: Gabriel Rasskin <43894452+grasskin@users.noreply.github.com>
Co-authored-by: Qianli Scott Zhu <scottzhu@google.com>
Co-authored-by: Neel Kovelamudi <60985914+nkovela1@users.noreply.github.com>
Co-authored-by: Samaneh Saadat <ssaadat@google.com>
Co-authored-by: Haifeng Jin <5476582+haifeng-jin@users.noreply.github.com>
Co-authored-by: Ramesh Sampath <1437573+sampathweb@users.noreply.github.com>
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
Includes some small cleanups for the Kaggle assets.
…as-team#1471)

* Add docstring for conversion script install instructions

* Add docstring to verification script

* Change wording
We can skip these by default, for users who have not yet set them up.
We will need to set them up for CI, see
keras-team#1459
0.8 is out! We can consider our master branch a 0.9 preview.
Hi wonderful Keras folks,

I was browsing the new Gemma source and noticed that the RMSNorm code didn't use the epsilon parameter it takes in. This fixes that.

While we're here, I'm curious what drove the 1+scale multiplier (instead of just initializing scale to 1). Would love to learn if you're down to share.

Thanks,
Chris
(ex-Googler)
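For reference, a minimal sketch of the pattern being discussed (not the actual Gemma source), with the epsilon wired through and the (1 + scale) multiplier:

```python
import keras
from keras import ops


class RMSNorm(keras.layers.Layer):
    def __init__(self, epsilon=1e-6, **kwargs):
        super().__init__(**kwargs)
        self.epsilon = epsilon

    def build(self, input_shape):
        # Initialized to zeros because the layer multiplies by (1 + scale),
        # which behaves like a plain scale initialized to ones.
        self.scale = self.add_weight(
            name="scale", shape=(input_shape[-1],), initializer="zeros"
        )

    def call(self, x):
        # The bug being fixed: a hard-coded constant was used here
        # instead of the configured self.epsilon.
        var = ops.mean(ops.square(x), axis=-1, keepdims=True)
        x = x * ops.rsqrt(var + self.epsilon)
        return x * (1 + self.scale)
```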
* Add Falcon backbone.

* Add docstring.

* Add dtype.

* Add checkpoint conversion script.

* Fix tests.

* Random fixes.

* Add cache.

* Cast cumsum to int32.

* Make sublayers public.

* Address backbone comments.

* Update attention computation to use einsum.

* Falcon only works with Keras3.

* Fix tests.

* Remove falcon_causal_lm file.

* Remove commented/unused code.
* CI - Add kaggle creds to pull model

* add kaggle env variables

* Kaggle env:

* Kaggle env:

* Kaggle env:

* Kaggle env:

* Update Build script for Kokoro

* Add Kaggle env var

* set gemma preset to extra_large

* Change Gemma small preset to bfloat16

* Change Gemma small preset to xlarge
* Fix dtype accessors of tasks/backbones

* Address comments, minor fixes
sachinprasadhs and others added 26 commits April 2, 2024 22:16
* Docs(layers): add a description for `tie_weights` argument

* Refactor(layers): make `name` an explicit argument for Transformer layers

* Refactor(layers): remove explicit usage of `name` in `__init__` calls

* Docs(layers): remove references to `name` and consistently document `**kwargs`
…s-team#1397)

* Support tokenization of special tokens for word_piece_tokenizer

* Add the feature to models tokenizers

* Format the code

* Fix Format

* Small fixes

* Add tests for bert

* Add tests for distilbert

* Small fix for bert test

* Add tests for electra

* Fix code format

* Rename unsplittable to special

* Edit special_tokens Arg

* Format the code

* Move special tokens checking into base class

* Add special_tokens_in_strings Arg

* Shorten comments

* Shorten comments

* Shorten the logic of splitting and add comments

* Code format
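
A short usage sketch of the resulting API (argument names as added in this commit series; the vocabulary is illustrative):

```python
import keras_nlp

vocab = ["[UNK]", "[CLS]", "[SEP]", "the", "quick", "brown", "fox"]
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    special_tokens=["[CLS]", "[SEP]"],
    # Allow special tokens to appear in raw input strings and be
    # tokenized to single ids rather than split into sub-words.
    special_tokens_in_strings=True,
)
tokenizer("[CLS] the quick brown fox [SEP]")
```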
* Initial Kaggle upload.

* Address review comments.

* Add upload validations.

* Address review comments.

* Fix init.

* Address review comments.

* Improve error handling.

* Address review comments.
* Add scoring mode to MistralCausalLM

* Fixing names in Docstring

* Fix padding mask arg name

* Fix embedded shape in test

* Remove errant underscore in Docstring
* Add Kaggle upload validation tests.

* Use bert_tiny as test model.
…1384)

* Added ElectraBackbone

* Added backbone tests for ELECTRA

* Fix config

* Add model import to __init__

* add electra tokenizer

* add tests for tokenizer

* add __init__ file

* add tokenizer and backbone to models __init__

* Fix Failing tokenization test

* Add example on usage of the tokenizer with custom vocabulary

* Add conversion script to convert weights from checkpoint

* Add electra preprocessor

* Add presets and tests

* Add presets config with model weights

* Add checkpoint conversion script

* Name conversion for electra models

* Update naming conventions according to preset names

* Fix failing tokenizer tests

* Update checkpoint conversion script according to kaggle

* Add validate function

* Kaggle preset

* update preset link

* Add electra presets

* Complete run_small_preset test for electra

* Add large variations of electra in presets

* Fix case issues with electra presets

* Fix format

---------

Co-authored-by: Matt Watson <mattdangerw@gmail.com>
* first draft

* update upload_preset

* lint

* consistent error messages

* lint
* Add multitoken stopping

* Update gemma_causal_lm.py

* Add further multitoken support

* Formatting

* Revert tokenizer changes

* Move multi token stop to generative task

* None check

* None check

* Error message

* Add stop_token_ids

* Util testing

* Fix sampler tests

* Add multitoken stop to all models

* Sampler multi token

* Formatting

* Tuple required

* Tuple docstring

* Pytorch GPU fix

* Numpy fix
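
A hedged usage sketch of the new argument (exact signature per the final PR; the preset name is illustrative):

```python
import keras_nlp

causal_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en")
# Generation halts as soon as any id in the tuple is produced. 50256 is
# GPT-2's <|endoftext|> id; a tuple is required even for one stop token.
causal_lm.generate(
    "The quick brown fox",
    max_length=64,
    stop_token_ids=(50256,),
)
```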
* Add lora example to GemmaCausalLM docstring.

* Address review.
* Add LLaMA Causal LM

* Add causal lm to the public API

* Update preset names and fix checkpoint script

* Fix discrepancies and add tests

* Add tests for CausalLM

* end_token -> stop_token_ids
This PR grew as I was writing it, and now adds a number of new features:

1. Exposed base classes. Sets us on a path for better documentation,
   a more "introspectable" library, and allows sub-classing.
2. Enable `from_preset()` on base classes for any subclass preset. This
   gives us similar functionality to "auto classes" in Hugging Face,
   without the extra overhead of needing a new symbol.
3. An ability to register new tasks/backbones/tokenizers from out-of-tree
   code with `keras.saving.register_keras_serializable()`.

Try a colab:
https://colab.research.google.com/gist/mattdangerw/da885f050fa8baef9b4f9a4ec68d6567/kerasnlp-base-classes.ipynb
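
For example, a small sketch of the base-class `from_preset()` flow (preset names illustrative):

```python
import keras_nlp

# Base classes can now load any subclass preset; the returned object is
# the matching concrete subclass, similar to Hugging Face "auto classes".
backbone = keras_nlp.models.Backbone.from_preset("bert_tiny_en_uncased")
causal_lm = keras_nlp.models.CausalLM.from_preset("gpt2_base_en")
```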
* Run the LLaMA RMS Layer Norm in float32

* Also use float32 in Mistral Layer Norm

* Address review comments

- Change private variables to public vars
- Change `self._weight` to `self.scale`
- Don't persist the input dim
- Move the var computation to its own line for readability

* Change weights to scale in layer norm
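
The pattern, roughly (a sketch of the idea, not the exact diff; assumes an RMS-norm layer like the one sketched earlier):

```python
from keras import ops

def call(self, x):
    # Run the norm math in float32 for numerical stability under
    # mixed precision, then cast back to the input dtype.
    input_dtype = x.dtype
    x = ops.cast(x, "float32")
    var = ops.mean(ops.square(x), axis=-1, keepdims=True)
    x = x * ops.rsqrt(var + self.epsilon)
    return ops.cast(x * self.scale, input_dtype)
```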
* Adds score API to GPT-2

* Addressing reviewer comments
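
A sketch of using the score API (mirrors the Mistral change above; argument names per these PRs, preset illustrative):

```python
import keras_nlp

causal_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en")
preprocessed = causal_lm.preprocessor.generate_preprocess(
    ["The quick brown fox"]
)
# scoring_mode="logits" returns per-token logits; "loss" instead returns
# per-token cross-entropy against target ids.
logits = causal_lm.score(
    token_ids=preprocessed["token_ids"],
    padding_mask=preprocessed["padding_mask"],
    scoring_mode="logits",
)
```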
…s-team#1523)

* Implement compute_output_spec() for tokenizers with vocabulary. (restarted from new point in master branch)

* Remove type annotation from compute_output_spec() in tokenizers
Currently Keras as a whole is not doing type annotations, but we still
have a few stragglers. Removing them as they occasionally cause
confusion.
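
A rough sketch of the shape of such a method (the actual implementation in the PR may differ):

```python
import keras

def compute_output_spec(self, input_spec):
    # A vocabulary-backed tokenizer maps strings to integer token ids,
    # so the output spec mirrors the input shape with the tokenizer's
    # compute dtype (e.g. "int32").
    return keras.KerasTensor(input_spec.shape, dtype=self.compute_dtype)
```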
…am#1540)

* Fix discrepancy between HF LLaMA and our implementation

* Fix Mistral transformer decoder
Bumps the python group with 2 updates: torch and torchvision.


Updates `torch` from 2.2.1+cu121 to 2.2.2+cu121

Updates `torchvision` from 0.17.1+cu121 to 0.17.2+cu121

---
updated-dependencies:
- dependency-name: torch
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: python
- dependency-name: torchvision
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: python
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>