Fix BytePair special tokens tokenization #1447

Closed
wants to merge 70 commits

Conversation

abuelnasr0
Contributor

BytePairTokenizer already tokenizes special tokens, but it had a small nit, explained in #1435.
This PR fixes it.
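
For context, a rough usage sketch of the behavior this concerns (the vocabulary and merges here are hypothetical, made up purely for illustration; `unsplittable_tokens` is the existing argument for keeping special tokens whole):

```python
import keras_nlp

# Hypothetical vocabulary/merges for illustration; real usage would load
# them from a preset or a training run.
vocab = {"<|endoftext|>": 0, "a": 1, "Ġquick": 2, "Ġfox": 3}
merges = ["Ġ q", "u i", "c k", "ui ck", "Ġq uick", "Ġ f", "o x", "Ġf ox"]

tokenizer = keras_nlp.tokenizers.BytePairTokenizer(
    vocabulary=vocab,
    merges=merges,
    unsplittable_tokens=["<|endoftext|>"],  # kept whole during splitting
)
# "<|endoftext|>" should map to a single id rather than being split
# into sub-word pieces.
tokenizer("a quick fox<|endoftext|>")
```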

@mattdangerw
Member

Thanks very much @abuelnasr0! Finally freeing up from our Gemma release. I'll try to review #1447, #1445, and #1397 as a set, but just a heads up that I'll probably post feedback next week.

In the meantime, if you are looking for something to do, we still need BloomCausalLM. I'm hoping to do some refactoring (#1425) that will make adding generative classes way easier, but no need to block on that.

@abuelnasr0
Contributor Author

@mattdangerw no problem, take your time. The Gemma release was awesome work from you and the team.
BloomCausalLM is already in my plans, but I have been a little busy. I started adding it a few days ago and will continue today. Maybe I will open a PR today.

mattdangerw and others added 18 commits April 2, 2024 22:16
We will update our samplers in the near future to push the backend-specific
compilation details out: keras-team#1425

Also in general, we want our documentation to reflect the main usage of
our classes, which is using them with Seq2SeqLM and CausalLM classes.

So with that in mind, this updates our sampler docs to show the
practical usage of the sampling classes with our modeling classes. For
the base class, we show the main use case of overriding the
`get_next_token()` function.
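
For instance, a minimal sketch of a custom sampler in that style (not the canonical docstring example; the preset name is illustrative):

```python
from keras import ops
import keras_nlp


class ArgmaxSampler(keras_nlp.samplers.Sampler):
    """A sampler that always takes the most likely token."""

    def get_next_token(self, probabilities):
        # `probabilities` is (batch_size, vocab_size); pick the argmax.
        return ops.argmax(probabilities, axis=-1)


causal_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en")
causal_lm.compile(sampler=ArgmaxSampler())
causal_lm.generate("The quick brown fox")
```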
The Keras implementation of the Gemma model was the effort of a number of
contributors:

- Initial architecture: Gabriel Rasskin, Francois Chollet, Matt Watson
- Model parallelism: Qianli Scott Zhu
- Model export for inference: Neel Kovelamudi
- Lora implementation: Francois Chollet, Samaneh Saadat
- Benchmarking: Haifeng Jin
- Interpretability extensions: Ryan Mullins
- Testing infrastructure: Ramesh Sampath

Many more helped with documentation and Kaggle integration.

Co-authored-by: Francois Chollet <francois.chollet@gmail.com>
Co-authored-by: Gabriel Rasskin <43894452+grasskin@users.noreply.github.com>
Co-authored-by: Qianli Scott Zhu <scottzhu@google.com>
Co-authored-by: Neel Kovelamudi <60985914+nkovela1@users.noreply.github.com>
Co-authored-by: Samaneh Saadat <ssaadat@google.com>
Co-authored-by: Haifeng Jin <5476582+haifeng-jin@users.noreply.github.com>
Co-authored-by: Ramesh Sampath <1437573+sampathweb@users.noreply.github.com>
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
Includes some small cleanups for the Kaggle assets.
…as-team#1471)

* Add docstring for conversion script install instructions

* Add docstring to verification script

* Change wording
We can skip these by default, for users who have not yet set them up.
We will need to set them up for CI, see
keras-team#1459
0.8 is out! We can consider our master branch a 0.9 preview.
Hi wonderful Keras folks,

I was browsing the new Gemma source and noticed that the RMSNorm code didn't use the epsilon parameter it takes in. This fixes that.

While we're here, I'm curious what drove the 1+scale multiplier (instead of just initializing scale to 1). Would love to learn if you're down to share.

Thanks,
Chris
(ex-Googler)
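For reference, a minimal sketch of the pattern being discussed (not the actual Gemma source), with the epsilon wired through and the (1 + scale) multiplier:

```python
import keras
from keras import ops


class RMSNorm(keras.layers.Layer):
    def __init__(self, epsilon=1e-6, **kwargs):
        super().__init__(**kwargs)
        self.epsilon = epsilon

    def build(self, input_shape):
        # Initialized to zeros because the layer multiplies by (1 + scale),
        # which behaves like a plain scale initialized to ones.
        self.scale = self.add_weight(
            name="scale", shape=(input_shape[-1],), initializer="zeros"
        )

    def call(self, x):
        # The bug being fixed: a hard-coded constant was used here
        # instead of the configured self.epsilon.
        var = ops.mean(ops.square(x), axis=-1, keepdims=True)
        x = x * ops.rsqrt(var + self.epsilon)
        return x * (1 + self.scale)
```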
* Add Falcon backbone.

* Add docstring.

* Add dtype.

* Add checkpoint conversion script.

* Fix tests.

* Random fixes.

* Add cache.

* Cast cumsum to int32.

* Make sublayers public.

* Address backbone comments.

* Update attention computation to use einsum.

* Falcon only works with Keras3.

* Fix tests.

* Remove falcon_causal_lm file.

* Remove commented/unused code.
* CI - Add kaggle creds to pull model

* add kaggle env variables

* Kaggle env:

* Kaggle env:

* Kaggle env:

* Kaggle env:

* Update Build script for Kokoro

* Add Kaggle env var

* set gemma preset to extra_large

* Change Gemma small preset to bfloat16

* Change Gemma small preset to xlarge
* Fix dtype accessors of tasks/backbones

* Address comments, minor fixes
sachinprasadhs and others added 26 commits April 2, 2024 22:16
* Docs(layers): add a description for `tie_weights` argument

* Refactor(layers): make `name` an explicit argument for Transformer layers

* Refactor(layers): remove explicit usage of `name` in `__init__` calls

* Docs(layers): remove references to `name` and consistently document `**kwargs`
…s-team#1397)

* Support tokenization of special tokens for word_piece_tokenizer

* Add the feature to models tokenizers

* Format the code

* Fix Format

* Small fixes

* Add tests for bert

* Add tests for distilbert

* Small fix for bert test

* Add tests for electra

* Fix code format

* Rename unsplittable to special

* Edit special_tokens Arg

* Format the code

* Move special tokens checking into base class

* Add special_tokens_in_strings Arg

* Shorten comments

* Shorten comments

* Shorten the logic of splitting and add comments

* Code format
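
A short usage sketch of the resulting API (argument names as added in this commit series; the vocabulary is illustrative):

```python
import keras_nlp

vocab = ["[UNK]", "[CLS]", "[SEP]", "the", "quick", "brown", "fox"]
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    special_tokens=["[CLS]", "[SEP]"],
    # Allow special tokens to appear in raw input strings and be
    # tokenized to single ids rather than split into sub-words.
    special_tokens_in_strings=True,
)
tokenizer("[CLS] the quick brown fox [SEP]")
```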
* Initial Kaggle upload.

* Address review comments.

* Add upload validations.

* Address review comments.

* Fix init.

* Address review comments.

* Improve error handling.

* Address review comments.
* Add scoring mode to MistralCausalLM

* Fixing names in Docstring

* Fix padding mask arg name

* Fix embedded shape in test

* Remove errant underscore in Docstring
* Add Kaggle upload validation tests.

* Use bert_tiny as test model.
…1384)

* Added ElectraBackbone

* Added backbone tests for ELECTRA

* Fix config

* Add model import to __init__

* add electra tokenizer

* add tests for tokenizer

* add __init__ file

* add tokenizer and backbone to models __init__

* Fix Failing tokenization test

* Add example on usage of the tokenizer with custom vocabulary

* Add conversion script to convert weights from checkpoint

* Add electra preprocessor

* Add presets and tests

* Add presets config with model weights

* Add checkpoint conversion script

* Name conversion for electra models

* Update naming conventions according to preset names

* Fix failing tokenizer tests

* Update checkpoint conversion script according to kaggle

* Add validate function

* Kaggle preset

* update preset link

* Add electra presets

* Complete run_small_preset test for electra

* Add large variations of electra in presets

* Fix case issues with electra presets

* Fix format

---------

Co-authored-by: Matt Watson <mattdangerw@gmail.com>
* first draft

* update upload_preset

* lint

* consistent error messages

* lint
* Add multitoken stopping

* Update gemma_causal_lm.py

* Add further multitoken support

* Formatting

* Revert tokenizer changes

* Move multi token stop to generative task

* None check

* None check

* Error message

* Add stop_token_ids

* Util testing

* Fix sampler tests

* Add multitoken stop to all models

* Sampler multi token

* Formatting

* Tuple required

* Tuple docstring

* Pytorch GPU fix

* Numpy fix
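
A hedged usage sketch of the new argument (exact signature per the final PR; the preset name is illustrative):

```python
import keras_nlp

causal_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en")
# Generation halts as soon as any id in the tuple is produced. 50256 is
# GPT-2's <|endoftext|> id; a tuple is required even for one stop token.
causal_lm.generate(
    "The quick brown fox",
    max_length=64,
    stop_token_ids=(50256,),
)
```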
* Add lora example to GemmaCausalLM docstring.

* Address review.
* Add LLaMA Causal LM

* Add causal lm to the public API

* Update preset names and fix checkpoint script

* Fix discrepancies and add tests

* Add tests for CausalLM

* end_token -> stop_token_ids
This PR grew as I was writing it, and now adds a number of new features:

1. Exposed base classes. Sets us on a path for better documentation,
   a more "introspectable" library, and allows sub-classing.
2. Enable `from_preset()` on base classes for any subclass preset. This
   gives us similar functionality to "auto classes" in Hugging Face,
   without the extra overhead of needing a new symbol.
3. An ability to register new tasks/backbones/tokenizers from out-of-tree
   code with `keras.saving.register_keras_serializable()`.

Try a colab:
https://colab.research.google.com/gist/mattdangerw/da885f050fa8baef9b4f9a4ec68d6567/kerasnlp-base-classes.ipynb
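
For example, a small sketch of the base-class `from_preset()` flow (preset names illustrative):

```python
import keras_nlp

# Base classes can now load any subclass preset; the returned object is
# the matching concrete subclass, similar to Hugging Face "auto classes".
backbone = keras_nlp.models.Backbone.from_preset("bert_tiny_en_uncased")
causal_lm = keras_nlp.models.CausalLM.from_preset("gpt2_base_en")
```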
* Run the LLaMA RMS Layer Norm in float32

* Also use float32 in Mistral Layer Norm

* Address review comments

- Change private variables to public vars
- Change `self._weight` to `self.scale`
- Don't persist the input dim
- Move the var computation to its own line for readability

* Change weights to scale in layer norm
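
The pattern, roughly (a sketch of the idea, not the exact diff; assumes an RMS-norm layer like the one sketched earlier):

```python
from keras import ops

def call(self, x):
    # Run the norm math in float32 for numerical stability under
    # mixed precision, then cast back to the input dtype.
    input_dtype = x.dtype
    x = ops.cast(x, "float32")
    var = ops.mean(ops.square(x), axis=-1, keepdims=True)
    x = x * ops.rsqrt(var + self.epsilon)
    return ops.cast(x * self.scale, input_dtype)
```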
* Adds score API to GPT-2

* Addressing reviewer comments
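
A sketch of using the score API (mirrors the Mistral change above; argument names per these PRs, preset illustrative):

```python
import keras_nlp

causal_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en")
preprocessed = causal_lm.preprocessor.generate_preprocess(
    ["The quick brown fox"]
)
# scoring_mode="logits" returns per-token logits; "loss" instead returns
# per-token cross-entropy against target ids.
logits = causal_lm.score(
    token_ids=preprocessed["token_ids"],
    padding_mask=preprocessed["padding_mask"],
    scoring_mode="logits",
)
```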
…s-team#1523)

* Implement compute_output_spec() for tokenizers with vocabulary. (restarted from new point in master branch)

* Remove type annotation from compute_output_spec() in tokenizers
Currently Keras as a whole is not doing type annotations, but we still
have a few stragglers. Removing them as they occasionally cause
confusion.
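
A rough sketch of the shape of such a method (the actual implementation in the PR may differ):

```python
import keras

def compute_output_spec(self, input_spec):
    # A vocabulary-backed tokenizer maps strings to integer token ids,
    # so the output spec mirrors the input shape with the tokenizer's
    # compute dtype (e.g. "int32").
    return keras.KerasTensor(input_spec.shape, dtype=self.compute_dtype)
```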
…am#1540)

* Fix discrepancy between HF LLaMA and our implementation

* Fix Mistral transformer decoder
Bumps the python group with 2 updates: torch and torchvision.


Updates `torch` from 2.2.1+cu121 to 2.2.2+cu121

Updates `torchvision` from 0.17.1+cu121 to 0.17.2+cu121

---
updated-dependencies:
- dependency-name: torch
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: python
- dependency-name: torchvision
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: python
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>