
Kaggle Presets #1365

Merged
merged 27 commits into master from kaggle
Jan 4, 2024

Conversation

sampathweb (Collaborator)

No description provided.

@sampathweb sampathweb added the kokoro:force-run Runs Tests on GPU label Dec 12, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Dec 12, 2023
@nkovela1 nkovela1 added the kokoro:force-run Runs Tests on GPU label Dec 12, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Dec 12, 2023
@sampathweb sampathweb added the kokoro:force-run Runs Tests on GPU label Dec 13, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Dec 13, 2023
@nkovela1 nkovela1 (Collaborator) left a comment

Thank you for your help on the rebase and fixes here, Ramesh! LGTM!

Excited for the Kaggle integration and stable release!

@nkovela1 nkovela1 added the kokoro:force-run Runs Tests on GPU label Dec 15, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Dec 15, 2023
@nkovela1 nkovela1 added the kokoro:force-run Runs Tests on GPU label Dec 16, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Dec 16, 2023
@mattdangerw (Member)

Just a note: we should not merge this via the "Squash and merge" button. I will rebase and push it to master so we keep the commit history for the feature branch. That is useful for bisecting and for going from a commit to its PR via the link in the commit message.

mattdangerw and others added 16 commits January 4, 2024 11:00
* Demo preset directories

* Address comments

* Move packer and masker to build and debug tests

* Address comments

* Fix MaskedLM preprocessor test

* Fix remaining tests

* Fix serialization tests on tf backend

---------

Co-authored-by: Neel Kovelamudi <nkovela@google.com>
As part of this work, we need to also switch all downstream
preprocessing layers to create packers on build (instead of on call).
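As a rough illustration of the create-packers-on-build pattern, here is a minimal sketch with assumed names (the preprocessor class and token id attributes are illustrative, not the actual keras-nlp code):

```python
import keras
import keras_nlp

class DemoPreprocessor(keras.layers.Layer):
    def __init__(self, tokenizer, sequence_length=128, **kwargs):
        super().__init__(**kwargs)
        self.tokenizer = tokenizer
        self.sequence_length = sequence_length
        self.packer = None

    def build(self, input_shape):
        # The packer is created once on build, not on every call, so the
        # layer's state exists before the first batch flows through.
        self.packer = keras_nlp.layers.StartEndPacker(
            start_value=self.tokenizer.cls_token_id,
            end_value=self.tokenizer.sep_token_id,
            sequence_length=self.sequence_length,
        )
        self.built = True

    def call(self, x):
        return self.packer(self.tokenizer(x))
```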
…radigm (#1323)

* Convert SentencePiece tokenizer to save_assets/load_assets

* Convert albert to new assets paradigm

* Convert DebertaV3 to new assets paradigm

* Fix formatting issues

* Convert FNet to new assets paradigm

* Convert XLMRoberta to new assets paradigm

* Convert T5 Tokenizer to new assets paradigm

* Fix sentencepiece tokenizer config test

* Change set_vocabulary to set_proto

* Change proto to raw proto

* Change to proto_bytes
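A hedged sketch of the save_assets/load_assets paradigm these commits move toward; the class and the asset file name are assumptions for illustration, not the real keras-nlp implementation:

```python
import os

class DemoSentencePieceTokenizer:
    """Sketch only: the proto lives in an asset file, not in the config."""

    def __init__(self, proto_bytes=None):
        self.proto_bytes = proto_bytes

    def save_assets(self, dir_path):
        # Write the raw serialized proto alongside the saved model.
        with open(os.path.join(dir_path, "vocabulary.spm"), "wb") as f:
            f.write(self.proto_bytes)

    def load_assets(self, dir_path):
        # Restore tokenizer state from the asset directory on load.
        with open(os.path.join(dir_path, "vocabulary.spm"), "rb") as f:
            self.set_proto(f.read())

    def set_proto(self, proto_bytes):
        self.proto_bytes = proto_bytes
```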
* Add metadata and Albert preset utils test

* Add Bart bytepiece preset workflow test

* Add BERT WordPiece preset workflow test

* Parameterize tests, switch to classifier, address comments

* Address comments and nits

* Fix formatting

* Add large test marker
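For illustration, the shape of such a preset workflow test might look like the sketch below; the preset name is a real small BERT preset, but the exact assertion and test body are assumptions:

```python
import pytest
import keras_nlp

@pytest.mark.large  # Matches the "large test marker" mentioned above.
def test_classifier_preset_workflow():
    # Load a small preset end to end and run a single forward pass.
    classifier = keras_nlp.models.BertClassifier.from_preset(
        "bert_tiny_en_uncased",
        num_classes=2,
    )
    preds = classifier.predict(["the quick brown fox"])
    assert preds.shape == (1, 2)
```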
Now that vocab and merges are not config (they are assets state), we
need to remember to copy them over when cloning a tokenizer.

This failure is only showing up on GPU testing because of
#409
We were relying on the order of `os.listdir` previously, which I believe is not deterministic.

https://github.com/keras-team/keras-nlp/pull/1333/checks?check_run_id=19158555913
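The fix amounts to imposing an explicit order rather than trusting the filesystem; a minimal sketch:

```python
import os

def list_assets(dir_path):
    # os.listdir order is filesystem-dependent; sort for determinism.
    return [
        os.path.join(dir_path, name)
        for name in sorted(os.listdir(dir_path))
    ]
```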
In doing this we need to remove an error for the case where a user
would try to use a sequence_length longer than the supported max length
of the backbone.

preprocessor = BertPreprocessor.from_preset(
    "bert_base_uncased",
    sequence_length=1500,
)

Raising that error would require reaching into the backbone config to read out the max length.

Overall I think we should probably avoid cross-cutting dependencies like this; a preprocessor should not reach into a backbone config. Also, it is valid to want to use a model's vocab to preprocess at a longer sequence length than the backbone would allow (maybe you are using a custom model).

Instead we should probably try to make a friendly error message from
the backbone (or position embedding), if a sequence length is too long.
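A minimal sketch of what such a friendly error might look like in a position embedding, written against Keras 3's keras.ops and assuming a statically known sequence length (this is not the actual keras-nlp implementation):

```python
import keras

class DemoPositionEmbedding(keras.layers.Layer):
    def __init__(self, sequence_length, **kwargs):
        super().__init__(**kwargs)
        self.sequence_length = sequence_length

    def build(self, input_shape):
        self.position_embeddings = self.add_weight(
            name="embeddings",
            shape=(self.sequence_length, input_shape[-1]),
        )

    def call(self, inputs):
        length = inputs.shape[1]
        if length is not None and length > self.sequence_length:
            # Fail with an actionable message instead of a cryptic shape
            # mismatch deep inside the model.
            raise ValueError(
                f"Input sequence length {length} exceeds the maximum "
                f"length {self.sequence_length} supported by this layer."
            )
        embeddings = keras.ops.convert_to_tensor(self.position_embeddings)
        return inputs + embeddings[:length, :]
```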
These are not uploaded to Kaggle just yet, but will be shortly.
This is a newly minted feature in kagglehub.

We should probably add a unit test for this, but probably after our models are actually uploaded to the hub.
* Use the proper title for example

Otherwise it won't be rendered correctly by the [`TFKerasDocumentationGenerator`](https://github.com/keras-team/keras-io/blob/fc340b9989cdf17fba44e66efa22758afad39b87/scripts/docstrings.py#L27-L28)

* more fixes
Fixes names, and avoids running on import.
mattdangerw and others added 11 commits January 4, 2024 11:02
Not currently needed for anything, just to keep in sync with KerasCV.
We have a bug where weights.h5 for a functional model will read and
write to the wrong paths in TF 2.13 and 2.14. We can work around this
for these versions (while thankfully needing none of this for Keras 3).
We would also have caught this if we were running our large testing on 2.13
or 2.14. TensorFlow will turn all dictionary attributes on a layer into
"trackable dicts"; Python's json module does not know how to serialize these,
so we need to cast our byte pair vocabulary to a plain dict before writing the JSON.
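The workaround amounts to casting back to a built-in dict before serializing; a sketch with an assumed attribute name:

```python
import json

def save_vocabulary(layer, path):
    # TensorFlow wraps dict attributes in a trackable wrapper that the
    # json module cannot serialize, so cast to a plain dict first.
    # `layer.vocabulary` is an assumed name for illustration.
    vocabulary = dict(layer.vocabulary)
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(vocabulary))
```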
#1367)

* Change encoding to utf-8

* Change encoding to utf-8

* Change encoding to utf-8
* Fix layer checkpoint deps

* Delete layer checkpoint deps

* Change to empty list
* Allow custom preprocessor

* Allow custom preprocessor

* Fix preprocessor validation in tasks
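For illustration, this enables usage along the following lines; the preset name is a real BERT preset, but treat the exact call as a sketch:

```python
import keras_nlp

# Build a preprocessor with a non-default sequence length, then hand it
# to the task instead of letting from_preset construct the default one.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_base_en_uncased",
    sequence_length=256,
)
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_base_en_uncased",
    preprocessor=preprocessor,
    num_classes=2,
)
```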
The functional model attribute path should always take precedence, so
we can have stable checkpoints for functional subclassed models.

See keras-team/keras#18982

Note that this will cause a bunch of failures until we re-upload weights.

We should apply this workaround to both Keras 3 and Keras 2 until we release
Keras 3, then restrict it to Keras 2 only, and finally delete it entirely
when we drop Keras 2 support.
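A hedged sketch of staging that plan behind a version gate (the helper name is hypothetical):

```python
import keras

def needs_h5_path_workaround():
    # Keras 3 reports a "3.x" version string; apply the workaround only
    # under Keras 2, and delete this branch when Keras 2 support is dropped.
    return keras.__version__.startswith("2.")
```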
@mattdangerw mattdangerw merged commit 401e569 into master Jan 4, 2024
5 of 13 checks passed
@mattdangerw mattdangerw deleted the kaggle branch January 26, 2024 02:38