
Kaggle Presets #1365

Merged
merged 27 commits into master from kaggle
Jan 4, 2024

Conversation

sampathweb (Collaborator)

No description provided.

@sampathweb sampathweb added the kokoro:force-run Runs Tests on GPU label Dec 12, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Dec 12, 2023
@nkovela1 nkovela1 added the kokoro:force-run Runs Tests on GPU label Dec 12, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Dec 12, 2023
@sampathweb sampathweb added the kokoro:force-run Runs Tests on GPU label Dec 13, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Dec 13, 2023
@nkovela1 nkovela1 (Collaborator) left a comment

Thank you for your help on the rebase and fixes here, Ramesh! LGTM!

Excited for the Kaggle integration and stable release!

@nkovela1 nkovela1 added the kokoro:force-run Runs Tests on GPU label Dec 15, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Dec 15, 2023
@nkovela1 nkovela1 added the kokoro:force-run Runs Tests on GPU label Dec 16, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Dec 16, 2023
@mattdangerw (Member)

Just a note: we should not merge this via the "Squash and merge" button. I will rebase and push it to master so we keep the commit history for the feature branch. That is useful for bisecting and for going from a commit to its PR via the link in the commit message.

mattdangerw and others added 16 commits January 4, 2024 11:00
* Demo preset directories

* Address comments

* Move packer and masker to build and debug tests

* Address comments

* Fix MaskedLM preprocessor test

* Fix remaining tests

* Fix serialization tests on tf backend

---------

Co-authored-by: Neel Kovelamudi <nkovela@google.com>
As part of this work, we need to also switch all downstream
preprocessing layers to create packers on build (instead of on call).
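As a rough illustration of the create-packers-on-build pattern, here is a minimal sketch with assumed names (the preprocessor class and token id attributes are illustrative, not the actual keras-nlp code):

```python
import keras
import keras_nlp

class DemoPreprocessor(keras.layers.Layer):
    def __init__(self, tokenizer, sequence_length=128, **kwargs):
        super().__init__(**kwargs)
        self.tokenizer = tokenizer
        self.sequence_length = sequence_length
        self.packer = None

    def build(self, input_shape):
        # The packer is created once on build, not on every call, so the
        # layer's state exists before the first batch flows through.
        self.packer = keras_nlp.layers.StartEndPacker(
            start_value=self.tokenizer.cls_token_id,
            end_value=self.tokenizer.sep_token_id,
            sequence_length=self.sequence_length,
        )
        self.built = True

    def call(self, x):
        return self.packer(self.tokenizer(x))
```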
…radigm (#1323)

* Convert SentencePiece tokenizer to save_assets/load_assets

* Convert albert to new assets paradigm

* Convert DebertaV3 to new assets paradigm

* Fix formatting issues

* Convert FNet to new assets paradigm

* Convert XLMRoberta to new assets paradigm

* Convert T5 Tokenizer to new assets paradigm

* Fix sentencepiece tokenizer config test

* Change set_vocabulary to set_proto

* Change proto to raw proto

* Change to proto_bytes
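A hedged sketch of the save_assets/load_assets paradigm these commits move toward; the class and the asset file name are assumptions for illustration, not the real keras-nlp implementation:

```python
import os

class DemoSentencePieceTokenizer:
    """Sketch only: the proto lives in an asset file, not in the config."""

    def __init__(self, proto_bytes=None):
        self.proto_bytes = proto_bytes

    def save_assets(self, dir_path):
        # Write the raw serialized proto alongside the saved model.
        with open(os.path.join(dir_path, "vocabulary.spm"), "wb") as f:
            f.write(self.proto_bytes)

    def load_assets(self, dir_path):
        # Restore tokenizer state from the asset directory on load.
        with open(os.path.join(dir_path, "vocabulary.spm"), "rb") as f:
            self.set_proto(f.read())

    def set_proto(self, proto_bytes):
        self.proto_bytes = proto_bytes
```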
* Add metadata and Albert preset utils test

* Add Bart bytepiece preset workflow test

* Add BERT WordPiece preset workflow test

* Parameterize tests, switch to classifier, address comments

* Address comments and nits

* Fix formatting

* Add large test marker
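For illustration, the shape of such a preset workflow test might look like the sketch below; the preset name is a real small BERT preset, but the exact assertion and test body are assumptions:

```python
import pytest
import keras_nlp

@pytest.mark.large  # Matches the "large test marker" mentioned above.
def test_classifier_preset_workflow():
    # Load a small preset end to end and run a single forward pass.
    classifier = keras_nlp.models.BertClassifier.from_preset(
        "bert_tiny_en_uncased",
        num_classes=2,
    )
    preds = classifier.predict(["the quick brown fox"])
    assert preds.shape == (1, 2)
```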
Now that vocab and merges are not config (they are assets state), we
need to remember to copy them over when cloning a tokenizer.

This failure is only showing up on GPU testing because of
#409
We were relying on the order of `os.listdir` previously, which I believe is not deterministic.

https://github.com/keras-team/keras-nlp/pull/1333/checks?check_run_id=19158555913
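The fix amounts to imposing an explicit order rather than trusting the filesystem; a minimal sketch:

```python
import os

def list_assets(dir_path):
    # os.listdir order is filesystem-dependent; sort for determinism.
    return [
        os.path.join(dir_path, name)
        for name in sorted(os.listdir(dir_path))
    ]
```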
In doing this we need to remove an error for the case where a user
would try to use a sequence_length longer than the supported max length
of the backbone.

preprocessor = BertPreprocessor.from_preset(
    "bert_base_uncased",
    sequence_length=1500,
)

Raising that error would require reaching into the backbone config to read out the max length.

Overall I think we should probably avoid cross-cutting dependencies like this; a preprocessor should not reach into a backbone config. Also, it is valid to want to use a model's vocab to preprocess at a longer sequence length than the backbone would allow (maybe you are using a custom model).

Instead we should probably try to make a friendly error message from
the backbone (or position embedding), if a sequence length is too long.
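A minimal sketch of what such a friendly error might look like in a position embedding, written against Keras 3's keras.ops and assuming a statically known sequence length (this is not the actual keras-nlp implementation):

```python
import keras

class DemoPositionEmbedding(keras.layers.Layer):
    def __init__(self, sequence_length, **kwargs):
        super().__init__(**kwargs)
        self.sequence_length = sequence_length

    def build(self, input_shape):
        self.position_embeddings = self.add_weight(
            name="embeddings",
            shape=(self.sequence_length, input_shape[-1]),
        )

    def call(self, inputs):
        length = inputs.shape[1]
        if length is not None and length > self.sequence_length:
            # Fail with an actionable message instead of a cryptic shape
            # mismatch deep inside the model.
            raise ValueError(
                f"Input sequence length {length} exceeds the maximum "
                f"length {self.sequence_length} supported by this layer."
            )
        embeddings = keras.ops.convert_to_tensor(self.position_embeddings)
        return inputs + embeddings[:length, :]
```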
These are not uploaded to Kaggle just yet, but will be shortly.
This is a newly minted feature in kagglehub.

We should probably add a unit test for this, but probably after our models are actually uploaded to the hub.
* Use the proper title for example

Otherwise it won't be rendered correctly by the [`TFKerasDocumentationGenerator`](https://github.com/keras-team/keras-io/blob/fc340b9989cdf17fba44e66efa22758afad39b87/scripts/docstrings.py#L27-L28)

* more fixes
Fixes names, and avoids running on import.
mattdangerw and others added 11 commits January 4, 2024 11:02
Not currently needed for anything, just to keep in sync with KerasCV.
We have a bug where weights.h5 for a functional model will read and
write to the wrong paths in TF 2.13 and 2.14. We can work around this
for these versions (while thankfully needing none of this for Keras 3).
We would also have caught this if we were running our large testing on 2.13
or 2.14. TensorFlow will turn all dictionary attributes on a layer into
"trackable dicts"; Python's json module does not know how to serialize these,
so we need to cast our byte pair vocabulary to a plain dict before writing the JSON.
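The workaround amounts to casting back to a built-in dict before serializing; a sketch with an assumed attribute name:

```python
import json

def save_vocabulary(layer, path):
    # TensorFlow wraps dict attributes in a trackable wrapper that the
    # json module cannot serialize, so cast to a plain dict first.
    # `layer.vocabulary` is an assumed name for illustration.
    vocabulary = dict(layer.vocabulary)
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(vocabulary))
```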
#1367)

* Change encoding to utf-8

* Change encoding to utf-8

* Change encoding to utf-8
* Fix layer checkpoint deps

* Delete layer checkpoint deps

* Change to empty list
* Allow custom preprocessor

* Allow custom preprocessor

* Fix preprocessor validation in tasks
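For illustration, this enables usage along the following lines; the preset name is a real BERT preset, but treat the exact call as a sketch:

```python
import keras_nlp

# Build a preprocessor with a non-default sequence length, then hand it
# to the task instead of letting from_preset construct the default one.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_base_en_uncased",
    sequence_length=256,
)
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_base_en_uncased",
    preprocessor=preprocessor,
    num_classes=2,
)
```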
The functional model attribute path should always take precedence, so
we can have stable checkpoints for functional subclassed models.

See keras-team/keras#18982

Note that this will cause a bunch of failures until we re-upload weights.

We should apply this workaround to both Keras 3 and Keras 2 until we release
Keras 3, then restrict it to Keras 2 only, and finally delete it entirely
when we drop Keras 2 support.
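A hedged sketch of staging that plan behind a version gate (the helper name is hypothetical):

```python
import keras

def needs_h5_path_workaround():
    # Keras 3 reports a "3.x" version string; apply the workaround only
    # under Keras 2, and delete this branch when Keras 2 support is dropped.
    return keras.__version__.startswith("2.")
```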
@mattdangerw mattdangerw merged commit 401e569 into master Jan 4, 2024
5 of 13 checks passed
@mattdangerw mattdangerw deleted the kaggle branch January 26, 2024 02:38