Kaggle Presets #1365
Conversation
Thank you for your help on the rebase and fixes here, Ramesh! LGTM!
Excited for the Kaggle integration and stable release!
Just a note: we should not merge this via the "Squash and merge" button. I will rebase and push it to master so we can keep our commit history for the feature branch. That history is useful for bisecting and for going from commit -> PR via the link in each commit message.
* Demo preset directories
* Address comments
* Move packer and masker to build and debug tests
* Address comments
* Fix maskedLM preprocessor test
* Fix remaining tests
* Fix serialization tests on tf backend

---------
Co-authored-by: Neel Kovelamudi <nkovela@google.com>
As part of this work, we also need to switch all downstream preprocessing layers to create packers on build (instead of on call); a rough sketch of the pattern follows.
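A minimal sketch of that pattern, assuming a hypothetical `DemoPreprocessor` layer and illustrative `[CLS]`/`[SEP]` special tokens (not the library's actual classes):

```python
import keras_nlp
from tensorflow import keras

class DemoPreprocessor(keras.layers.Layer):
    """Hypothetical layer showing the create-packer-on-build pattern."""

    def __init__(self, tokenizer, sequence_length=512, **kwargs):
        super().__init__(**kwargs)
        self.tokenizer = tokenizer
        self.sequence_length = sequence_length
        self.packer = None

    def build(self, input_shape):
        # Create the packer at build time rather than on the first call,
        # so its configuration is fixed once the layer is built.
        self.packer = keras_nlp.layers.StartEndPacker(
            start_value=self.tokenizer.token_to_id("[CLS]"),
            end_value=self.tokenizer.token_to_id("[SEP]"),
            sequence_length=self.sequence_length,
        )
        self.built = True

    def call(self, x):
        return self.packer(self.tokenizer(x))
```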
…radigm (#1323)

* Convert SentencePiece tokenizer to save_assets/load_assets
* Convert albert to new assets paradigm
* Convert DebertaV3 to new assets paradigm
* Fix formatting issues
* Convert FNet to new assets paradigm
* Convert XLMRoberta to new assets paradigm
* Convert T5 Tokenizer to new assets paradigm
* Fix sentencepiece tokenizer config test
* Change set_vocabulary to set_proto
* Change proto to raw proto
* Change to proto_bytes
* Add metadata and Albert preset utils test
* Add Bart bytepiece preset workflow test
* Add BERT WordPiece preset workflow test
* Parameterize tests, switch to classifier, address comments
* Address comments and nits
* Fix formatting
* Add large test marker
Now that vocab and merges are not config (they are assets state), we need to remember to copy them over when cloning a tokenizer. This failure only shows up in GPU testing because of #409.
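A rough sketch of the pitfall and the fix, assuming the byte pair tokenizer API from this PR series (`set_vocabulary_and_merges` and the `vocabulary`/`merges` attributes); the preset name is just an example:

```python
import keras_nlp

tokenizer = keras_nlp.models.GPT2Tokenizer.from_preset("gpt2_base_en")
# Cloning via config alone drops the assets, since vocab and merges
# no longer live in the config.
clone = keras_nlp.models.GPT2Tokenizer.from_config(tokenizer.get_config())
# So copy the assets state over explicitly.
clone.set_vocabulary_and_merges(tokenizer.vocabulary, tokenizer.merges)
```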
We were previously relying on the order of `os.listdir`, which I believe is not guaranteed to be deterministic. https://github.com/keras-team/keras-nlp/pull/1333/checks?check_run_id=19158555913
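A one-line illustration of the fix, with `preset_dir` as a placeholder path:

```python
import os

preset_dir = "model_preset"  # placeholder path
# `os.listdir` returns entries in arbitrary order; sort for determinism.
filenames = sorted(os.listdir(preset_dir))
```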
In doing this we need to remove the error raised when a user tries to use a `sequence_length` longer than the max length supported by the backbone:

```python
preprocessor = BertPreprocessor.from_preset(
    "bert_base_uncased",
    sequence_length=1500,
)
```

We would do this by reaching into the backbone config to read out the max length. Overall I think we should probably avoid cross-cutting dependencies like this; a preprocessor should not reach into a backbone config. It is also valid to want to use the vocab of a model to preprocess at a longer sequence length than the backbone would allow (maybe you are using a custom model). Instead we should probably raise a friendly error message from the backbone (or position embedding) if a sequence length is too long.
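A hypothetical sketch of what that friendlier error could look like, raised from the position embedding itself (`BoundedPositionEmbedding` is illustrative, not an existing layer):

```python
from tensorflow import keras

class BoundedPositionEmbedding(keras.layers.Layer):
    """Illustrative position embedding that names its length constraint."""

    def __init__(self, sequence_length, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.sequence_length = sequence_length
        self.output_dim = output_dim

    def build(self, input_shape):
        self.position_embeddings = self.add_weight(
            name="embeddings",
            shape=(self.sequence_length, self.output_dim),
        )

    def call(self, inputs):
        seq_len = inputs.shape[-2]
        if seq_len is not None and seq_len > self.sequence_length:
            # The error comes from the layer that owns the constraint,
            # not from a preprocessor peeking at backbone config.
            raise ValueError(
                f"Input sequence length {seq_len} exceeds the maximum "
                f"length {self.sequence_length} supported by this "
                "position embedding."
            )
        return self.position_embeddings[:seq_len, :]
```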
These are not uploaded to Kaggle just yet, but will be shortly.
This is a newly minted feature in kagglehub. We should probably add a unit test for this, but probably after our models are actually uploaded to the hub.
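For reference, a rough sketch of how the download flow could be exercised in such a test; the model handle below is a placeholder, not a published preset:

```python
import kagglehub

# Download preset files from Kaggle; returns a local directory path.
path = kagglehub.model_download("keras/bert/keras/bert_base_en")
print("Preset files downloaded to:", path)
```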
* Use the proper title for example

  Otherwise it won't be rendered correctly by the [`TFKerasDocumentationGenerator`](https://github.com/keras-team/keras-io/blob/fc340b9989cdf17fba44e66efa22758afad39b87/scripts/docstrings.py#L27-L28)

* more fixes
Fixes names, and avoids running on import.
Not currently needed for anything, just to keep in sync with KerasCV.
We have a bug where weights.h5 for a functional model will read and write to the wrong paths in TF 2.13 and 2.14. We can work around this for these versions (while thankfully needing none of this for Keras 3).
We would also catch this if we were running our large tests on 2.13 or 2.14. TensorFlow will turn all dictionary attributes on a layer into "trackable dicts"; Python's `json` module does not know how to serialize these, so we need to cast our byte pair vocabulary to a plain dict before writing the JSON.
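A minimal sketch of the cast, where `save_vocabulary` and the `tokenizer.vocabulary` attribute are illustrative stand-ins:

```python
import json

def save_vocabulary(tokenizer, path):
    # TF wraps dict attributes on a layer in a trackable wrapper that
    # `json.dump` cannot serialize; cast back to a plain dict first.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dict(tokenizer.vocabulary), f, ensure_ascii=False)
```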
#1367)

* Change encoding to utf-8
* Change encoding to utf-8
* Change encoding to utf-8
* Fix layer checkpoint deps
* Delete layer checkpoint deps
* Change to empty list
* Allow custom preprocessor
* Allow custom preprocessor
* Fix preprocessor validation in tasks
The functional model attribute path should always take precedence, so we can have stable checkpoints for functional subclassed models. See keras-team/keras#18982.

Note that this will cause a bunch of failures until we re-upload weights. We should apply this workaround to both Keras 3 and Keras 2 until we release Keras 3, then restrict it to Keras 2 only, and finally delete it entirely when we drop Keras 2 support.