Allow passing kwargs through to TFBertTokenizer #24324

Merged: 1 commit into main from allow_tf_tokenizer_kwargs on Jun 20, 2023

Conversation

@Rocketknight1 (Member) commented Jun 16, 2023

There are some kwargs, like preserve_unused_tokens in the underlying TF tokenizer layers, that might be useful to expose to users. This PR exposes them by passing any kwargs that the model __init__ does not recognize through to the TF tokenizer layer.
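
A minimal sketch of the pass-through pattern (stand-in class names, not the actual transformers or tensorflow_text code):

```python
# Minimal sketch of the kwargs pass-through described above.
# `TFBertTokenizerSketch` and `BertTokenizerLayerStandIn` are illustrative
# stand-ins, not the real transformers / tensorflow_text classes.

class BertTokenizerLayerStandIn:
    """Pretend TF tokenizer layer with a layer-specific option."""

    def __init__(self, vocab_list, preserve_unused_tokens=False):
        self.vocab_list = vocab_list
        self.preserve_unused_tokens = preserve_unused_tokens


class TFBertTokenizerSketch:
    """Wrapper: known arguments are handled here, everything else passes through."""

    def __init__(self, vocab_list, do_lower_case=True, **tokenizer_kwargs):
        self.do_lower_case = do_lower_case
        # Any kwarg this __init__ does not recognize (e.g. preserve_unused_tokens)
        # goes straight to the wrapped tokenizer layer.
        self.tf_tokenizer = BertTokenizerLayerStandIn(vocab_list, **tokenizer_kwargs)


tok = TFBertTokenizerSketch(["[UNK]", "hello"], preserve_unused_tokens=True)
assert tok.tf_tokenizer.preserve_unused_tokens
```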

Fixes #23798

@HuggingFaceDocBuilderDev commented Jun 16, 2023

The documentation is not available anymore as the PR was closed or merged.

@Rocketknight1 (Member, Author) commented Jun 19, 2023

Ping @amyeroberts for core maintainer review now that the extra functionality is working fine (see issue #23798)

@Rocketknight1 requested review from sgugger and amyeroberts and removed the review request for sgugger on June 19, 2023 at 13:14
@amyeroberts (Collaborator) left a comment

Thanks for adding!

In general, we definitely don't want to add more **kwargs arguments. As this is a wrapper for a TF-implemented tokenizer, I think it's OK here. The only niggle is that it might result in some conflicting behaviour, e.g. does FastBertTokenizer take any arguments that are equivalent to truncation?

@Rocketknight1 (Member, Author)

I was a bit wary about the kwargs thing too - FastBertTokenizer and BertTokenizerLayer actually have wildly different arguments, so the kwargs you need depend entirely on which one you're using. Still, I think it's fine for an advanced use case: we're just trying to enable some power-user behaviours without forcing people to edit the library source, and I'd prefer something general like this over specifically exposing the options I think people need (I didn't even realize in advance that the preserve_unused arg would be valuable!).

Anyway, merging!
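
For illustration, a hedged usage sketch of the behaviour discussed above. It assumes that from_pretrained forwards extra keyword arguments to __init__, that use_fast_bert_tokenizer=False selects the BertTokenizerLayer backend, and that preserve_unused_tokens (the name used in the PR description) is accepted by that backend; adjust to whichever arguments your chosen backend actually takes.

```python
from transformers import TFBertTokenizer

# Hedged example, not a guaranteed recipe: with this PR, any kwarg that
# TFBertTokenizer.__init__ does not recognize is handed to the underlying
# TF tokenizer layer, so the valid names depend on the backend in use.
tf_tokenizer = TFBertTokenizer.from_pretrained(
    "bert-base-uncased",
    use_fast_bert_tokenizer=False,  # assumed flag selecting the BertTokenizerLayer backend
    preserve_unused_tokens=True,    # forwarded to the layer; name taken from the PR description
)
```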

@Rocketknight1 merged commit 0875b25 into main on Jun 20, 2023
@Rocketknight1 deleted the allow_tf_tokenizer_kwargs branch on June 20, 2023 at 11:49
Development

Successfully merging this pull request may close these issues.

TFBertTokenizer - support for "never_split"