
tokenizers.processors is not optional #1342

Closed
david-waterworth opened this issue Sep 20, 2023 · 6 comments
Comments

@david-waterworth

You can train and use a tokenizer without a processor, and if you don't intend to use the tokenizer with a transformer it's not really required. However, it's not possible to load back the tokenizer.json file that's created if the value of the "post_processor" key in the JSON file is null (which occurs when you save a model trained without a processor).

i.e.

tokenizers.Tokenizer.from_file(f"models/tokenizer/tokenizer.json")
Exception: data did not match any variant of untagged enum ModelWrapper at line 6032 column 3

If tokenizer.json contains

  "post_processor": null,

The version is tokenizers==0.13.3.
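
Roughly, the steps to reproduce look like the sketch below (the training data, trainer settings, and save path here are placeholders, not the exact ones I used):

import tokenizers
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build a BPE tokenizer without ever attaching a post-processor.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Split(
    tokenizers.Regex(r"\w+|[^\w\s]+"), behavior="isolated"
)
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["some training text", "another line"], trainer=trainer)

# No post_processor was set, so the saved JSON contains "post_processor": null.
# (The actual path used was models/tokenizer/tokenizer.json.)
tokenizer.save("tokenizer.json")

# This is where the exception above is raised:
reloaded = Tokenizer.from_file("tokenizer.json")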

@david-waterworth
Author

david-waterworth commented Sep 20, 2023

I think this is another case of #566 / #909

I used:

tokenizers.pre_tokenizers.Split(tokenizers.Regex(r"\w+|[^\w\s]+"), behavior='isolated')

I want to retain all characters in the text, including whitespace. But it looks like you cannot have whitespace in BPE merges, so using Split instead of ByteLevel results in an invalid JSON file.
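
To illustrate (made-up input): with behavior='isolated' the regex matches become separate pieces and the unmatched whitespace between them is kept as pieces too, which is how whitespace can end up in the learned merges.

import tokenizers
from tokenizers import pre_tokenizers

pre = pre_tokenizers.Split(tokenizers.Regex(r"\w+|[^\w\s]+"), behavior="isolated")
print(pre.pre_tokenize_str("foo bar-baz"))
# Roughly: [('foo', (0, 3)), (' ', (3, 4)), ('bar', (4, 7)), ('-', (7, 8)), ('baz', (8, 11))]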

@Narsil
Collaborator

Narsil commented Sep 21, 2023

Indeed #909 would fix it for the second part.

As stated in that old PR, changing the serialization format is something I wouldn't do lightly (we need to take extra care to be sure we can still load all existing tokenizers; definitely doable, but the test suite is still a bit light on that front currently).

Is this something that would be used for a real model?

@david-waterworth
Author

I worked around it. I wanted to generate tokenised text for NER annotation. Some tools, spacy/prodigy for example, assume that a token is followed by either 1 or 0 spaces (i.e. there's a ws property that's a boolean), which is unfortunate as it doesn't allow the original text to be recovered from a list of tokens. So I wanted to create whitespace tokens (my actual data usually isn't ws delimited; most of it uses punctuation like -_. etc.).

My actual model is trained using flair, and its Token object has an integer property that specifies the number of whitespace characters following a token (0, 1, ...), so I can use the offsets to generate this.
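
Roughly, the offset-based workaround looks like this (a simplified sketch, not my exact code):

def whitespace_after(text, offsets):
    """Count the whitespace characters following each token, given (start, end) offsets."""
    counts = []
    for i, (_, end) in enumerate(offsets):
        next_start = offsets[i + 1][0] if i + 1 < len(offsets) else len(text)
        counts.append(sum(1 for ch in text[end:next_start] if ch.isspace()))
    return counts

text = "foo  bar-baz"
offsets = [(0, 3), (5, 8), (8, 9), (9, 12)]   # e.g. taken from Encoding.offsets
print(whitespace_after(text, offsets))        # -> [2, 0, 0, 0]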

@Narsil
Collaborator

Narsil commented Sep 21, 2023

Feels a bit like a hack.

As I mentioned, I'd be glad to enable spaces in tokens at some point, just a bit scared of the amount of time & effort needed to make sure we're never breaking anything. (In theory the linked PR should be enough, but theory has a way of never working out like it should...)

@ArthurZucker
Collaborator

Hey! Seems like this is indeed a bug. Would you like to open a PR with a fix?
Can you confirm that you save the tokenizer using tokenizers.Tokenizer.save()?

@tommy447

tommy447 commented Sep 22, 2023 via email
