
tokenizers.processors is not optional #1342

Closed
david-waterworth opened this issue Sep 20, 2023 · 6 comments
Comments

@david-waterworth

You can train and use a tokenizer without a processor, and if you don't intend to use the tokenizer with a transformer it's not really required. However, it's not possible to load back the tokenizer.json file that's created if the value of the "post_processor" key in the JSON file is null (which occurs when you save a model trained without a processor).

i.e.

tokenizers.Tokenizer.from_file(f"models/tokenizer/tokenizer.json")
Exception: data did not match any variant of untagged enum ModelWrapper at line 6032 column 3

If tokenizer.json contains

  "post_processor": null,

The version is tokenizers==0.13.3.
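
Roughly, the steps to reproduce look like the sketch below (the training data, trainer settings, and save path here are placeholders, not the exact ones I used):

import tokenizers
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build a BPE tokenizer without ever attaching a post-processor.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Split(
    tokenizers.Regex(r"\w+|[^\w\s]+"), behavior="isolated"
)
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["some training text", "another line"], trainer=trainer)

# No post_processor was set, so the saved JSON contains "post_processor": null.
# (The actual path used was models/tokenizer/tokenizer.json.)
tokenizer.save("tokenizer.json")

# This is where the exception above is raised:
reloaded = Tokenizer.from_file("tokenizer.json")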

@david-waterworth
Author

david-waterworth commented Sep 20, 2023

I think this is another case of #566 / #909

I used:

tokenizers.pre_tokenizers.Split(tokenizers.Regex(r"\w+|[^\w\s]+"), behavior='isolated')

I want to retain all characters in the text, including whitespace. But it looks like you cannot have whitespace in BPE merges, so using Split instead of ByteLevel results in an invalid JSON file.
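
To illustrate (made-up input): with behavior='isolated' the regex matches become separate pieces and the unmatched whitespace between them is kept as pieces too, which is how whitespace can end up in the learned merges.

import tokenizers
from tokenizers import pre_tokenizers

pre = pre_tokenizers.Split(tokenizers.Regex(r"\w+|[^\w\s]+"), behavior="isolated")
print(pre.pre_tokenize_str("foo bar-baz"))
# Roughly: [('foo', (0, 3)), (' ', (3, 4)), ('bar', (4, 7)), ('-', (7, 8)), ('baz', (8, 11))]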

@Narsil
Collaborator

Narsil commented Sep 21, 2023

Indeed #909 would fix it for the second part.

As stated in that old PR, changing the serialization format is something I wouldn't do lightly (we need to take extra care to be sure we can still load all existing tokenizers; definitely doable, but the test suite is still a bit light on that front currently).

Is this something that would be used for a real model?

@david-waterworth
Author

I worked around it. I wanted to generate tokenised text for NER annotation. Some tools, spacy/prodigy for example, assume that a token is followed by either 1 or 0 spaces (i.e. there's a ws property that's a boolean), which is unfortunate as it doesn't allow the original text to be recovered from a list of tokens. So I wanted to create whitespace tokens (my actual data usually isn't ws delimited; most of it uses punctuation like -_. etc.).

My actual model is trained using flair, and its Token object has an integer property that specifies the number of whitespace characters following a token (0, 1, ...), so I can use the offsets to generate this.
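
Roughly, the offset-based workaround looks like this (a simplified sketch, not my exact code):

def whitespace_after(text, offsets):
    """Count the whitespace characters following each token, given (start, end) offsets."""
    counts = []
    for i, (_, end) in enumerate(offsets):
        next_start = offsets[i + 1][0] if i + 1 < len(offsets) else len(text)
        counts.append(sum(1 for ch in text[end:next_start] if ch.isspace()))
    return counts

text = "foo  bar-baz"
offsets = [(0, 3), (5, 8), (8, 9), (9, 12)]   # e.g. taken from Encoding.offsets
print(whitespace_after(text, offsets))        # -> [2, 0, 0, 0]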

@Narsil
Collaborator

Narsil commented Sep 21, 2023

Feels a bit like a hack.

As I mentioned, I'd be glad to enable spaces in tokens at some point, just a bit scared of the amount of time & effort needed to make sure we're never breaking anything. (In theory the linked PR should be enough, but theory has a way of never working out like it should...)

@ArthurZucker
Collaborator

Hey! Seems like this is indeed a bug. Would you like to open a PR with a fix?
Can you confirm that you save the tokenizer using tokenizers.Tokenizer.save()?

@tommy447

tommy447 commented Sep 22, 2023 via email
