tokenizers.processors is not optional #1342
I think this is another case of #566 / #909. I want to retain all characters in the text, including whitespace, but it looks like you cannot have whitespace in BPE merges.
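Roughly what I used (a minimal sketch, not the exact code; the corpus path, vocab size, and unk token are placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Deliberately no pre-tokenizer, so whitespace stays inside tokens
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=1000)
tokenizer.train(["corpus.txt"], trainer)

tokenizer.save("tokenizer.json")

# This fails: merges are serialized as space-separated pairs, so
# merges that themselves contain spaces cannot be read back
tokenizer = Tokenizer.from_file("tokenizer.json")
```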
Indeed, #909 would fix the second part. As stated in that old PR, changing the serialization format is something I wouldn't do lightly (we need to take extra care to be sure we can still load all existing tokenizers; definitely doable, but the test suite is still a bit light on that front currently). Is this something that would be used for a real model?
I worked around it. I wanted to generate tokenised text for NER annotation. Some tools like spacy/prodigy, for example, make the assumption that a token is followed by either 1 or 0 spaces (i.e. there's effectively a boolean whitespace flag). My actual model is trained using flair, and its token object has a property that's an integer specifying the number of whitespace characters following a token (0, 1, ...), so I can use the offsets to generate this.
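For reference, the offset-based workaround looks roughly like this (a sketch; it assumes every gap between consecutive token offsets is whitespace):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path

text = "Peter  Blackburn rejects BRUSSELS call"
enc = tokenizer.encode(text)

# flair-style whitespace count: the gap between this token's end
# offset and the next token's start offset
for i, (start, end) in enumerate(enc.offsets):
    next_start = enc.offsets[i + 1][0] if i + 1 < len(enc.offsets) else len(text)
    print(enc.tokens[i], next_start - end)
```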
Feels a bit like a hack. As I mentioned, I'd be glad to enable spaces in tokens at some point, just a bit scared of the amount of time & effort needed to make sure we're never breaking anything. (In theory the linked PR should be enough, but theory has a way of never working out like it should...)
Hey! Seems like this is indeed a bug. Would you like to open a PR with a fix?
You can train and use a tokenizer without a processor, and if you don't intend to use the tokenizer with a transformer it's not really required. However, it's not possible to load back the tokenizer.json file that's created if the value of the "post_processor" key in the JSON file is null (which occurs when you save a model trained without a processor).
i.e. if the saved tokenizer.json contains "post_processor": null, it cannot be loaded back.

The version is tokenizers==0.13.3.
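A minimal reproduction, assuming a fresh training run (the corpus path is a placeholder):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train(["corpus.txt"], BpeTrainer(special_tokens=["[UNK]"]))

# No post-processor was set, so the saved JSON gets "post_processor": null
tokenizer.save("tokenizer.json")

# Raises a deserialization error instead of loading a tokenizer
# with no post-processor
tokenizer = Tokenizer.from_file("tokenizer.json")
```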