White space at the start of each token_str. Could this affect backwards compatibility perhaps? Feature or Bug? #146

Open
david-clifford opened this issue Dec 20, 2024 · 1 comment

@david-clifford
I just happened to notice that calls to BERT output token_str = "word", whereas a similar call to ModernBERT outputs token_str = " word", with an additional whitespace at the start of the token_str. Could this create compatibility issues for people who want to swap ModernBERT into an existing workflow?

import torch
from transformers import pipeline
from pprint import pprint

input_text = "COVID is a [MASK]." 

pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)

results = pipe(input_text)
pprint(results)

pipe = pipeline(
    "fill-mask",
    model="bert-base-uncased",
    torch_dtype=torch.bfloat16,
)
results = pipe(input_text)
pprint(results)
@NohTow
Collaborator

NohTow commented Dec 20, 2024

Hello,

We realized while testing the model that, indeed, most of the tokenizer vocabulary starts with a whitespace.
This has certain implications. For example, we set the lstrip property of the [MASK] token to True so that mask infilling works even when the [MASK] token is placed after a space (as in your example); otherwise the model cannot generate the correct word, because that word already starts with the space the user typed!
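For illustration, here is a minimal sketch (assuming a recent transformers version, roughly 4.34+, where added_tokens_decoder is exposed) that checks the lstrip flag on [MASK] and shows how the space before it gets absorbed:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

# Find the AddedToken object for [MASK] and print its lstrip flag.
mask = next(t for t in tok.added_tokens_decoder.values() if t.content == tok.mask_token)
print(mask.content, "lstrip =", mask.lstrip)  # expected: [MASK] lstrip = True

# Because of lstrip, the space the user typed before [MASK] is stripped,
# so the model has to predict a token that brings the space back
# (i.e. one starting with " ").
print(tok.tokenize("COVID is a [MASK]."))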

In the case you describe, for example, the tokenizer strip the existing space before [MASK] (due to lstrip), and so the predicted token indeed only make sense in the context of the stripped sequence!
IIRC, when you use pipeline, there is an output called "sequence", where you can actually see the correct sequence, because it adds the generated tokens to the list of output ids from the tokenizer (so stripped)!
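As a quick, self-contained sketch of what I mean (each fill-mask prediction dict contains "sequence", "token_str", and "score"):

from transformers import pipeline

pipe = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
for r in pipe("COVID is a [MASK]."):
    # token_str keeps the leading space; "sequence" is the fully decoded sentence,
    # built from the (stripped) input ids plus the predicted token.
    print(repr(r["token_str"]), "->", r["sequence"])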

We are still exploring all the implications this could have for existing pipelines and would be happy to provide fixes if it breaks any of them, so please feel free to report any issue you are facing and we will do our best to help!
