I just noticed that calls to BERT output token_str = "word", whereas the equivalent call to ModernBERT outputs token_str = " word", with an additional whitespace before the token string. Could this create compatibility issues for people who want to swap ModernBERT into an existing workflow?
import torch
from transformers import pipeline
from pprint import pprint

input_text = "COVID is a [MASK]."

# ModernBERT: token_str comes back with a leading space, e.g. " word"
pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)
results = pipe(input_text)
pprint(results)

# BERT: token_str comes back without a leading space, e.g. "word"
pipe = pipeline(
    "fill-mask",
    model="bert-base-uncased",
    torch_dtype=torch.bfloat16,
)
results = pipe(input_text)
pprint(results)
We realized while testing the model that, indeed, most of the tokenizer's vocabulary entries start with a whitespace.
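You can verify this yourself with a quick count (a minimal sketch; it assumes the "Ġ" prefix, the standard BPE marker for a leading space, which ModernBERT's tokenizer uses):

import torch
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
vocab = tok.get_vocab()

# In BPE vocabularies, "Ġ" marks a token that begins with a space.
with_space = sum(1 for t in vocab if t.startswith("Ġ"))
print(f"{with_space} of {len(vocab)} tokens start with a space")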
This has certain implications. For example, we set the lstrip property of the [MASK] token to True so that mask infilling works even when [MASK] is added after a space (as in your example); otherwise the model could not generate the right word, since that word starts with the space the user already typed!
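You can inspect this setting directly (a minimal sketch; it assumes added_tokens_decoder, which maps special-token ids to their AddedToken objects in recent transformers versions):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
mask = tok.added_tokens_decoder[tok.mask_token_id]

# lstrip=True means any whitespace immediately before [MASK]
# is absorbed into the token during tokenization.
print(mask.lstrip)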
In the case you describe, the tokenizer strips the existing space before [MASK] (due to lstrip), so the predicted token indeed only makes sense in the context of the stripped sequence!
IIRC, when you use pipeline there is an output field called "sequence" where you can see the correct reconstructed text, because the pipeline inserts the generated token into the (stripped) list of output ids from the tokenizer.
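For example (a minimal sketch reusing the fill-mask pipeline from your snippet):

import torch
from transformers import pipeline

pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)

for r in pipe("COVID is a [MASK]."):
    # token_str carries the leading space, but "sequence" shows the
    # fully reconstructed text with correct spacing.
    print(repr(r["token_str"]), "->", r["sequence"])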
We are still exploring all the implications this could have for existing pipelines and would be happy to provide fixes if anything breaks, so please feel free to report any issue you are facing and we will do our best to help!