White space at the start of each token_str. Could this affect backwards compatibility perhaps? Feature or Bug? #146

Open
david-clifford opened this issue Dec 20, 2024 · 1 comment

@david-clifford
I just happened to notice that calls to BERT output token_str = "word", whereas a similar call to ModernBERT outputs token_str = " word", with an additional whitespace at the start of the token_str. Could this create compatibility issues for people who want to swap ModernBERT into an existing workflow?

import torch
from transformers import pipeline
from pprint import pprint

input_text = "COVID is a [MASK]." 

pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)

results = pipe(input_text)
pprint(results)

pipe = pipeline(
    "fill-mask",
    model="bert-base-uncased",
    torch_dtype=torch.bfloat16,
)
results = pipe(input_text)
pprint(results)
@NohTow
Collaborator

NohTow commented Dec 20, 2024

Hello,

We realized while testing the model that, indeed, most of the tokenizer vocabulary starts with a whitespace.
This has certain implications. For example, we set the lstrip property of the [MASK] token to True so that mask infilling works even when the [MASK] token is placed after a space (as in your example); otherwise the model cannot generate the correct word, because that word already starts with the space the user typed!
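For illustration, here is a minimal sketch (assuming a recent transformers version, roughly 4.34+, where added_tokens_decoder is exposed) that checks the lstrip flag on [MASK] and shows how the space before it gets absorbed:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

# Find the AddedToken object for [MASK] and print its lstrip flag.
mask = next(t for t in tok.added_tokens_decoder.values() if t.content == tok.mask_token)
print(mask.content, "lstrip =", mask.lstrip)  # expected: [MASK] lstrip = True

# Because of lstrip, the space the user typed before [MASK] is stripped,
# so the model has to predict a token that brings the space back
# (i.e. one starting with " ").
print(tok.tokenize("COVID is a [MASK]."))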

In the case you describe, for example, the tokenizer strip the existing space before [MASK] (due to lstrip), and so the predicted token indeed only make sense in the context of the stripped sequence!
IIRC, when you use pipeline, there is an output called "sequence", where you can actually see the correct sequence, because it adds the generated tokens to the list of output ids from the tokenizer (so stripped)!
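As a quick, self-contained sketch of what I mean (each fill-mask prediction dict contains "sequence", "token_str", and "score"):

from transformers import pipeline

pipe = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
for r in pipe("COVID is a [MASK]."):
    # token_str keeps the leading space; "sequence" is the fully decoded sentence,
    # built from the (stripped) input ids plus the predicted token.
    print(repr(r["token_str"]), "->", r["sequence"])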

We are still exploring all the implications this could have for existing pipelines and would be happy to provide fixes if it breaks any of them, so please feel free to report any issue you are facing and we will do our best to help!
