Strange implementation of convert_tokens_to_string in albert tokenizer (#11646)
Comments
Indeed, you're probably right! When updating the ALBERT tokenizer to use the […]. Makes me think we really should have integration tests for all tokenizers, as scenarios like this one are bound to happen.
Well, yes. While "adding subword regularization in more tokenizers" (#11417) I might do this as a PR in the next days (weeks); we will see.
PR with a fix started: #11716
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I am still working on this...
Fixed in #11716, closing here.
Hi,
the ALBERT tokenizer implements its own `convert_tokens_to_string` function:
transformers/src/transformers/models/albert/tokenization_albert.py (lines 222 to 223 in ba0d50f)
While the DeBERTa v2 and some other tokenizers just delegate this to the SentencePiece tokenizer:
transformers/src/transformers/models/deberta_v2/tokenization_deberta_v2.py (line 146 in ba0d50f)
IMO it would be better to always delegate to the sentencepiece tokenizer. What do you think?
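To make the difference concrete, here is a minimal sketch of the manual approach that tokenizers like ALBERT's use. This is an illustration, not the actual transformers code: it joins the pieces and replaces the SentencePiece word-boundary marker with a space. Delegating instead, e.g. via `self.sp_model.decode_pieces(tokens)`, lets SentencePiece itself invert its own encoding, so decoding stays consistent with how the model was trained.

```python
# Hypothetical illustration of the "manual" convert_tokens_to_string
# strategy (not the actual transformers implementation).
SPIECE_UNDERLINE = "\u2581"  # SentencePiece's word-boundary marker


def convert_tokens_to_string_manual(tokens):
    # Concatenate the pieces, then turn each boundary marker into a
    # space and strip the leading one.
    return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()


tokens = ["\u2581Hello", "\u2581world", "!"]
print(convert_tokens_to_string_manual(tokens))  # -> Hello world!
```

The manual version works for simple cases like the one above, but it bakes assumptions about the marker into Python code; delegating to the SentencePiece model avoids duplicating that logic in every tokenizer.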
PS:
Some more examples:
transformers/src/transformers/models/barthez/tokenization_barthez.py (lines 251 to 252 in ba0d50f)
transformers/src/transformers/models/camembert/tokenization_camembert.py (lines 251 to 252 in ba0d50f)
transformers/src/transformers/models/m2m_100/tokenization_m2m_100.py (lines 187 to 188 in ba0d50f)
transformers/src/transformers/models/mbart/tokenization_mbart50.py (lines 208 to 209 in ba0d50f)
transformers/src/transformers/models/speech_to_text/tokenization_speech_to_text.py (lines 169 to 173 in ba0d50f)
transformers/src/transformers/models/xlm_prophetnet/tokenization_xlm_prophetnet.py (lines 264 to 265 in ba0d50f)