
Strange implementation of convert_tokens_to_string in albert tokenizer. #11646

Closed
PhilipMay opened this issue May 9, 2021 · 6 comments
Labels: WIP

Comments

@PhilipMay (Contributor) commented May 9, 2021

Hi,

the albert tokenizer implements the convert_tokens_to_string function manually:

def convert_tokens_to_string(self, tokens):
    out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
    return out_string

While the deberta v2 tokenizer and some others just delegate this to the sentencepiece model:
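
A minimal sketch of that delegation, assuming the tokenizer keeps its sentencepiece model in self.sp_model (the actual DeBERTa-v2 code may differ in detail):

def convert_tokens_to_string(self, tokens):
    # Let the sentencepiece model undo its own piece splitting
    # instead of rebuilding the string by hand.
    return self.sp_model.decode_pieces(tokens)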

IMO it would be better to always delegate to the sentencepiece tokenizer. What do you think?

PS:

Some more examples of the same hand-rolled pattern from other tokenizers:

out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
return out_string

These two lines are repeated verbatim in several other sentencepiece-based tokenizers. One of them stacks an upper-casing option on top of the same pattern:

out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
if self.do_upper_case:
    out_string = out_string.upper()
return out_string
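
For comparison, a self-contained sketch of both approaches (the model file and example sentence are hypothetical; "▁" is the value of SPIECE_UNDERLINE):

import sentencepiece as spm

# Hypothetical model file; any trained sentencepiece model works here.
sp = spm.SentencePieceProcessor(model_file="spiece.model")

pieces = sp.encode("Hello world", out_type=str)  # e.g. ['▁Hello', '▁world']

# Manual reconstruction, as in the snippets above:
manual = "".join(pieces).replace("▁", " ").strip()

# Delegated reconstruction: sentencepiece undoes its own preprocessing.
delegated = sp.decode_pieces(pieces)

print(manual, delegated)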

@LysandreJik (Member) commented

Indeed, you're probably right! When updating the ALBERT tokenizer to use sentencepiece's decode instead of the manual handling, do all tests pass? Even the integration test?

Makes me think we really should have integration tests for all tokenizers, as scenarios like this one are bound to happen.
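
A sketch of what such a round-trip check could look like (the test name and the tokenizer fixture are hypothetical):

def test_detokenization_roundtrip(tokenizer):
    # A decoded token sequence should reproduce the (normalized) input text.
    text = "Hello world"
    tokens = tokenizer.tokenize(text)
    assert tokenizer.convert_tokens_to_string(tokens) == text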

@PhilipMay (Contributor, Author) commented May 10, 2021

Well, yes. While adding subword regularization in more tokenizers (#11417), I noticed that the tokenizers could benefit from some bigger refactoring. Pulling common functions into a base class would be nice, and tests could be added along the way. There is a lot of duplicate code there.

I might do this as a PR in the next days (or weeks) - we will see.
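
A rough sketch of the base-class idea (the mixin name is hypothetical, and it assumes each tokenizer stores its model in self.sp_model):

class SentencePieceStringMixin:
    # Hypothetical shared base: sentencepiece-backed tokenizers inherit
    # the delegation instead of each copy-pasting the same two lines.
    def convert_tokens_to_string(self, tokens):
        return self.sp_model.decode_pieces(tokens)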

@PhilipMay (Contributor, Author) commented

PR with a fix started: #11716

@github-actions (bot) commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@PhilipMay (Contributor, Author) commented

I am still working on this...

@LysandreJik added the WIP label on Jun 17, 2021
@PhilipMay (Contributor, Author) commented

Fixed in #11716, closing here.
