
Strange implementation of convert_tokens_to_string in albert tokenizer. #11646

Closed
PhilipMay opened this issue May 9, 2021 · 6 comments
Labels: WIP

Comments

@PhilipMay (Contributor) commented May 9, 2021

Hi,

the albert tokenizer implements the convert_tokens_to_string function manually:

def convert_tokens_to_string(self, tokens):
    out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
    return out_string

While the deberta v2 tokenizer and some others just delegate this to the sentencepiece model:
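
A minimal sketch of that delegation, assuming the tokenizer keeps its sentencepiece model in self.sp_model (the actual DeBERTa-v2 code may differ in detail):

def convert_tokens_to_string(self, tokens):
    # Let the sentencepiece model undo its own piece splitting
    # instead of rebuilding the string by hand.
    return self.sp_model.decode_pieces(tokens)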

IMO it would be better to always delegate to the sentencepiece tokenizer. What do you think?

PS:

Some more examples of the same hand-rolled pattern from other tokenizers:

out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
return out_string

These two lines are repeated verbatim in several other sentencepiece-based tokenizers. One of them stacks an upper-casing option on top of the same pattern:

out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
if self.do_upper_case:
    out_string = out_string.upper()
return out_string
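
For comparison, a self-contained sketch of both approaches (the model file and example sentence are hypothetical; "▁" is the value of SPIECE_UNDERLINE):

import sentencepiece as spm

# Hypothetical model file; any trained sentencepiece model works here.
sp = spm.SentencePieceProcessor(model_file="spiece.model")

pieces = sp.encode("Hello world", out_type=str)  # e.g. ['▁Hello', '▁world']

# Manual reconstruction, as in the snippets above:
manual = "".join(pieces).replace("▁", " ").strip()

# Delegated reconstruction: sentencepiece undoes its own preprocessing.
delegated = sp.decode_pieces(pieces)

print(manual, delegated)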

@LysandreJik (Member) commented

Indeed, you're probably right! When updating the ALBERT tokenizer to use sentencepiece's decode instead of the manual handling, do all tests pass? Even the integration test?

Makes me think we really should have integration tests for all tokenizers, as scenarios like this one are bound to happen.
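
A sketch of what such a round-trip check could look like (the test name and the tokenizer fixture are hypothetical):

def test_detokenization_roundtrip(tokenizer):
    # A decoded token sequence should reproduce the (normalized) input text.
    text = "Hello world"
    tokens = tokenizer.tokenize(text)
    assert tokenizer.convert_tokens_to_string(tokens) == text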

@PhilipMay (Contributor, Author) commented May 10, 2021

Well, yes. While adding subword regularization in more tokenizers (#11417), I noticed that the tokenizers could benefit from some bigger refactoring. Pulling common functions into a base class would be nice, and tests could be added along the way. There is a lot of duplicate code there.

I might do this as a PR in the next days (or weeks) - we will see.
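
A rough sketch of the base-class idea (the mixin name is hypothetical, and it assumes each tokenizer stores its model in self.sp_model):

class SentencePieceStringMixin:
    # Hypothetical shared base: sentencepiece-backed tokenizers inherit
    # the delegation instead of each copy-pasting the same two lines.
    def convert_tokens_to_string(self, tokens):
        return self.sp_model.decode_pieces(tokens)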

@PhilipMay (Contributor, Author) commented

PR with a fix started: #11716

@github-actions (bot) commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@PhilipMay (Contributor, Author) commented

I am still working on this...

@LysandreJik added the WIP label on Jun 17, 2021
@PhilipMay (Contributor, Author) commented

Fixed in #11716, closing here.
