Add missing tokenizer tests - Longformer #17677

Merged

Conversation

tgadeliya
Contributor

What does this PR do?

This PR adds tests for the Longformer tokenizer by copying them from the RoBERTa tokenizer's test suite, because the two tokenizers are identical.

Fixes #16627

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@SaulLu @LysandreJik

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jun 11, 2022

The documentation is not available anymore as the PR was closed or merged.

@tgadeliya
Contributor Author

I read the discussion in the merged tokenizer-test PRs and the post "Don't Repeat Yourself*" on the HF blog, and I manually added the "copying mechanism". However, I don't understand how it works, so I tried not to change the test code copied from the RoBERTa tokenizer tests. If modifying the code is not a problem, I would like to make some minor changes, e.g. delete commented-out code and split the big test into smaller ones.
Could you describe how the "copying mechanism" works in more detail?

@LysandreJik LysandreJik requested a review from SaulLu June 13, 2022 07:24
@github-actions github-actions bot closed this Jul 21, 2022
@huggingface huggingface deleted a comment from github-actions bot Jul 22, 2022
@SaulLu SaulLu reopened this Jul 22, 2022
@SaulLu
Contributor

SaulLu commented Jul 22, 2022

Thanks a lot for working on this @tgadeliya!!

As far as I know, there are no established "practices" for this case (cc @LysandreJik in case you have another opinion). Nevertheless, if the changes are relevant, they are of course welcome. For example, the changes made can be indicated in the "Copied from" comment, as is done here:

# Copied from transformers.models.bert.modeling_bert.BertIntermediate with Bert->Deberta
class DebertaIntermediate(nn.Module):

If the differences are too long to list, perhaps the comment can just explain why the code diverged from the original it was copied from.
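
To make the convention concrete, here is a minimal sketch of how a consistency check for such comments could work. It is only an illustration, not the repository's actual checker: `parse_marker`, `expected_copy`, and `is_in_sync` are hypothetical helper names, the marker grammar is simplified to the single form shown in the example comment, and the original source is passed in as a string instead of being looked up from the dotted path.

```python
import re

# Assumed, simplified marker grammar:
#   "# Copied from <dotted.path>" optionally followed by "with Old->New, Old2->New2"
COPIED_FROM = re.compile(
    r"#\s*Copied from\s+(?P<source>[\w.]+)(?:\s+with\s+(?P<rules>.+))?$"
)


def parse_marker(line):
    """Return (source_path, [(old, new), ...]) for a marker line, else None."""
    match = COPIED_FROM.match(line.strip())
    if match is None:
        return None
    rules = []
    if match.group("rules"):
        for rule in match.group("rules").split(","):
            old, new = rule.strip().split("->")
            rules.append((old.strip(), new.strip()))
    return match.group("source"), rules


def expected_copy(original_code, rules):
    """Apply each Old->New replacement to the original source text."""
    for old, new in rules:
        original_code = original_code.replace(old, new)
    return original_code


def is_in_sync(original_code, copied_code, rules):
    """True if the copy equals the original after the declared replacements."""
    return expected_copy(original_code, rules) == copied_code


marker = (
    "# Copied from transformers.models.bert.modeling_bert.BertIntermediate "
    "with Bert->Deberta"
)
source, rules = parse_marker(marker)

# Toy stand-ins for the real class bodies (hypothetical, for illustration only).
original = "class BertIntermediate(nn.Module):\n    pass\n"
copied = "class DebertaIntermediate(nn.Module):\n    pass\n"

print(source, rules, is_in_sync(original, copied, rules))
```

In this sketch, a copy that has been edited by hand no longer matches the original after the replacements, so `is_in_sync` returns `False`. That is the situation in which the comment would need to be updated or removed, with a short explanation of why the code diverged.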

Does this help you?

@huggingface huggingface deleted a comment from github-actions bot Aug 16, 2022
@tgadeliya
Contributor Author

@SaulLu, Sorry for the late reply. Summer is ending :)

Thanks for your comment. Now it is clear to me. Actually, after weighing the pros and cons, I came to the conclusion that cleaning up the code is not really necessary. So this PR can be reviewed and merged.

@SaulLu SaulLu changed the title [WIP] Add missing tokenizer tests - Longformer Add missing tokenizer tests - Longformer Aug 19, 2022
Contributor

@SaulLu SaulLu left a comment

Sounds great to me! Can you just merge/rebase on main so we can merge your PR?

@tgadeliya
Contributor Author

@SaulLu I refreshed this PR, so now it is ready to merge

@SaulLu SaulLu merged commit 0f257a8 into huggingface:main Aug 22, 2022
@SaulLu
Contributor

SaulLu commented Aug 22, 2022

Thanks @tgadeliya 🤗

Development

Successfully merging this pull request may close these issues.

Add missing tokenizer test files [:building_construction: in progress]
3 participants