
fix OBO error while reading vocab files with empty lines in BERT tokenizer #1841

Merged (1 commit) on Jul 18, 2022

Conversation

@parmeet (Contributor) commented on Jul 18, 2022

fixes #1840

@Nayef211 (Contributor) left a comment

Overall LGTM. I was wondering if we considered modifying the existing _load_vocab_from_file implementation to add a flag to include newline characters? I know that the existing implementation has options for multithreading as well as other perf optimizations (which may not be very relevant when loading a smaller vocab file, as is likely the case for the BERT vocab).

@parmeet (Contributor, Author) commented on Jul 18, 2022

I was wondering if we considered modifying the existing _load_vocab_from_file implementation to add a flag to include newline characters?

This is special handling just to accommodate vocab files from HF, so I have created an explicit reading utility inside BERTTokenizer. I wasn't sure whether, in general, newlines are part of a vocab, and whether we should add this behavior to _load_vocab_from_file.
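For context, here is a minimal sketch of the off-by-one issue under discussion. The function names below are hypothetical illustrations, not the actual torchtext implementation: in an HF-style vocab file, a token's id is its line number, so a reader that silently drops empty lines shifts the id of every token that follows one.

```python
# Illustrative sketch of the OBO error (hypothetical helper names,
# not the code added in this PR).

def read_vocab_skipping_blanks(path):
    # Buggy behavior: skipping empty lines shifts the ids of all
    # tokens that come after them.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]

def read_vocab_keeping_blanks(path):
    # Fixed behavior: keeping empty lines preserves the
    # line-number -> token-id mapping assumed by HF vocab files.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Example: for a vocab file containing the lines ["[PAD]", "", "hello"],
# "hello" should get id 2, but the skipping reader assigns it id 1.
```

As the reply above notes, the fix keeps this HF-specific reading logic local to BERTTokenizer rather than changing the general-purpose _load_vocab_from_file.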

@parmeet merged commit bb58f6e into pytorch:main on Jul 18, 2022
@parmeet deleted the fix_OBO_error branch on July 18, 2022 at 17:31
Merging this pull request closes: Off by one (OBO) error in torchtext implementation of BERTTokenizer (#1840)