
fix OBO error while reading vocab files with empty lines in BERT tokenizer #1841

Merged (1 commit) on Jul 18, 2022

Conversation

@parmeet (Contributor) commented on Jul 18, 2022

fixes #1840

@Nayef211 (Contributor) left a comment

Overall LGTM. I was wondering if we considered modifying the existing _load_vocab_from_file implementation to add a flag to include newline characters? I know that the existing implementation has options for multithreading as well as other perf optimizations (which may not be very relevant when loading a smaller vocab file, as is likely the case for the BERT vocab).

@parmeet (Contributor, Author) commented on Jul 18, 2022

I was wondering if we considered modifying the existing _load_vocab_from_file implementation to add a flag to include newline characters?

This is special handling just to accommodate vocab files from HF, so I have created an explicit reading utility inside BERTTokenizer. I wasn't sure whether, in general, newlines are part of a vocab, and whether we should add this behavior to _load_vocab_from_file.
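For context, here is a minimal sketch of the off-by-one issue under discussion. The function names below are hypothetical illustrations, not the actual torchtext implementation: in an HF-style vocab file, a token's id is its line number, so a reader that silently drops empty lines shifts the id of every token that follows one.

```python
# Illustrative sketch of the OBO error (hypothetical helper names,
# not the code added in this PR).

def read_vocab_skipping_blanks(path):
    # Buggy behavior: skipping empty lines shifts the ids of all
    # tokens that come after them.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]

def read_vocab_keeping_blanks(path):
    # Fixed behavior: keeping empty lines preserves the
    # line-number -> token-id mapping assumed by HF vocab files.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Example: for a vocab file containing the lines ["[PAD]", "", "hello"],
# "hello" should get id 2, but the skipping reader assigns it id 1.
```

As the reply above notes, the fix keeps this HF-specific reading logic local to BERTTokenizer rather than changing the general-purpose _load_vocab_from_file.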

@parmeet merged commit bb58f6e into pytorch:main on Jul 18, 2022
@parmeet deleted the fix_OBO_error branch on July 18, 2022 at 17:31
Merging this pull request closes: Off by one (OBO) error in torchtext implementation of BERTTokenizer (#1840)