
Tokenization Course Issues #121

Open
KeremTurgutlu opened this issue Apr 14, 2022 · 4 comments

Comments

@KeremTurgutlu (Contributor)

Hello,

I believe the corpus and the word_freqs output used in the BPE / WordPiece implementations have a mismatch: the corpus uses the lowercase "course", but word_freqs seems to use the capitalized "Course".

To reproduce

```python
from collections import defaultdict

from transformers import AutoTokenizer

corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    words = [word for word, _ in words_with_offsets]
    for word in words:
        word_freqs[word] += 1

# The course shows the word_freqs below (with 'Course'); with the corpus above
# (lowercase "course.") this assertion fails.
assert word_freqs == defaultdict(int, {'This': 3, 'is': 2, 'the': 1, 'Hugging': 1, 'Face': 1, 'Course': 1, '.': 4, 'chapter': 1, 'about': 1,
    'tokenization': 1, 'section': 1, 'shows': 1, 'several': 1, 'tokenizer': 1, 'algorithms': 1, 'Hopefully': 1,
    ',': 1, 'you': 1, 'will': 1, 'be': 1, 'able': 1, 'to': 1, 'understand': 1, 'how': 1, 'they': 1, 'are': 1,
    'trained': 1, 'and': 1, 'generate': 1, 'tokens': 1})
```
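Running the snippet above against the corpus as written, only the lowercase key actually ends up in word_freqs:

```python
print("course" in word_freqs)  # True  -- what the code above actually produces
print("Course" in word_freqs)  # False -- what the word_freqs printed in the course contains
```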
KeremTurgutlu changed the title from "Tokenization Corpus Minor Difference" to "Tokenization Course Issues" on Apr 15, 2022
@KeremTurgutlu (Contributor, Author) commented Apr 15, 2022

In WordPiece, if you go to the line where we train the tokenizer and print the learned vocab:

```python
print(vocab)
```

the vocab from this print statement is missing the merge `ab` and has 69 merges, although `vocab_size` is set to 70.
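A minimal check of the off-by-one, assuming `vocab` and `vocab_size` are the variables from the course's WordPiece training loop:

```python
# Assumes `vocab` is the list printed above and that `vocab_size` was set to 70,
# as in the course's WordPiece section.
print(len(vocab))                # 69 here rather than the expected 70
assert len(vocab) == vocab_size  # fails, matching the missing merge described above
```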

@KeremTurgutlu (Contributor, Author)

The same typo (Course -> course) is also present in Unigram. The final tokenization assumes the capitalized Course is used and results in ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁', 'c', 'ou', 'r', 's', 'e', '.']. However, if the lowercase course is used, the tokenization would be ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁course.'].
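To make the difference concrete (a sketch, assuming the tokenize(text, model) helper and the model trained at the end of the course's Unigram section):

```python
# Assumes `tokenize(text, model)` and `model` come from the course's Unigram
# section, with the model trained on the corpus as written (lowercase "course.").
print(tokenize("This is the Hugging Face course.", model))
# ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁course.']

# The output printed in the course,
# ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁', 'c', 'ou', 'r', 's', 'e', '.'],
# only comes out if the training data used the capitalized "Course".
```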

@lewtun (Member) commented Apr 20, 2022

Thanks for reporting these typos @KeremTurgutlu - you're totally right that the capitalization isn't applied consistently. I think the simplest change would be to capitalise Course in the corpus list - would you like to open a PR with the fixes?
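Concretely, the change would just be to the first sentence of the corpus list, e.g.:

```python
corpus = [
    "This is the Hugging Face Course.",  # capitalized to match the word_freqs shown in the course
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]
```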

@KeremTurgutlu (Contributor, Author)

@lewtun created #166
