Tokenization Course Issues #121
Comments
In the WordPiece section, if you go to the line where we train the tokenizer and print the learned vocab:
vocab from this print statement is missing the merge |
Same typo |
Thanks for reporting these typos @KeremTurgutlu - you're totally right that the capitalization isn't applied consistently. I think the simplest change would be to capitalise |
Hello,
I believe the corpus and the word_freqs output used in the BPE / WordPiece implementations have a mismatch. Simply put (Course -> course), the word is not capitalized in the corpus, but the word_freqs output seems to use the capitalized version.
To reproduce
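A minimal sketch of how the mismatch could be checked, not the course's actual code. The two-sentence corpus and the regex-based word split below are assumptions made for illustration (the helper name count_words is hypothetical); the only point is that a corpus containing lowercase "course" cannot produce a capitalized "Course" key in word_freqs.

```python
import re
from collections import Counter

# Hypothetical stand-in for the course corpus (assumption);
# what matters is that "course" is lowercase here.
corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
]


def count_words(texts):
    """Count word and punctuation frequencies with a simple regex split.

    This is an assumed stand-in for the pre-tokenizer used in the course.
    """
    freqs = Counter()
    for text in texts:
        # \w+ matches runs of word characters; [^\w\s] matches single punctuation marks.
        freqs.update(re.findall(r"\w+|[^\w\s]", text))
    return freqs


word_freqs = count_words(corpus)

# Case is preserved, so only the lowercase spelling shows up as a key.
print("course" in word_freqs, "Course" in word_freqs)
```

If the printed word_freqs in the course text shows a capitalized key while the corpus uses the lowercase word, one of the two was presumably edited without re-running the cell.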