
Tokenization Course Issues #121

Open
KeremTurgutlu opened this issue Apr 14, 2022 · 4 comments

Comments

@KeremTurgutlu (Contributor)

Hello,

I believe the corpus and the word_freqs output used in the BPE / WordPiece implementations have a mismatch: the corpus uses the lowercase "course", but word_freqs seems to use the capitalized "Course".

To reproduce

```python
from collections import defaultdict

from transformers import AutoTokenizer

corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    words = [word for word, _ in words_with_offsets]
    for word in words:
        word_freqs[word] += 1

# The course shows the word_freqs below (with 'Course'); with the corpus above
# (lowercase "course.") this assertion fails.
assert word_freqs == defaultdict(int, {'This': 3, 'is': 2, 'the': 1, 'Hugging': 1, 'Face': 1, 'Course': 1, '.': 4, 'chapter': 1, 'about': 1,
    'tokenization': 1, 'section': 1, 'shows': 1, 'several': 1, 'tokenizer': 1, 'algorithms': 1, 'Hopefully': 1,
    ',': 1, 'you': 1, 'will': 1, 'be': 1, 'able': 1, 'to': 1, 'understand': 1, 'how': 1, 'they': 1, 'are': 1,
    'trained': 1, 'and': 1, 'generate': 1, 'tokens': 1})
```
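Running the snippet above against the corpus as written, only the lowercase key actually ends up in word_freqs:

```python
print("course" in word_freqs)  # True  -- what the code above actually produces
print("Course" in word_freqs)  # False -- what the word_freqs printed in the course contains
```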
KeremTurgutlu changed the title from "Tokenization Corpus Minor Difference" to "Tokenization Course Issues" on Apr 15, 2022
@KeremTurgutlu (Contributor, Author) commented Apr 15, 2022

In WordPiece, if you go to the line where we train the tokenizer and print the learned vocab:

```python
print(vocab)
```

the vocab from this print statement is missing the merge `ab` and has 69 merges, although `vocab_size` is set to 70.
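A minimal check of the off-by-one, assuming `vocab` and `vocab_size` are the variables from the course's WordPiece training loop:

```python
# Assumes `vocab` is the list printed above and that `vocab_size` was set to 70,
# as in the course's WordPiece section.
print(len(vocab))                # 69 here rather than the expected 70
assert len(vocab) == vocab_size  # fails, matching the missing merge described above
```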

@KeremTurgutlu (Contributor, Author)

The same typo (Course -> course) is also present in Unigram. The final tokenization assumes the capitalized Course is used and results in ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁', 'c', 'ou', 'r', 's', 'e', '.']. However, if the lowercase course is used, the tokenization would be ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁course.'].
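To make the difference concrete (a sketch, assuming the tokenize(text, model) helper and the model trained at the end of the course's Unigram section):

```python
# Assumes `tokenize(text, model)` and `model` come from the course's Unigram
# section, with the model trained on the corpus as written (lowercase "course.").
print(tokenize("This is the Hugging Face course.", model))
# ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁course.']

# The output printed in the course,
# ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁', 'c', 'ou', 'r', 's', 'e', '.'],
# only comes out if the training data used the capitalized "Course".
```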

@lewtun (Member) commented Apr 20, 2022

Thanks for reporting these typos @KeremTurgutlu - you're totally right that the capitalization isn't applied consistently. I think the simplest change would be to capitalise Course in the corpus list - would you like to open a PR with the fixes?
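Concretely, the change would just be to the first sentence of the corpus list, e.g.:

```python
corpus = [
    "This is the Hugging Face Course.",  # capitalized to match the word_freqs shown in the course
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]
```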

@KeremTurgutlu (Contributor, Author)

@lewtun created #166
