
One problem in the annotations of test_wikipedia_example in the tests/test_tokenizer file #93

Open
donglinkang2021 opened this issue Nov 18, 2024 · 0 comments


Thanks to Andrej for his great YouTube video and this repo.
I think there may be something wrong with the Wikipedia annotation here: in the function test_wikipedia_example, the docstring says

According to Wikipedia, running bpe on the input string:
"aaabdaaabac"

for 3 merges will result in string:
"XdXac"

where:
X=ZY
Y=ab
Z=aa

Keep in mind that for us a=97, b=98, c=99, d=100 (ASCII values)
so Z will be 256, Y will be 257, X will be 258.

So I verified the merge process by hand, as follows:

  1. Initial encoding (UTF-8 bytes):

    [97, 97, 97, 98, 100, 97, 97, 97, 98, 97, 99]
    
  2. First merge: (97, 97) has the most occurrences, so merge it into 256 (Z = aa -> 256):

    [256, 97, 98, 100, 256, 97, 98, 97, 99]
    
  3. Second merge: (256, 97) and (97, 98) are tied for the most occurrences, so the first pair encountered is merged (Y = Za -> 257):

    [257, 98, 100, 257, 98, 97, 99]
    
  4. Third merge: (257, 98) has the most occurrences, so merge it into 258 (X = Yb -> 258):

    [258, 100, 258, 97, 99]
    
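The tie at the second merge can be double-checked with a quick pair count. This is my own sketch using Python's collections.Counter, not code from the repo; the point is that (256, 97) and (97, 98) each occur twice, and (256, 97) appears first in the sequence, so it is the pair minbpe merges next:

```python
from collections import Counter

# token sequence after the first merge (Z = aa -> 256)
ids = [256, 97, 98, 100, 256, 97, 98, 97, 99]

# count all adjacent pairs
pairs = Counter(zip(ids, ids[1:]))
print(pairs.most_common())
# (256, 97) and (97, 98) both occur twice; (256, 97) is seen first,
# so it wins the tie and becomes the second merge (Y = Za -> 257)
```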

Below is the verification code (fortunately, Andrej added a verbose parameter to .train(), so I didn't need to write any extra printing code for verification):

from minbpe import BasicTokenizer
tokenizer = BasicTokenizer()
text = "aaabdaaabac"

ids = tokenizer.encode(text)
assert ids == [97, 97, 97, 98, 100, 97, 97, 97, 98, 97, 99]
assert tokenizer.decode(ids) == text

# three merges
# tip: in VS Code, Ctrl + D selects successive occurrences of the same pair
# [97, 97, 97, 98, 100, 97, 97, 97, 98, 97, 99]
# aa -> [97, 97] -> 256
# [256, 97, 98, 100, 256, 97, 98, 97, 99]
# aaa -> [256, 97] -> 257
# [257, 98, 100, 257, 98, 97, 99]
# aaab -> [257, 98] -> 258
# [258, 100, 258, 97, 99]

tokenizer.train(text, 256 + 3, verbose=True) 
ids = tokenizer.encode(text)
assert ids == [258, 100, 258, 97, 99]

The result is as follows:

merge 1/3: (97, 97) -> 256 (b'aa') had 4 occurrences
merge 2/3: (256, 97) -> 257 (b'aaa') had 2 occurrences
merge 3/3: (257, 98) -> 258 (b'aaab') had 2 occurrences
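The same three merges can also be reproduced without minbpe at all. Here is a minimal standalone sketch of the greedy pair-merge loop (my own reimplementation for illustration, not the repo's code; it assumes ties go to the first pair encountered, which matches the verbose output above):

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair in ids with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # [97, 97, 97, 98, 100, ...]
for step in range(3):
    stats = Counter(zip(ids, ids[1:]))
    pair = max(stats, key=stats.get)  # ties: first-seen pair wins
    ids = merge(ids, pair, 256 + step)
    print(f"merge {step + 1}: {pair} -> {256 + step}, ids = {ids}")

assert ids == [258, 100, 258, 97, 99]
```

Running this prints the same three merges as the verbose training output: (97, 97) -> 256, (256, 97) -> 257, (257, 98) -> 258.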

As you can see, the result is inconsistent with the Wikipedia annotation quoted above, which should be corrected as follows:

for 3 merges will result in string:
"XdXac"

where:
X=Yb # original is ZY
Y=Za # original is ab
Z=aa
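As a sanity check, expanding "XdXac" with the corrected definitions recovers the original string. A small illustrative snippet (the rules dict is my own, mapping each symbol to its expansion):

```python
# corrected merge rules: Z = aa, Y = Za, X = Yb
rules = {"Z": "aa", "Y": "Za", "X": "Yb"}

def expand(s):
    # keep substituting until no rule symbols remain
    while any(sym in s for sym in rules):
        for sym, rep in rules.items():
            s = s.replace(sym, rep)
    return s

assert expand("XdXac") == "aaabdaaabac"
```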