
One problem in the annotations of test_wikipedia_example in the tests/test_tokenizer file #93

Open
donglinkang2021 opened this issue Nov 18, 2024 · 0 comments


Thanks to Andrej for his great YouTube video and this repo.
I think there may be something wrong with the Wikipedia annotation here: in the function test_wikipedia_example, the docstring says

According to Wikipedia, running bpe on the input string:
"aaabdaaabac"

for 3 merges will result in string:
"XdXac"

where:
X=ZY
Y=ab
Z=aa

Keep in mind that for us a=97, b=98, c=99, d=100 (ASCII values)
so Z will be 256, Y will be 257, X will be 258.

So I verified the merge process by hand, as follows:

  1. Initial encoding (UTF-8 bytes):

    [97, 97, 97, 98, 100, 97, 97, 97, 98, 97, 99]
    
  2. First merge: (97, 97) has the most occurrences, so merge it into 256 (Z = aa -> 256):

    [256, 97, 98, 100, 256, 97, 98, 97, 99]
    
  3. Second merge: (256, 97) and (97, 98) are tied for the most occurrences, so the first pair encountered is merged (Y = Za -> 257):

    [257, 98, 100, 257, 98, 97, 99]
    
  4. Third merge: (257, 98) has the most occurrences, so merge it into 258 (X = Yb -> 258):

    [258, 100, 258, 97, 99]
    
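The tie at the second merge can be double-checked with a quick pair count. This is my own sketch using Python's collections.Counter, not code from the repo; the point is that (256, 97) and (97, 98) each occur twice, and (256, 97) appears first in the sequence, so it is the pair minbpe merges next:

```python
from collections import Counter

# token sequence after the first merge (Z = aa -> 256)
ids = [256, 97, 98, 100, 256, 97, 98, 97, 99]

# count all adjacent pairs
pairs = Counter(zip(ids, ids[1:]))
print(pairs.most_common())
# (256, 97) and (97, 98) both occur twice; (256, 97) is seen first,
# so it wins the tie and becomes the second merge (Y = Za -> 257)
```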

Below is the verification code (fortunately, Andrej added a verbose parameter to .train(), so I didn't need to write any extra printing code for verification):

from minbpe import BasicTokenizer
tokenizer = BasicTokenizer()
text = "aaabdaaabac"

ids = tokenizer.encode(text)
assert ids == [97, 97, 97, 98, 100, 97, 97, 97, 98, 97, 99]
assert tokenizer.decode(ids) == text

# three merges
# tip: in VS Code, Ctrl + D selects successive occurrences of the same pair
# [97, 97, 97, 98, 100, 97, 97, 97, 98, 97, 99]
# aa -> [97, 97] -> 256
# [256, 97, 98, 100, 256, 97, 98, 97, 99]
# aaa -> [256, 97] -> 257
# [257, 98, 100, 257, 98, 97, 99]
# aaab -> [257, 98] -> 258
# [258, 100, 258, 97, 99]

tokenizer.train(text, 256 + 3, verbose=True) 
ids = tokenizer.encode(text)
assert ids == [258, 100, 258, 97, 99]

The result is as follows:

merge 1/3: (97, 97) -> 256 (b'aa') had 4 occurrences
merge 2/3: (256, 97) -> 257 (b'aaa') had 2 occurrences
merge 3/3: (257, 98) -> 258 (b'aaab') had 2 occurrences
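The same three merges can also be reproduced without minbpe at all. Here is a minimal standalone sketch of the greedy pair-merge loop (my own reimplementation for illustration, not the repo's code; it assumes ties go to the first pair encountered, which matches the verbose output above):

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair in ids with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # [97, 97, 97, 98, 100, ...]
for step in range(3):
    stats = Counter(zip(ids, ids[1:]))
    pair = max(stats, key=stats.get)  # ties: first-seen pair wins
    ids = merge(ids, pair, 256 + step)
    print(f"merge {step + 1}: {pair} -> {256 + step}, ids = {ids}")

assert ids == [258, 100, 258, 97, 99]
```

Running this prints the same three merges as the verbose training output: (97, 97) -> 256, (256, 97) -> 257, (257, 98) -> 258.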

As you can see, the result is inconsistent with the Wikipedia annotation quoted above, which should be corrected as follows:

for 3 merges will result in string:
"XdXac"

where:
X=Yb # original is ZY
Y=Za # original is ab
Z=aa
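As a sanity check, expanding "XdXac" with the corrected definitions recovers the original string. A small illustrative snippet (the rules dict is my own, mapping each symbol to its expansion):

```python
# corrected merge rules: Z = aa, Y = Za, X = Yb
rules = {"Z": "aa", "Y": "Za", "X": "Yb"}

def expand(s):
    # keep substituting until no rule symbols remain
    while any(sym in s for sym in rules):
        for sym, rep in rules.items():
            s = s.replace(sym, rep)
    return s

assert expand("XdXac") == "aaabdaaabac"
```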