Thanks, Andrej, for the great YouTube video and this repo.
I think there may be a mistake in the Wikipedia annotation in the function test_wikipedia_example, which says:
According to Wikipedia, running bpe on the input string:
"aaabdaaabac"
for 3 merges will result in string:
"XdXac"
where:
X=ZY
Y=ab
Z=aa
Keep in mind that for us a=97, b=98, c=99, d=100 (ASCII values)
so Z will be 256, Y will be 257, X will be 258.
So I manually verified the process myself, as follows:
Initial encoding:
[97, 97, 97, 98, 100, 97, 97, 97, 98, 97, 99]
First merge: (97, 97) has the most occurrences, so it is merged into 256 (Z = aa -> 256):
[256, 97, 98, 100, 256, 97, 98, 97, 99]
Second merge: (256, 97) and (97, 98) are tied for the most occurrences, and the first one is merged (Y = Za -> 257):
[257, 98, 100, 257, 98, 97, 99]
Third merge: (257, 98) has the most occurrences, so it is merged into 258 (X = Yb -> 258):
[258, 100, 258, 97, 99]
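The manual steps above can also be reproduced without the library. The following is a minimal sketch, not the repo's actual code: the helper names `get_stats`, `merge`, and `train_steps` are chosen here for illustration, and it assumes ties are broken in favor of the pair seen first in the sequence, which is what the second merge above relies on.

```python
def get_stats(ids):
    """Count each adjacent pair; dict order follows first appearance."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge(ids, pair, idx):
    """Replace every non-overlapping occurrence of `pair` with `idx`, left to right."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_steps(text, num_merges):
    """Return (pair, new_id, ids_after_merge) for each merge step."""
    ids = list(text.encode("utf-8"))
    steps = []
    for i in range(num_merges):
        stats = get_stats(ids)
        # Assumption: on a tie, max() returns the first-seen pair
        # (dicts preserve insertion order).
        pair = max(stats, key=stats.get)
        idx = 256 + i
        ids = merge(ids, pair, idx)
        steps.append((pair, idx, list(ids)))
    return steps


for pair, idx, ids in train_steps("aaabdaaabac", 3):
    print(pair, "->", idx, ids)
```

With this tie-breaking rule, the second step picks (256, 97) rather than (97, 98), because (256, 97) appears first in the sequence.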
Below is the verification code (fortunately, Andrej added a verbose parameter to .train(), so I don't need to write any extra printing code for verification):
from minbpe import BasicTokenizer

tokenizer = BasicTokenizer()
text = "aaabdaaabac"
ids = tokenizer.encode(text)
assert ids == [97, 97, 97, 98, 100, 97, 97, 97, 98, 97, 99]
assert tokenizer.decode(ids) == text

# three merges
# you can use vscode `Ctrl + D` to select all occurrences of the same pair
# [97, 97, 97, 98, 100, 97, 97, 97, 98, 97, 99]
# aa -> [97, 97] -> 256
# [256, 97, 98, 100, 256, 97, 98, 97, 99]
# aaa -> [256, 97] -> 257
# [257, 98, 100, 257, 98, 97, 99]
# aaab -> [257, 98] -> 258
# [258, 100, 258, 97, 99]
tokenizer.train(text, 256 + 3, verbose=True)
ids = tokenizer.encode(text)
assert ids == [258, 100, 258, 97, 99]
The result is as follows:
merge 1/3: (97, 97) -> 256 (b'aa') had 4 occurrences
merge 2/3: (256, 97) -> 257 (b'aaa') had 2 occurrences
merge 3/3: (257, 98) -> 258 (b'aaab') had 2 occurrences
As you can see, the merges differ from the Wikipedia annotation, which should be corrected as follows:
for 3 merges will result in string:
"XdXac"
where:
X=Yb # original is ZY
Y=Za # original is ab
Z=aa
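As a sanity check, both rule sets expand "XdXac" back to the original string, which would explain why the final ids still match even though the intermediate definitions of X and Y differ. The sketch below uses a hypothetical `expand` helper written for this illustration; it is not part of the repo.

```python
def expand(s, rules):
    """Recursively replace each rule symbol in `s` with its definition."""
    out = []
    for ch in s:
        out.append(expand(rules[ch], rules) if ch in rules else ch)
    return "".join(out)


# Rules as annotated in the test (from Wikipedia)
wikipedia_rules = {"Z": "aa", "Y": "ab", "X": "ZY"}
# Rules as actually produced by the three merges shown above
observed_rules = {"Z": "aa", "Y": "Za", "X": "Yb"}

print(expand("XdXac", wikipedia_rules))
print(expand("XdXac", observed_rules))
```

In both cases X expands to "aaab"; only the way it is built differs (aa + ab versus aaa + b).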