Hi,
we're observing an "issue" with the sentencepiece tokenizer, where multiple tokens have identical string decodings.
We've generated the vocab of size 32768 using the wikitext-103 dataset (trained with --split_by_unicode_script=True and --byte_fallback=True).
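The exact training command isn't reproduced in this issue; as context, a rough Python equivalent using only the options mentioned here might look like the sketch below (input path and model prefix are placeholders, not the names actually used):

import sentencepiece as spm

# Hypothetical reconstruction of the training call - paths/names are placeholders.
spm.SentencePieceTrainer.train(
    input='wikitext-103.txt',          # placeholder: concatenated wikitext-103 text
    model_prefix='wikitext103_32k',    # placeholder output name
    vocab_size=32768,
    split_by_unicode_script=True,
    byte_fallback=True,
)

# Load the resulting model for inspection.
sp = spm.SentencePieceProcessor(model_file='wikitext103_32k.model')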
After this we run the following inspection over every token ID:

for i in range(32768):
    print(i, bytes(sp.Decode([i]), 'utf-8'))
Here we observe 2 "issues":
1. 80 pairs of tokens represent the same single-character strings (see the full list below, and the reproduction sketch right after these two points). These tokens group into two sets with nearby IDs: the first set sits at low IDs (almost at the start of the vocab) and the second at very high IDs (almost at the end). In the tokenized dataset the tokenizer always "decides" to use only one token of each pair - usually the one with the higher ID.
2. 128 tokens all decode to the same b'\xef\xbf\xbd', which appears to be the Unicode replacement character (U+FFFD). We don't understand why it shows up, since we use --split_by_unicode_script=True. Or maybe those tokens are due to the --byte_fallback=True option?
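For reference, the duplicate list at the end of this issue can be reproduced with a short script along these lines (a sketch that only assumes the sp processor from above; special tokens may additionally show up as an empty-string group):

from collections import defaultdict

# Group token IDs by their decoded byte string and report groups of size > 1.
by_decoding = defaultdict(list)
for i in range(32768):
    by_decoding[bytes(sp.Decode([i]), 'utf-8')].append(i)

for decoded, ids in by_decoding.items():
    if len(ids) > 1:
        print('Duplicate tokens (sp.Decode=%r) %d with %s' % (decoded, ids[0], ids[1:]))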
Are both of these known and expected behaviors?
To us it seems counterintuitive and wasteful for the vocab to spend two tokens on the same symbol.
Thanks!
List of duplicate tokens:
b'X' stands for the decoded string (the result of the sp.Decode([i]) operation).
"A with [B]" means that the sp.Decode result for tokens A and B is identical.
Duplicate tokens (sp.Decode=b' ') 35 with [32685]
Duplicate tokens (sp.Decode=b'!') 36 with [32760]
Duplicate tokens (sp.Decode=b'"') 37 with [32710]
Duplicate tokens (sp.Decode=b'#') 38 with [32759]
Duplicate tokens (sp.Decode=b"'") 42 with [32711]
Duplicate tokens (sp.Decode=b'(') 43 with [32743]
Duplicate tokens (sp.Decode=b')') 44 with [32742]
Duplicate tokens (sp.Decode=b'*') 45 with [32725]
Duplicate tokens (sp.Decode=b',') 47 with [32706]
Duplicate tokens (sp.Decode=b'-') 48 with [32723]
Duplicate tokens (sp.Decode=b'.') 49 with [32705]
Duplicate tokens (sp.Decode=b'/') 50 with [32761]
Duplicate tokens (sp.Decode=b'0') 51 with [32733]
Duplicate tokens (sp.Decode=b'1') 52 with [32718]
Duplicate tokens (sp.Decode=b'2') 53 with [32728]
Duplicate tokens (sp.Decode=b'3') 54 with [32746]
Duplicate tokens (sp.Decode=b'4') 55 with [32749]
Duplicate tokens (sp.Decode=b'5') 56 with [32750]
Duplicate tokens (sp.Decode=b'6') 57 with [32752]
Duplicate tokens (sp.Decode=b'7') 58 with [32753]
Duplicate tokens (sp.Decode=b'8') 59 with [32751]
Duplicate tokens (sp.Decode=b'9') 60 with [32741]
Duplicate tokens (sp.Decode=b':') 61 with [32737]
Duplicate tokens (sp.Decode=b';') 62 with [32754]
Duplicate tokens (sp.Decode=b'?') 66 with [32738]
Duplicate tokens (sp.Decode=b'A') 68 with [32714]
Duplicate tokens (sp.Decode=b'B') 69 with [32722]
Duplicate tokens (sp.Decode=b'C') 70 with [32719]
Duplicate tokens (sp.Decode=b'D') 71 with [32730]
Duplicate tokens (sp.Decode=b'E') 72 with [32726]
Duplicate tokens (sp.Decode=b'F') 73 with [32734]
Duplicate tokens (sp.Decode=b'G') 74 with [32735]
Duplicate tokens (sp.Decode=b'H') 75 with [32716]
Duplicate tokens (sp.Decode=b'I') 76 with [32712]
Duplicate tokens (sp.Decode=b'J') 77 with [32744]
Duplicate tokens (sp.Decode=b'K') 78 with [32755]
Duplicate tokens (sp.Decode=b'L') 79 with [32729]
Duplicate tokens (sp.Decode=b'M') 80 with [32720]
Duplicate tokens (sp.Decode=b'N') 81 with [32731]
Duplicate tokens (sp.Decode=b'O') 82 with [32739]
Duplicate tokens (sp.Decode=b'P') 83 with [32727]
Duplicate tokens (sp.Decode=b'Q') 84 with [32763]
Duplicate tokens (sp.Decode=b'R') 85 with [32732]
Duplicate tokens (sp.Decode=b'S') 86 with [32715]
Duplicate tokens (sp.Decode=b'T') 87 with [32713]
Duplicate tokens (sp.Decode=b'U') 88 with [32757]
Duplicate tokens (sp.Decode=b'V') 89 with [32758]
Duplicate tokens (sp.Decode=b'W') 90 with [32724]
Duplicate tokens (sp.Decode=b'Y') 92 with [32748]
Duplicate tokens (sp.Decode=b'Z') 93 with [32764]
Duplicate tokens (sp.Decode=b'[') 94 with [32767]
Duplicate tokens (sp.Decode=b']') 96 with [32766]
Duplicate tokens (sp.Decode=b'_') 98 with [32717]
Duplicate tokens (sp.Decode=b'a') 100 with [32688]
Duplicate tokens (sp.Decode=b'b') 101 with [32707]
Duplicate tokens (sp.Decode=b'c') 102 with [32698]
Duplicate tokens (sp.Decode=b'd') 103 with [32696]
Duplicate tokens (sp.Decode=b'e') 104 with [32686]
Duplicate tokens (sp.Decode=b'f') 105 with [32701]
Duplicate tokens (sp.Decode=b'g') 106 with [32700]
Duplicate tokens (sp.Decode=b'h') 107 with [32694]
Duplicate tokens (sp.Decode=b'i') 108 with [32691]
Duplicate tokens (sp.Decode=b'j') 109 with [32736]
Duplicate tokens (sp.Decode=b'k') 110 with [32709]
Duplicate tokens (sp.Decode=b'l') 111 with [32695]
Duplicate tokens (sp.Decode=b'm') 112 with [32699]
Duplicate tokens (sp.Decode=b'n') 113 with [32690]
Duplicate tokens (sp.Decode=b'o') 114 with [32689]
Duplicate tokens (sp.Decode=b'p') 115 with [32704]
Duplicate tokens (sp.Decode=b'q') 116 with [32745]
Duplicate tokens (sp.Decode=b'r') 117 with [32693]
Duplicate tokens (sp.Decode=b's') 118 with [32692]
Duplicate tokens (sp.Decode=b't') 119 with [32687]
Duplicate tokens (sp.Decode=b'u') 120 with [32697]
Duplicate tokens (sp.Decode=b'v') 121 with [32708]
Duplicate tokens (sp.Decode=b'w') 122 with [32702]
Duplicate tokens (sp.Decode=b'x') 123 with [32721]
Duplicate tokens (sp.Decode=b'y') 124 with [32703]
Duplicate tokens (sp.Decode=b'z') 125 with [32740]
Duplicate tokens (sp.Decode=b'|') 127 with [32762]
Duplicate tokens (sp.Decode=b'\xef\xbf\xbd') 131 with [132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258]
The --byte_fallback=True option might be the problem.
In that case, those tokens are not actually identical.
One encodes a single character in the ASCII range; the other may have the same byte representation but is only used for byte-level (Unicode) fallback.
The way sentencepiece works, the fallback tokens would not have a surface representation, so text would not be tokenized to them in the same way.
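One way to see this distinction is to look at the pieces rather than the decodings (a sketch; the token IDs are taken from the pair list above, and it assumes the IsByte()/IdToPiece() helpers of the sentencepiece Python bindings):

# For the b'!' pair from the list above: ID 36 (low) vs ID 32760 (high).
# Byte-fallback pieces look like '<0x21>' and are flagged by IsByte();
# the learned piece carries the actual surface form.
for i in (36, 32760):
    print(i, sp.IdToPiece(i), sp.IsByte(i))
# Likely output: the low ID prints a piece like '<0x21>' with IsByte True,
# while the high ID prints '!' with IsByte False. Encoding plain text then
# prefers the learned piece; the byte pieces are reserved for falling back
# on characters that have no piece of their own.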