Hi,
we're observing an "issue" with the sentencepiece tokenizer, where multiple tokens have identical string decodings.
We've generated the vocab of size 32768 using the wikitext-103 dataset (trained with --split_by_unicode_script=True and --byte_fallback=True).
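The exact training command isn't reproduced in this issue; as context, a rough Python equivalent using only the options mentioned here might look like the sketch below (input path and model prefix are placeholders, not the names actually used):

import sentencepiece as spm

# Hypothetical reconstruction of the training call - paths/names are placeholders.
spm.SentencePieceTrainer.train(
    input='wikitext-103.txt',          # placeholder: concatenated wikitext-103 text
    model_prefix='wikitext103_32k',    # placeholder output name
    vocab_size=32768,
    split_by_unicode_script=True,
    byte_fallback=True,
)

# Load the resulting model for inspection.
sp = spm.SentencePieceProcessor(model_file='wikitext103_32k.model')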
After this we run the following inspection over every token ID:

for i in range(32768):
    print(i, bytes(sp.Decode([i]), 'utf-8'))
Here we observe 2 "issues":
1. 80 pairs of tokens represent the same single-character strings (see the full list below, and the reproduction sketch right after these two points). These tokens group into two sets with nearby IDs: the first set sits at low IDs (almost at the start of the vocab) and the second at very high IDs (almost at the end). In the tokenized dataset the tokenizer always "decides" to use only one token of each pair - usually the one with the higher ID.
2. 128 tokens all decode to the same b'\xef\xbf\xbd', which appears to be the Unicode replacement character (U+FFFD). We don't understand why it shows up, since we use --split_by_unicode_script=True. Or maybe those tokens are due to the --byte_fallback=True option?
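For reference, the duplicate list at the end of this issue can be reproduced with a short script along these lines (a sketch that only assumes the sp processor from above; special tokens may additionally show up as an empty-string group):

from collections import defaultdict

# Group token IDs by their decoded byte string and report groups of size > 1.
by_decoding = defaultdict(list)
for i in range(32768):
    by_decoding[bytes(sp.Decode([i]), 'utf-8')].append(i)

for decoded, ids in by_decoding.items():
    if len(ids) > 1:
        print('Duplicate tokens (sp.Decode=%r) %d with %s' % (decoded, ids[0], ids[1:]))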
Are both of these known and expected behaviors?
To us it seems counterintuitive and wasteful for the vocab to spend two tokens on the same symbol.
Thanks!
List of duplicate tokens:
b'X' stands for the decoded string (the result of the sp.Decode([i]) operation).
"A with [B]" means that the sp.Decode result for tokens A and B is identical.
Duplicate tokens (sp.Decode=b' ') 35 with [32685]
Duplicate tokens (sp.Decode=b'!') 36 with [32760]
Duplicate tokens (sp.Decode=b'"') 37 with [32710]
Duplicate tokens (sp.Decode=b'#') 38 with [32759]
Duplicate tokens (sp.Decode=b"'") 42 with [32711]
Duplicate tokens (sp.Decode=b'(') 43 with [32743]
Duplicate tokens (sp.Decode=b')') 44 with [32742]
Duplicate tokens (sp.Decode=b'*') 45 with [32725]
Duplicate tokens (sp.Decode=b',') 47 with [32706]
Duplicate tokens (sp.Decode=b'-') 48 with [32723]
Duplicate tokens (sp.Decode=b'.') 49 with [32705]
Duplicate tokens (sp.Decode=b'/') 50 with [32761]
Duplicate tokens (sp.Decode=b'0') 51 with [32733]
Duplicate tokens (sp.Decode=b'1') 52 with [32718]
Duplicate tokens (sp.Decode=b'2') 53 with [32728]
Duplicate tokens (sp.Decode=b'3') 54 with [32746]
Duplicate tokens (sp.Decode=b'4') 55 with [32749]
Duplicate tokens (sp.Decode=b'5') 56 with [32750]
Duplicate tokens (sp.Decode=b'6') 57 with [32752]
Duplicate tokens (sp.Decode=b'7') 58 with [32753]
Duplicate tokens (sp.Decode=b'8') 59 with [32751]
Duplicate tokens (sp.Decode=b'9') 60 with [32741]
Duplicate tokens (sp.Decode=b':') 61 with [32737]
Duplicate tokens (sp.Decode=b';') 62 with [32754]
Duplicate tokens (sp.Decode=b'?') 66 with [32738]
Duplicate tokens (sp.Decode=b'A') 68 with [32714]
Duplicate tokens (sp.Decode=b'B') 69 with [32722]
Duplicate tokens (sp.Decode=b'C') 70 with [32719]
Duplicate tokens (sp.Decode=b'D') 71 with [32730]
Duplicate tokens (sp.Decode=b'E') 72 with [32726]
Duplicate tokens (sp.Decode=b'F') 73 with [32734]
Duplicate tokens (sp.Decode=b'G') 74 with [32735]
Duplicate tokens (sp.Decode=b'H') 75 with [32716]
Duplicate tokens (sp.Decode=b'I') 76 with [32712]
Duplicate tokens (sp.Decode=b'J') 77 with [32744]
Duplicate tokens (sp.Decode=b'K') 78 with [32755]
Duplicate tokens (sp.Decode=b'L') 79 with [32729]
Duplicate tokens (sp.Decode=b'M') 80 with [32720]
Duplicate tokens (sp.Decode=b'N') 81 with [32731]
Duplicate tokens (sp.Decode=b'O') 82 with [32739]
Duplicate tokens (sp.Decode=b'P') 83 with [32727]
Duplicate tokens (sp.Decode=b'Q') 84 with [32763]
Duplicate tokens (sp.Decode=b'R') 85 with [32732]
Duplicate tokens (sp.Decode=b'S') 86 with [32715]
Duplicate tokens (sp.Decode=b'T') 87 with [32713]
Duplicate tokens (sp.Decode=b'U') 88 with [32757]
Duplicate tokens (sp.Decode=b'V') 89 with [32758]
Duplicate tokens (sp.Decode=b'W') 90 with [32724]
Duplicate tokens (sp.Decode=b'Y') 92 with [32748]
Duplicate tokens (sp.Decode=b'Z') 93 with [32764]
Duplicate tokens (sp.Decode=b'[') 94 with [32767]
Duplicate tokens (sp.Decode=b']') 96 with [32766]
Duplicate tokens (sp.Decode=b'_') 98 with [32717]
Duplicate tokens (sp.Decode=b'a') 100 with [32688]
Duplicate tokens (sp.Decode=b'b') 101 with [32707]
Duplicate tokens (sp.Decode=b'c') 102 with [32698]
Duplicate tokens (sp.Decode=b'd') 103 with [32696]
Duplicate tokens (sp.Decode=b'e') 104 with [32686]
Duplicate tokens (sp.Decode=b'f') 105 with [32701]
Duplicate tokens (sp.Decode=b'g') 106 with [32700]
Duplicate tokens (sp.Decode=b'h') 107 with [32694]
Duplicate tokens (sp.Decode=b'i') 108 with [32691]
Duplicate tokens (sp.Decode=b'j') 109 with [32736]
Duplicate tokens (sp.Decode=b'k') 110 with [32709]
Duplicate tokens (sp.Decode=b'l') 111 with [32695]
Duplicate tokens (sp.Decode=b'm') 112 with [32699]
Duplicate tokens (sp.Decode=b'n') 113 with [32690]
Duplicate tokens (sp.Decode=b'o') 114 with [32689]
Duplicate tokens (sp.Decode=b'p') 115 with [32704]
Duplicate tokens (sp.Decode=b'q') 116 with [32745]
Duplicate tokens (sp.Decode=b'r') 117 with [32693]
Duplicate tokens (sp.Decode=b's') 118 with [32692]
Duplicate tokens (sp.Decode=b't') 119 with [32687]
Duplicate tokens (sp.Decode=b'u') 120 with [32697]
Duplicate tokens (sp.Decode=b'v') 121 with [32708]
Duplicate tokens (sp.Decode=b'w') 122 with [32702]
Duplicate tokens (sp.Decode=b'x') 123 with [32721]
Duplicate tokens (sp.Decode=b'y') 124 with [32703]
Duplicate tokens (sp.Decode=b'z') 125 with [32740]
Duplicate tokens (sp.Decode=b'|') 127 with [32762]
Duplicate tokens (sp.Decode=b'\xef\xbf\xbd') 131 with [132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258]
The --byte_fallback=True option might be the problem.
In that case, those tokens are not actually identical.
One encodes a single character in the ASCII range; the other may have the same byte representation but is only used for byte-level (Unicode) fallback.
The way sentencepiece works, the fallback tokens would not have a surface representation, so text would not be tokenized to them in the same way.
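One way to see this distinction is to look at the pieces rather than the decodings (a sketch; the token IDs are taken from the pair list above, and it assumes the IsByte()/IdToPiece() helpers of the sentencepiece Python bindings):

# For the b'!' pair from the list above: ID 36 (low) vs ID 32760 (high).
# Byte-fallback pieces look like '<0x21>' and are flagged by IsByte();
# the learned piece carries the actual surface form.
for i in (36, 32760):
    print(i, sp.IdToPiece(i), sp.IsByte(i))
# Likely output: the low ID prints a piece like '<0x21>' with IsByte True,
# while the high ID prints '!' with IsByte False. Encoding plain text then
# prefers the learned piece; the byte pieces are reserved for falling back
# on characters that have no piece of their own.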