Mismatch in Vocabulary Size: Investigating Inconsistencies between Token-to-ID and ID-to-Token Dictionaries #413
Could this be related to the phenomenon of so-called glitch tokens? The research on those has focused on GPT-3, and I've yet to find any information on the extent to which this phenomenon exists in LLaMA models, or whether it exists at all. If I understood correctly, the root cause is the existence of tokens in the pre-training data which then (almost) never occur again during the training phase, resulting in a mismatch in the trained model's vocabulary between available tokens and usable tokens, breaking/glitching the model when one of those "available, but not usable" tokens is encountered. I'm just spitballing here though; maybe this isn't related at all.
Perhaps you are right. I have noticed that a lot of text related to Chinese has been incorrect until recently.
That might be related, since the LLaMA model is supposed to be English-only and, according to the LLaMA research paper [1], was even filtered to remove non-English text. Obviously the filtering wasn't perfect, since the model can produce (often very broken) multilingual output too. However, the specific phenomenon I'm talking about here isn't incorrect or nonsensical output, but that very specific tokens completely break the model: it gets stuck in a loop and/or completely garbles the following output.

The research on GPT-3 found examples of invalid data like usernames and copy-pasted debug logs being part of the pre-training data and ending up as "invalid" tokens in the model. "Invalid", for lack of a better word, refers to a token which the model doesn't know what to do with. You can google "SolidGoldMagikarp" (a Reddit username) or "PsyNetMessage" (found in Rocket League debug logs) to find out more. I'm not aware of any such tokens having been found in the LLaMA models yet, nor whether the same thing even happens with LLaMA models, but I guess we'll see soon enough.

[1]: LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
Probably an old hard-coded value: #142?
Looks like there are duplicate entries in the model file, which causes the token_to_id dictionary to have fewer unique entries. The total count of duplicates is 97, which matches the difference between the two dictionaries.
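In case it helps, here's a minimal sketch (a hypothetical helper, not the project's own code) of how one might scan the parsed vocabulary for token strings that already appeared at an earlier id; on the file in question such a scan reports 97 duplicates, matching 32,000 - 31,903:

```cpp
// Sketch: report token strings that repeat at a later id in the vocabulary.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

static size_t report_duplicate_tokens(const std::vector<std::string> & vocab) {
    std::unordered_map<std::string, size_t> first_id;
    size_t dupes = 0;
    for (size_t id = 0; id < vocab.size(); ++id) {
        auto it = first_id.find(vocab[id]);
        if (it == first_id.end()) {
            first_id[vocab[id]] = id;
        } else {
            ++dupes;
            printf("duplicate token '%s': first id = %zu, repeated at id = %zu\n",
                   vocab[id].c_str(), it->second, id);
        }
    }
    return dupes;
}

int main() {
    // Toy vocabulary: "A" and "B" reappear at the end, so 2 duplicates are reported.
    std::vector<std::string> vocab = {"A", "B", "C", "D", "A", "B"};
    printf("total duplicates: %zu\n", report_duplicate_tokens(vocab));
    return 0;
}
```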
Yep, looks like the ASCII set got duplicated somehow.
I think this might be the "bytes" fallback, used when something (like special Unicode) could not be parsed otherwise.
Not quite; those values are, I believe, at the beginning of the token set, while these ones show up at the end.
At least we are aware of the problem. We currently have this assertion.
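The assertion itself isn't quoted above; as a rough sketch of that kind of check (hypothetical names, not the actual code in the repository), it would compare the number of parsed vocabulary entries against the n_vocab value from the model's hyperparameters:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a vocab-size sanity check: the number of parsed entries in each
// direction should equal n_vocab from the model header.
static void check_vocab_sizes(std::size_t n_vocab,
                              std::size_t n_token_to_id,
                              std::size_t n_id_to_token) {
    assert(n_id_to_token == n_vocab && "id_to_token size != n_vocab");
    assert(n_token_to_id == n_vocab && "token_to_id size != n_vocab (duplicate token strings?)");
}

int main() {
    // With the sizes reported in this issue, the second assert fires:
    // 31,903 unique token strings vs. the expected 32,000.
    check_vocab_sizes(32000, 31903, 32000);
    return 0;
}
```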
This issue was closed because it has been inactive for 14 days since being marked as stale.
The total number of vocabulary items in the model file is 32k. When we parse them, there's a mismatch between token_to_id and id_to_token.
The size for token_to_id is: 31,903
The size for id_to_token is: 32,000
I'm curious why there's a mismatch. Are some token IDs reserved, or were there errors during pre-processing?
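For reference, a minimal sketch (not the actual loader code) of how building both dictionaries from a token list that contains repeated strings produces exactly this kind of size difference:

```cpp
// Minimal sketch: the id-keyed map keeps every entry, but a repeated token
// string overwrites its earlier slot in the string-keyed map, so token_to_id
// ends up smaller than id_to_token.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Toy vocabulary with one repeated token string ("A" appears twice).
    std::vector<std::string> vocab = {"hello", "world", "A", "foo", "A"};

    std::unordered_map<std::string, int> token_to_id;
    std::unordered_map<int, std::string> id_to_token;

    for (int i = 0; i < (int) vocab.size(); ++i) {
        token_to_id[vocab[i]] = i;   // duplicate string: earlier id is overwritten
        id_to_token[i] = vocab[i];   // ids are unique, nothing is lost here
    }

    // Prints "token_to_id: 4, id_to_token: 5"; with 97 duplicated strings in a
    // 32,000-entry vocabulary the same effect yields 31,903 vs 32,000.
    printf("token_to_id: %zu, id_to_token: %zu\n",
           token_to_id.size(), id_to_token.size());
    return 0;
}
```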