Mismatch in Vocabulary Size: Investigating Inconsistencies between Token-to-ID and ID-to-Token Dictionaries #413

Closed
nullhook opened this issue Mar 23, 2023 · 10 comments
Labels
question (Further information is requested), stale

Comments

@nullhook

The total number of vocabulary items in the model file is 32k, but when we parse them, the sizes of token_to_id and id_to_token don't match.

The size for token_to_id is: 31,903
The size for id_to_token is: 32,000

I'm curious why there's a mismatch. Are some token IDs reserved, or were there errors during pre-processing?
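A minimal sketch of how this kind of mismatch can arise (names are hypothetical, this is not the actual llama.cpp loader): if the model file stores one vocabulary entry per ID but the same token string occurs more than once, a dictionary keyed by token text silently drops the earlier IDs.

```python
# Hypothetical sketch, not the actual loader: build both dictionaries from a
# token list indexed by ID and watch the sizes diverge when a string repeats.
def build_vocab(tokens):
    id_to_token = {i: tok for i, tok in enumerate(tokens)}
    # Later duplicates overwrite earlier ones, so this dict ends up smaller.
    token_to_id = {tok: i for i, tok in enumerate(tokens)}
    return token_to_id, id_to_token

tokens = ["<unk>", "<s>", "</s>", "e", "t", "e"]  # "e" appears twice (made-up data)
token_to_id, id_to_token = build_vocab(tokens)
print(len(id_to_token), len(token_to_id))  # 6 vs 5 -- same shape as 32,000 vs 31,903
```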

@anzz1
Contributor

anzz1 commented Mar 23, 2023

Could this be related to the phenomenon of so-called glitch tokens? The research on those has focused on GPT-3, and I've yet to find any information on the extent to which this phenomenon exists in LLaMA models, or whether it exists at all.

If I understood correctly, the root cause of the phenomenon is tokens that make it into the vocabulary from the pre-training data but then (almost) never occur during the training phase, resulting in a mismatch in the trained model's vocabulary between available tokens and usable tokens and breaking/glitching the model when one of those "available, but not usable" tokens is encountered.

I'm just spitballing here though, maybe this isn't related at all.

@wizd

wizd commented Mar 23, 2023

> Could this be related to the phenomenon of so-called glitch tokens? The research on those has focused on GPT-3, and I've yet to find any information on the extent to which this phenomenon exists in LLaMA models, or whether it exists at all.
>
> If I understood correctly, the root cause of the phenomenon is tokens that make it into the vocabulary from the pre-training data but then (almost) never occur during the training phase, resulting in a mismatch in the trained model's vocabulary between available tokens and usable tokens and breaking/glitching the model when one of those "available, but not usable" tokens is encountered.
>
> I'm just spitballing here though, maybe this isn't related at all.

Perhaps you are right. I have noticed that a lot of text related to Chinese has been incorrect until recently.

@anzz1
Contributor

anzz1 commented Mar 23, 2023

> Perhaps you are right. I have noticed that a lot of text related to Chinese has been incorrect until recently.

That might be related, since the LLaMA model is supposed to be English-only and, according to the LLaMA research paper [1], the training data was even filtered to remove non-English text. Obviously the filtering wasn't perfect, since the model can produce (often very broken) multilingual output too.

However, the specific phenomenon I'm talking about here isn't incorrect or nonsensical output, but that very specific tokens completely break the model: it gets stuck in a loop and/or completely garbles the following output. The research on GPT-3 has found examples of invalid data, like usernames and copy-pasted debug logs, being part of the pre-training data and ending up as "invalid" tokens in the model. "Invalid", for lack of a better word, refers to a token the model doesn't know what to do with.

You can google "SolidGoldMagikarp" (a reddit username) or "PsyNetMessage" (found in Rocket League debug logs) to find out more.

I'm not aware of any such tokens having been found in the LLaMA models yet, nor whether the same thing happens with LLaMA models at all, but I guess we'll see soon enough.

[1]: LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)

@gjmulder gjmulder added the question Further information is requested label Mar 23, 2023
@sw
Contributor

sw commented Mar 23, 2023

Probably an old hard-coded value: #142?

@nullhook
Author

nullhook commented Mar 23, 2023

Looks like there are duplicate entries in the model file, which causes the token_to_id dictionary to have fewer unique entries.

The total count of duplicates is 97, which matches the size difference between the two dictionaries. (A sketch of this kind of duplicate check follows the list below.)

Duplicate word: '', ID: 2, count: 1
Duplicate word: ' ', ID: 29871, count: 2
Duplicate word: 'e', ID: 29872, count: 3
Duplicate word: 't', ID: 29873, count: 4
Duplicate word: 'a', ID: 29874, count: 5
Duplicate word: 'i', ID: 29875, count: 6
Duplicate word: 'n', ID: 29876, count: 7
Duplicate word: 'o', ID: 29877, count: 8
Duplicate word: 'r', ID: 29878, count: 9
Duplicate word: 's', ID: 29879, count: 10
Duplicate word: 'l', ID: 29880, count: 11
Duplicate word: 'd', ID: 29881, count: 12
Duplicate word: 'h', ID: 29882, count: 13
Duplicate word: 'c', ID: 29883, count: 14
Duplicate word: 'u', ID: 29884, count: 15
Duplicate word: 'm', ID: 29885, count: 16
Duplicate word: 'p', ID: 29886, count: 17
Duplicate word: 'g', ID: 29887, count: 18
Duplicate word: 'f', ID: 29888, count: 19
Duplicate word: '.', ID: 29889, count: 20
Duplicate word: 'b', ID: 29890, count: 21
Duplicate word: 'y', ID: 29891, count: 22
Duplicate word: ',', ID: 29892, count: 23
Duplicate word: 'w', ID: 29893, count: 24
Duplicate word: 'v', ID: 29894, count: 25
Duplicate word: 'k', ID: 29895, count: 26
Duplicate word: '1', ID: 29896, count: 27
Duplicate word: ')', ID: 29897, count: 28
Duplicate word: '(', ID: 29898, count: 29
Duplicate word: '-', ID: 29899, count: 30
Duplicate word: '0', ID: 29900, count: 31
Duplicate word: ':', ID: 29901, count: 32
Duplicate word: 'I', ID: 29902, count: 33
Duplicate word: 'S', ID: 29903, count: 34
Duplicate word: '\', ID: 29905, count: 35
Duplicate word: '2', ID: 29906, count: 36
Duplicate word: 'C', ID: 29907, count: 37
Duplicate word: '"', ID: 29908, count: 38
Duplicate word: 'A', ID: 29909, count: 39
Duplicate word: 'T', ID: 29911, count: 40
Duplicate word: '{', ID: 29912, count: 41
Duplicate word: '}', ID: 29913, count: 42
Duplicate word: '/', ID: 29914, count: 43
Duplicate word: ''', ID: 29915, count: 44
Duplicate word: 'x', ID: 29916, count: 45
Duplicate word: '_', ID: 29918, count: 46
Duplicate word: 'z', ID: 29920, count: 47
Duplicate word: '=', ID: 29922, count: 48
Duplicate word: 'E', ID: 29923, count: 49
Duplicate word: 'M', ID: 29924, count: 50
Duplicate word: 'P', ID: 29925, count: 51
Duplicate word: 'j', ID: 29926, count: 52
Duplicate word: 'D', ID: 29928, count: 53
Duplicate word: '9', ID: 29929, count: 54
Duplicate word: '*', ID: 29930, count: 55
Duplicate word: 'L', ID: 29931, count: 56
Duplicate word: 'B', ID: 29933, count: 57
Duplicate word: 'R', ID: 29934, count: 58
Duplicate word: ';', ID: 29936, count: 59
Duplicate word: '#', ID: 29937, count: 60
Duplicate word: '$', ID: 29938, count: 61
Duplicate word: 'q', ID: 29939, count: 62
Duplicate word: 'N', ID: 29940, count: 63
Duplicate word: '3', ID: 29941, count: 64
Duplicate word: 'F', ID: 29943, count: 65
Duplicate word: '5', ID: 29945, count: 66
Duplicate word: '4', ID: 29946, count: 67
Duplicate word: '8', ID: 29947, count: 68
Duplicate word: 'O', ID: 29949, count: 69
Duplicate word: 'H', ID: 29950, count: 70
Duplicate word: '`', ID: 29952, count: 71
Duplicate word: '6', ID: 29953, count: 72
Duplicate word: 'G', ID: 29954, count: 73
Duplicate word: '7', ID: 29955, count: 74
Duplicate word: 'W', ID: 29956, count: 75
Duplicate word: '>', ID: 29958, count: 76
Duplicate word: '[', ID: 29961, count: 77
Duplicate word: ']', ID: 29962, count: 78
Duplicate word: 'V', ID: 29963, count: 79
Duplicate word: 'U', ID: 29965, count: 80
Duplicate word: '<', ID: 29966, count: 81
Duplicate word: 'J', ID: 29967, count: 82
Duplicate word: 'K', ID: 29968, count: 83
Duplicate word: '?', ID: 29973, count: 84
Duplicate word: '+', ID: 29974, count: 85
Duplicate word: 'Y', ID: 29979, count: 86
Duplicate word: 'Q', ID: 29984, count: 87
Duplicate word: '^', ID: 29985, count: 88
Duplicate word: '&', ID: 29987, count: 89
Duplicate word: '|', ID: 29989, count: 90
Duplicate word: 'X', ID: 29990, count: 91
Duplicate word: '!', ID: 29991, count: 92
Duplicate word: '@', ID: 29992, count: 93
Duplicate word: '%', ID: 29995, count: 94
Duplicate word: 'Z', ID: 29999, count: 95
Duplicate word: '<unprintable character>', ID: 30004, count: 96
Duplicate word: '~', ID: 30022, count: 97
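For reference, a sketch of the kind of check used for the list above (the model-file loader itself is not shown; `tokens` is assumed to be the list of token strings parsed from the model file in ID order):

```python
# Hypothetical helper: walk the parsed token list and report every string
# that has already been seen under an earlier ID.
def report_duplicates(tokens):
    first_seen = {}
    dupes = 0
    for i, tok in enumerate(tokens):
        if tok in first_seen:
            dupes += 1
            print(f"Duplicate word: {tok!r}, ID: {i}, count: {dupes}")
        else:
            first_seen[tok] = i
    print(f"{dupes} duplicates; {len(first_seen)} unique of {len(tokens)} total")

# Tiny demo with a made-up vocab; the real run reports 97 duplicates.
report_duplicates(["<s>", "e", "t", "e"])  # -> Duplicate word: 'e', ID: 3, count: 1
```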

@j-f1
Collaborator

j-f1 commented Mar 23, 2023

Yep, looks like the ASCII set got duplicated somehow.

@Green-Sky
Collaborator

I think this might be the "bytes" fallback, used when something (like special unicode) could not be parsed otherwise.

@j-f1
Collaborator

j-f1 commented Mar 23, 2023

Not quite — those values are, I believe, at the beginning of the token set while these ones show up at the end.
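Both observations can be checked against the original SentencePiece model. A sketch, assuming the original tokenizer.model is available (the path is hypothetical) and using the sentencepiece Python package: the byte-fallback pieces ("<0x00>".."<0xFF>") sit near the start of the vocabulary, while the plain single-character pieces sit near the end, so a converter that decodes byte pieces to their literal characters would produce exactly this kind of collision.

```python
# Sketch assuming the original SentencePiece tokenizer.model is on hand (path
# hypothetical). It locates the byte-fallback piece "<0x65>" and the plain
# piece "e": two different IDs that collapse to the same string if the byte
# piece is decoded to its literal character during conversion.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

byte_ids = [i for i in range(sp.get_piece_size()) if sp.is_byte(i)]
print(min(byte_ids), max(byte_ids))   # byte pieces live near the start of the vocab

e_byte = next(i for i in byte_ids if sp.id_to_piece(i) == "<0x65>")
e_plain = sp.piece_to_id("e")         # plain "e" piece, near the end of the vocab
print(e_byte, e_plain)                # two IDs, one resulting string after decoding
```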

@goerch
Collaborator

goerch commented Oct 6, 2023

At least we are aware of the problem. We currently have this assertion.
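A rough sketch of the kind of consistency such an assertion could enforce (written here in Python with hypothetical names, not the actual llama.cpp code):

```python
# Hypothetical sketch, not the actual assertion: a vocabulary is consistent
# only if every ID round-trips through token_to_id back to itself, which
# fails as soon as two IDs share the same token text.
def check_vocab(id_to_token, token_to_id):
    assert len(id_to_token) == len(token_to_id), "duplicate token strings in vocab"
    for i, tok in id_to_token.items():
        assert token_to_id[tok] == i, f"IDs {i} and {token_to_id[tok]} share token {tok!r}"
```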

@github-actions github-actions bot added the stale label Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
