Mismatch in Vocabulary Size: Investigating Inconsistencies between Token-to-ID and ID-to-Token Dictionaries #413
Could this be related to the phenomenon of so-called glitch tokens? The research on those has focused on GPT-3, and I've yet to find any information on the extent to which this phenomenon exists in LLaMA models, or whether it exists at all. If I understood correctly, the root cause is the existence of tokens in the pre-training data which then (almost) never occur again during the training phase, resulting in a mismatch in the trained model's vocabulary between available tokens and usable tokens, breaking/glitching the model when one of those "available, but not usable" tokens is encountered. I'm just spitballing here though; maybe this isn't related at all.
Perhaps you are right. I have noticed that a lot of text related to Chinese has been incorrect until recently.
That might be related, since the LLaMA model is supposed to be English-only and, according to the LLaMA research paper [1], was even filtered to remove non-English text. Obviously the filtering wasn't perfect, since the model can produce (often very broken) multilingual output too. However, the specific phenomenon I'm talking about here isn't incorrect or nonsensical output, but that very specific tokens completely break the model: it gets stuck in a loop and/or completely garbles the following output.

The research on GPT-3 found examples of invalid data like usernames and copy-pasted debug logs being part of the pre-training data and ending up as "invalid" tokens in the model. "Invalid", for lack of a better word, refers to a token which the model doesn't know what to do with. You can google "SolidGoldMagikarp" (a Reddit username) or "PsyNetMessage" (found in Rocket League debug logs) to find out more. I'm not aware of any such tokens having been found in the LLaMA models yet, nor whether the same thing even happens with LLaMA models, but I guess we'll see soon enough.

[1]: LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023)
Probably an old hard-coded value: #142?
Looks like there are duplicate entries in the model file, which causes the token_to_id dictionary to have fewer unique entries. The total count of duplicates is 97, which matches the difference between the two dictionaries.
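In case it helps, here's a minimal sketch (a hypothetical helper, not the project's own code) of how one might scan the parsed vocabulary for token strings that already appeared at an earlier id; on the file in question such a scan reports 97 duplicates, matching 32,000 - 31,903:

```cpp
// Sketch: report token strings that repeat at a later id in the vocabulary.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

static size_t report_duplicate_tokens(const std::vector<std::string> & vocab) {
    std::unordered_map<std::string, size_t> first_id;
    size_t dupes = 0;
    for (size_t id = 0; id < vocab.size(); ++id) {
        auto it = first_id.find(vocab[id]);
        if (it == first_id.end()) {
            first_id[vocab[id]] = id;
        } else {
            ++dupes;
            printf("duplicate token '%s': first id = %zu, repeated at id = %zu\n",
                   vocab[id].c_str(), it->second, id);
        }
    }
    return dupes;
}

int main() {
    // Toy vocabulary: "A" and "B" reappear at the end, so 2 duplicates are reported.
    std::vector<std::string> vocab = {"A", "B", "C", "D", "A", "B"};
    printf("total duplicates: %zu\n", report_duplicate_tokens(vocab));
    return 0;
}
```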
Yep, looks like the ASCII set got duplicated somehow.
I think this might be the "bytes" fallback, used when something (like special Unicode) could not be parsed otherwise.
Not quite; those values are, I believe, at the beginning of the token set, while these ones show up at the end.
At least we are aware of the problem. We currently have this assertion.
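The assertion itself isn't quoted above; as a rough sketch of that kind of check (hypothetical names, not the actual code in the repository), it would compare the number of parsed vocabulary entries against the n_vocab value from the model's hyperparameters:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a vocab-size sanity check: the number of parsed entries in each
// direction should equal n_vocab from the model header.
static void check_vocab_sizes(std::size_t n_vocab,
                              std::size_t n_token_to_id,
                              std::size_t n_id_to_token) {
    assert(n_id_to_token == n_vocab && "id_to_token size != n_vocab");
    assert(n_token_to_id == n_vocab && "token_to_id size != n_vocab (duplicate token strings?)");
}

int main() {
    // With the sizes reported in this issue, the second assert fires:
    // 31,903 unique token strings vs. the expected 32,000.
    check_vocab_sizes(32000, 31903, 32000);
    return 0;
}
```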
This issue was closed because it has been inactive for 14 days since being marked as stale.
The total number of vocabulary items in the model file is 32k. When we parse them, there's a mismatch between token_to_id and id_to_token.
The size for token_to_id is: 31,903
The size for id_to_token is: 32,000
I'm curious why there's a mismatch. Are some token IDs reserved, or were there errors during pre-processing?
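For reference, a minimal sketch (not the actual loader code) of how building both dictionaries from a token list that contains repeated strings produces exactly this kind of size difference:

```cpp
// Minimal sketch: the id-keyed map keeps every entry, but a repeated token
// string overwrites its earlier slot in the string-keyed map, so token_to_id
// ends up smaller than id_to_token.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Toy vocabulary with one repeated token string ("A" appears twice).
    std::vector<std::string> vocab = {"hello", "world", "A", "foo", "A"};

    std::unordered_map<std::string, int> token_to_id;
    std::unordered_map<int, std::string> id_to_token;

    for (int i = 0; i < (int) vocab.size(); ++i) {
        token_to_id[vocab[i]] = i;   // duplicate string: earlier id is overwritten
        id_to_token[i] = vocab[i];   // ids are unique, nothing is lost here
    }

    // Prints "token_to_id: 4, id_to_token: 5"; with 97 duplicated strings in a
    // 32,000-entry vocabulary the same effect yields 31,903 vs 32,000.
    printf("token_to_id: %zu, id_to_token: %zu\n",
           token_to_id.size(), id_to_token.size());
    return 0;
}
```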