Warning: Bad token in vocab at index xxx #11
Comments
Thanks for reporting @CheatCod! We are aware of the issue. Apparently, some models have tokens in the embedded vocabulary that use invalid UTF-8 codepoints. This is an error, but not an irrecoverable one. When we detect this, we simply print the warning and replace the broken tokens with the replacement character. The C++ code, on the other hand, simply ignores the issue (C++ strings are just byte arrays, while in Rust they're required to hold valid UTF-8), but I thought it would be a good idea to at least warn users. There is some discussion about this at #3. Some of us don't see the errors, while others do. If I had to take a guess, this could be due to a combination of the environment, OS, Python version, version of the conversion script, and who knows what else 🤔
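For illustration, here's a minimal Rust sketch of the behaviour described above (not the actual llama-rs code): check each vocabulary entry, warn when it isn't valid UTF-8, and fall back to a lossy conversion that substitutes the replacement character U+FFFD.

```rust
// Sketch only: warn on invalid UTF-8 vocab entries and substitute U+FFFD ('�').
fn load_token(index: usize, raw: &[u8]) -> String {
    match std::str::from_utf8(raw) {
        Ok(s) => s.to_owned(),
        Err(_) => {
            eprintln!("Warning: Bad token in vocab at index {index}");
            // from_utf8_lossy replaces invalid byte sequences with U+FFFD.
            String::from_utf8_lossy(raw).into_owned()
        }
    }
}

fn main() {
    // The second entry uses hypothetical bytes: 0xE2 0x96 on their own are
    // an incomplete (hence invalid) UTF-8 sequence.
    let vocab: [&[u8]; 2] = [b"hello", &[0xE2, 0x96]];
    for (index, raw) in vocab.iter().enumerate() {
        println!("{index}: {:?}", load_token(index, raw));
    }
}
```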
Yup, I'd say turning it into
I've downgraded it to
Not sure what to do about this. We often get reports about this issue, so leaving it open sounds like a good idea to avoid duplicates. But there's also very little we can do about those non-UTF-8 tokens in the vocabulary; they are almost certainly an error and affect lots of users. So, what should we do? One option is to downgrade the error even further, down to
I think leaving it as-is is fine: we can hopefully figure out what's generating those invalid tokens when we get to #21.
I think it's probably safe to remove the print entirely, or make it a one-time thing. From your research, we are already replacing the tokens with the Unicode replacement character, and the C++ version of this code doesn't handle them in any special way (I've noticed the C++ code sometimes segfaults for me, but this library never does). The issue seems to be caused upstream of llama-rs, and we have already handled it; the only remaining reports are about the printouts.
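If the "print it only once" route were taken, a minimal sketch (hypothetical, not current llama-rs code) could gate the warning behind `std::sync::Once` so it fires a single time no matter how many bad tokens the vocabulary contains:

```rust
use std::sync::Once;

// Emits the bad-token warning at most once per process.
static BAD_TOKEN_WARNING: Once = Once::new();

fn warn_bad_token(index: usize) {
    BAD_TOKEN_WARNING.call_once(|| {
        eprintln!(
            "Warning: vocabulary contains invalid UTF-8 tokens (first seen at index {index}); \
             they will be replaced with U+FFFD"
        );
    });
}

fn main() {
    for index in [3, 7, 42] {
        warn_bad_token(index); // only the first call actually prints
    }
}
```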
I'd like to investigate what's causing this. My feeling is that while the tokens themselves are not valid UTF-8, their use in the generated output is (e.g. two tokens form a valid string). I'm also curious if the newest llama.cpp tokeniser addresses this.
I checked my theory and it appears to be correct - the tokens are not guaranteed to be valid UTF-8 by themselves. With invalid tokens enabled:
With raw bytes:
I'm thinking we should return the raw byte slices, but offer a helper to coalesce tokens until they form valid UTF-8.
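A rough sketch of what such a helper could look like (the names `TokenCoalescer` and `push` are hypothetical, not an existing llama-rs API): raw token bytes are buffered, and text is only handed back once the accumulated bytes decode as valid UTF-8.

```rust
// Buffers raw token bytes until they form a valid UTF-8 string.
struct TokenCoalescer {
    buf: Vec<u8>,
}

impl TokenCoalescer {
    fn new() -> Self {
        TokenCoalescer { buf: Vec::new() }
    }

    /// Append one token's raw bytes; returns decoded text once the buffer
    /// is valid UTF-8, or None while it still ends mid-codepoint.
    /// NB: a fuller version would also handle bytes that can never become
    /// valid (see Utf8Error::error_len) instead of buffering them forever.
    fn push(&mut self, token_bytes: &[u8]) -> Option<String> {
        self.buf.extend_from_slice(token_bytes);
        match std::str::from_utf8(&self.buf) {
            Ok(s) => {
                let out = s.to_owned();
                self.buf.clear();
                Some(out)
            }
            Err(_) => None,
        }
    }
}

fn main() {
    // "é" is 0xC3 0xA9 in UTF-8; split across two hypothetical tokens,
    // neither half is valid on its own, but together they decode cleanly.
    let mut coalescer = TokenCoalescer::new();
    let tokens: [&[u8]; 3] = [&[0xC3], &[0xA9], b"!"];
    for token in tokens {
        if let Some(text) = coalescer.push(token) {
            print!("{text}");
        }
    }
    println!();
}
```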
After some discussion on the Discord, we came to the conclusion that the current behaviour is in fact a bug. The LLM infers over byte tokens, not UTF-8 tokens. That being said, we'd still like to make using the library easy, so I'm going to switch everything over to use
Running
cargo run --release -- -m ~/dev/llama.cpp/models/7B/ggml-model-f16.bin -f prompt
gives a bunch of "Warning: Bad token in vocab at index..." messages. The path points to a ggml-converted LLaMA model, which I have verified works with llama.cpp.