Warning: Bad token in vocab at index xxx #11
Comments
Thanks for reporting @CheatCod! We are aware of the issue. Apparently, some models have tokens in the embedded vocabulary that use invalid UTF-8 codepoints. This is an error, but not an irrecoverable one. When we detect this, we simply print the warning and replace the broken tokens with the replacement character. The C++ code, on the other hand, simply ignores the issue (C++ strings are just byte arrays, while in Rust they're required to hold valid UTF-8), but I thought it would be a good idea to at least warn users. There is some discussion about this at #3. Some of us don't see the errors, while others do. If I had to take a guess, this could be due to a combination of the environment, OS, Python version, version of the conversion script, and who knows what else 🤔
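For illustration, here's a minimal Rust sketch of the behaviour described above (not the actual llama-rs code): check each vocabulary entry, warn when it isn't valid UTF-8, and fall back to a lossy conversion that substitutes the replacement character U+FFFD.

```rust
// Sketch only: warn on invalid UTF-8 vocab entries and substitute U+FFFD ('�').
fn load_token(index: usize, raw: &[u8]) -> String {
    match std::str::from_utf8(raw) {
        Ok(s) => s.to_owned(),
        Err(_) => {
            eprintln!("Warning: Bad token in vocab at index {index}");
            // from_utf8_lossy replaces invalid byte sequences with U+FFFD.
            String::from_utf8_lossy(raw).into_owned()
        }
    }
}

fn main() {
    // The second entry uses hypothetical bytes: 0xE2 0x96 on their own are
    // an incomplete (hence invalid) UTF-8 sequence.
    let vocab: [&[u8]; 2] = [b"hello", &[0xE2, 0x96]];
    for (index, raw) in vocab.iter().enumerate() {
        println!("{index}: {:?}", load_token(index, raw));
    }
}
```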
Yup, I'd say turning it into
I've downgraded it to
Not sure what to do about this. We often get reports about this issue, so leaving it open sounds like a good idea to avoid duplicates. But there's also very little we can do about those non-UTF-8 tokens in the vocabulary; they are almost certainly an error and affect lots of users. So, what should we do? One option is to downgrade the error even further, down to
I think leaving it as-is is fine: we can hopefully figure out what's generating those invalid tokens when we get to #21.
I think it's probably safe to remove the print entirely, or make it a one-time thing. From your research, we are already replacing the tokens with the Unicode replacement character, and the C++ version of this code doesn't handle them in any special way (I've noticed the C++ code sometimes segfaults for me, but this library never does). The issue seems to be caused upstream of llama-rs, and we have already handled it; the only remaining reports are about the printouts.
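If the "print it only once" route were taken, a minimal sketch (hypothetical, not current llama-rs code) could gate the warning behind `std::sync::Once` so it fires a single time no matter how many bad tokens the vocabulary contains:

```rust
use std::sync::Once;

// Emits the bad-token warning at most once per process.
static BAD_TOKEN_WARNING: Once = Once::new();

fn warn_bad_token(index: usize) {
    BAD_TOKEN_WARNING.call_once(|| {
        eprintln!(
            "Warning: vocabulary contains invalid UTF-8 tokens (first seen at index {index}); \
             they will be replaced with U+FFFD"
        );
    });
}

fn main() {
    for index in [3, 7, 42] {
        warn_bad_token(index); // only the first call actually prints
    }
}
```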
I'd like to investigate what's causing this. My feeling is that while the tokens themselves are not valid UTF-8, their use in the generated output is (e.g. two tokens form a valid string). I'm also curious if the newest llama.cpp tokeniser addresses this.
I checked my theory and it appears to be correct - the tokens are not guaranteed to be valid UTF-8 by themselves. With invalid tokens enabled:
With raw bytes:
I'm thinking we should return the raw byte slices, but offer a helper to coalesce tokens until they form valid UTF-8.
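A rough sketch of what such a helper could look like (the names `TokenCoalescer` and `push` are hypothetical, not an existing llama-rs API): raw token bytes are buffered, and text is only handed back once the accumulated bytes decode as valid UTF-8.

```rust
// Buffers raw token bytes until they form a valid UTF-8 string.
struct TokenCoalescer {
    buf: Vec<u8>,
}

impl TokenCoalescer {
    fn new() -> Self {
        TokenCoalescer { buf: Vec::new() }
    }

    /// Append one token's raw bytes; returns decoded text once the buffer
    /// is valid UTF-8, or None while it still ends mid-codepoint.
    /// NB: a fuller version would also handle bytes that can never become
    /// valid (see Utf8Error::error_len) instead of buffering them forever.
    fn push(&mut self, token_bytes: &[u8]) -> Option<String> {
        self.buf.extend_from_slice(token_bytes);
        match std::str::from_utf8(&self.buf) {
            Ok(s) => {
                let out = s.to_owned();
                self.buf.clear();
                Some(out)
            }
            Err(_) => None,
        }
    }
}

fn main() {
    // "é" is 0xC3 0xA9 in UTF-8; split across two hypothetical tokens,
    // neither half is valid on its own, but together they decode cleanly.
    let mut coalescer = TokenCoalescer::new();
    let tokens: [&[u8]; 3] = [&[0xC3], &[0xA9], b"!"];
    for token in tokens {
        if let Some(text) = coalescer.push(token) {
            print!("{text}");
        }
    }
    println!();
}
```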
After some discussion on the Discord, we came to the conclusion that the current behaviour is in fact a bug. The LLM infers over byte tokens, not UTF-8 tokens. That being said, we'd still like to make using the library easy, so I'm going to switch everything over to use
Running
cargo run --release -- -m ~/dev/llama.cpp/models/7B/ggml-model-f16.bin -f prompt
gives a bunch of "Warning: Bad token in vocab at index..." messages. The path points to a ggml-converted LLaMA model, which I have verified works with llama.cpp.